arXiv Papers Today | 2024-12-03

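This post lists the latest papers retrieved from arXiv.org on 2024-12-03. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.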

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

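[Quick read]: This paper addresses fake news detection. The key to its solution is to introduce context-aware and network-aware features, analyzing the news content together with the context in which it was shared and the network of users who spread it. Specifically, the paper proposes GETAE (Graph Information Enhanced Deep Neural Network Ensemble Architecture), a new ensemble architecture with a Text Branch and a Propagation Branch. The Text Branch uses word and Transformer embeddings together with feed-forward and bidirectional recurrent neural networks (BiRNN) to learn new contextual features and produce a Text Content Embedding. The Propagation Branch models information propagation in the graph network and uses node embeddings to produce a Propagation Embedding. GETAE then combines the two embeddings into a Propagation-Enhanced Content Embedding used for classification, which improves fake news detection on the Twitter15 and Twitter16 datasets and outperforms state-of-the-art models.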

Link: https://arxiv.org/abs/2412.01825
Authors: Ciprian-Octavian Truică, Elena-Simona Apostol, Marius Marogel, Adrian Paschke
Keywords: today digital age, Text Content Embedding, Content Embedding, Text Branch, Deep Neural Network
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:In today’s digital age, fake news has become a major problem that has serious consequences, ranging from social unrest to political upheaval. To address this issue, new methods for detecting and mitigating fake news are required. In this work, we propose to incorporate contextual and network-aware features into the detection process. This involves analyzing not only the content of a news article but also the context in which it was shared and the network of users who shared it, i.e., the information diffusion. Thus, we propose GETAE, Graph Information Enhanced Deep Neural Network Ensemble Architecture for Fake News Detection, a novel ensemble architecture that uses textual content together with the social interactions to improve fake news detection. GETAE contains two Branches: the Text Branch and the Propagation Branch. The Text Branch uses Word and Transformer Embeddings and a Deep Neural Network based on feed-forward and bidirectional Recurrent Neural Networks ([Bi]RNN) for learning novel contextual features and creating a novel Text Content Embedding. The Propagation Branch considers the information propagation within the graph network and proposes a Deep Learning architecture that employs Node Embeddings to create novel Propagation Embedding. GETAE Ensemble combines the two novel embeddings, i.e., Text Content Embedding and Propagation Embedding, to create a novel Propagation-Enhanced Content Embedding which is afterward used for classification. The experimental results obtained on two real-world publicly available datasets, i.e., Twitter15 and Twitter16, prove that using this approach improves fake news detection and outperforms state-of-the-art models.

[NLP-1] COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

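[Quick read]: This paper addresses a weakness of Vision-Language Models (VLMs) pre-trained with contrastive loss: because the loss is global in nature, the models focus predominantly on foreground objects and neglect other important information in the image. The key to the solution is COSMOS, a framework that integrates a new text-cropping strategy and a cross-attention module into a self-supervised learning framework, creating global and local views of both images and texts (multi-modal augmentations). COSMOS optimizes cross-modal representations with a cross-modality self-distillation loss. It outperforms previous strong baselines on zero-shot downstream tasks such as retrieval, classification, and semantic segmentation, and it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.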

Link: https://arxiv.org/abs/2412.01814
Authors: Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Keywords: achieved significant advancements, achieved significant, significant advancements, vision and language, contrastive loss
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

[NLP-2] A Neurosymbolic Fast and Slow Architecture for Graph Coloring

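[Quick read]: This paper addresses the difficulty and solving efficiency of Constraint Satisfaction Problems (CSPs) in artificial intelligence. The key to the solution is SOFAI-v2, an enhanced architecture based on Daniel Kahneman's "Thinking, Fast and Slow" cognitive model with refined metacognitive governance mechanisms. SOFAI-v2 combines a fast System 1 (S1) based on Large Language Models (LLMs) with a deliberative System 2 (S2) governed by a metacognition module. S1 proposes an initial solution, and the metacognitive governance supplies targeted feedback and examples to adapt S1 to the CSP's requirements. If S1 fails, metacognition strategically invokes S2 to obtain an accurate and reliable solution. Experiments show that on graph coloring problems SOFAI-v2 achieves a 16.98% higher success rate and is 32.42% faster than symbolic solvers.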

Link: https://arxiv.org/abs/2412.01752
Authors: Vedant Khandelwal, Vishal Pallagani, Biplav Srivastava, Francesca Rossi
Keywords: present significant challenges, artificial intelligence due, Constraint Satisfaction Problems, Large Language Models, Daniel Kahneman Thinking
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 Pages, 18 Figures, 3 Tables


Abstract:Constraint Satisfaction Problems (CSPs) present significant challenges to artificial intelligence due to their intricate constraints and the necessity for precise solutions. Existing symbolic solvers are often slow, and prior research has shown that Large Language Models (LLMs) alone struggle with CSPs because of their complexity. To bridge this gap, we build upon the existing SOFAI architecture (or SOFAI-v1), which adapts Daniel Kahneman’s “Thinking, Fast and Slow” cognitive model to AI. Our enhanced architecture, SOFAI-v2, integrates refined metacognitive governance mechanisms to improve adaptability across complex domains, specifically tailored for solving CSPs like graph coloring. SOFAI-v2 combines a fast System 1 (S1) based on LLMs with a deliberative System 2 (S2) governed by a metacognition module. S1’s initial solutions, often limited by non-adherence to constraints, are enhanced through metacognitive governance, which provides targeted feedback and examples to adapt S1 to CSP requirements. If S1 fails to solve the problem, metacognition strategically invokes S2, ensuring accurate and reliable solutions. With empirical results, we show that SOFAI-v2 for graph coloring problems achieves a 16.98% increased success rate and is 32.42% faster than symbolic solvers.
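
To make the fast-then-slow control flow concrete, here is a minimal sketch for graph coloring. It is not the SOFAI-v2 implementation: the fast solver is a hypothetical random-guess stand-in for the LLM-based S1, the feedback is simply the list of violated edges, and the slow solver is plain backtracking. All function names and the `max_fast_attempts` parameter are assumptions made for illustration.

```python
import random


def conflicts(graph, coloring):
    """Return the edges whose two endpoints share a color."""
    return [(u, v) for u in graph for v in graph[u]
            if u < v and coloring[u] == coloring[v]]


def fast_guess(graph, k, rng, previous=None, bad_edges=()):
    """Hypothetical S1 stand-in: a cheap random guess that re-rolls flagged nodes."""
    coloring = dict(previous) if previous else {n: rng.randrange(k) for n in graph}
    for u, _ in bad_edges:  # feedback: only nodes on violated edges are changed
        coloring[u] = rng.randrange(k)
    return coloring


def slow_solve(graph, k):
    """Hypothetical S2 stand-in: exhaustive backtracking, exact but slower."""
    nodes = list(graph)
    coloring = {}

    def backtrack(i):
        if i == len(nodes):
            return True
        node = nodes[i]
        for color in range(k):
            if all(coloring.get(nb) != color for nb in graph[node]):
                coloring[node] = color
                if backtrack(i + 1):
                    return True
                del coloring[node]
        return False

    return coloring if backtrack(0) else None


def solve(graph, k, max_fast_attempts=3, seed=0):
    """Try the fast solver with feedback a few times, then fall back to the slow one."""
    rng = random.Random(seed)
    guess, bad = None, ()
    for _ in range(max_fast_attempts):
        guess = fast_guess(graph, k, rng, guess, bad)
        bad = conflicts(graph, guess)
        if not bad:
            return guess, "S1"
    return slow_solve(graph, k), "S2"


square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
coloring, solved_by = solve(square, k=2)
print(solved_by, coloring)
```

The controller returns which system produced the answer, so the share of problems handled by the fast path can be measured. When no valid coloring exists (a triangle with two colors, for example), the fast attempts always fail and the slow solver returns `None`.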

[NLP-3] Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models NEURIPS2024

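[Quick read]: This paper addresses the unwanted biases that large language models (LLMs) inherit from their training data, which can harm marginalized communities. The key to the solution is to use small biased and anti-biased expert models to produce a debiasing signal that is added to the LLM output at decoding time. The approach combines resource efficiency with interpretability and can be optimized to mitigate specific types of bias for the target use case. Experiments on gender, race, and religion biases show reduced bias on several local and global bias metrics while language model performance is preserved.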

Link: https://arxiv.org/abs/2412.01711
Authors: Schrasing Tong, Eliott Zemour, Rawisara Lohanimit, Lalana Kagal
Keywords: perpetuate unwanted biases, unwanted biases present, range of applications, training data, potentially leading
Subjects: Computation and Language (cs.CL)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Safe Generative AI Workshop


Abstract:Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for marginalized communities. In this paper, we mitigate bias by leveraging small biased and anti-biased expert models to obtain a debiasing signal that will be added to the LLM output at decoding-time. This approach combines resource efficiency with interpretability and can be optimized for mitigating specific types of bias, depending on the target use case. Experiments on mitigating gender, race, and religion biases show a reduction in bias on several local and global bias metrics while preserving language model performance.
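
As a rough illustration of adding a debiasing signal at decoding time, the sketch below adjusts next-token logits. The abstract does not give the formula, so the form used here is an assumption: the signal is the difference between the anti-biased and biased expert logits, scaled by a hypothetical weight `alpha`. The three-word vocabulary and all logit values are made up.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debias_logits(base_logits, biased_logits, anti_biased_logits, alpha=1.0):
    """Add alpha * (anti-biased - biased) to the base logits (assumed signal form)."""
    return [
        base + alpha * (anti - biased)
        for base, biased, anti in zip(base_logits, biased_logits, anti_biased_logits)
    ]


vocab = ["he", "she", "they"]      # toy vocabulary, for illustration only
base = [2.0, 1.0, 0.5]             # large model prefers "he"
biased = [2.0, 0.0, 0.0]           # biased expert pushes "he"
anti_biased = [0.0, 2.0, 0.0]      # anti-biased expert pushes "she"

adjusted = debias_logits(base, biased, anti_biased, alpha=0.25)
print(dict(zip(vocab, adjusted)))  # {'he': 1.5, 'she': 1.5, 'they': 0.5}
print(softmax(adjusted))
```

Because only the small experts are trained and the signal is a per-token vector, one can inspect which tokens it raises or lowers, which is one way to read the interpretability claim in the abstract.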

[NLP-4] Query Performance Explanation through Large Language Model for HTAP Systems ICDE2025

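[Quick read]: This paper addresses the difficulty users have in understanding why query plans from one engine (OLAP or OLTP) perform significantly slower than those from another in hybrid transactional and analytical processing (HTAP) systems. The key is a framework built on large language models (LLMs) and Retrieval-Augmented Generation (RAG) that constructs a knowledge base of historical query executions and expert-curated explanations. Query plans are embedded with a lightweight tree-CNN so that relevant knowledge can be retrieved efficiently, which lets the LLM generate clear, context-aware explanations of the performance differences. The approach shows the potential of LLMs in hybrid engine systems and opens a path for database optimization and user support.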

Link: https://arxiv.org/abs/2412.01709
Authors: Haibo Xiu, Li Zhang, Tieying Zhang, Jun Yang, Jianjun Chen
Keywords: OLAP or OLTP, perform significantly slower, analytical processing, perform significantly, transactional and analytical
Subjects: Databases (cs.DB); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to ICDE 2025


Abstract:In hybrid transactional and analytical processing (HTAP) systems, users often struggle to understand why query plans from one engine (OLAP or OLTP) perform significantly slower than those from another. Although optimizers provide plan details via the EXPLAIN function, these explanations are frequently too technical for non-experts and offer limited insights into performance differences across engines. To address this, we propose a novel framework that leverages large language models (LLMs) to explain query performance in HTAP systems. Built on Retrieval-Augmented Generation (RAG), our framework constructs a knowledge base that stores historical query executions and expert-curated explanations. To enable efficient retrieval of relevant knowledge, query plans are embedded using a lightweight tree-CNN classifier. This augmentation allows the LLM to generate clear, context-aware explanations of performance differences between engines. Our approach demonstrates the potential of LLMs in hybrid engine systems, paving the way for further advancements in database optimization and user support.

[NLP-5] Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

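[Quick read]: This paper examines the risks of misuse and the vulnerabilities of large language models (LLMs) in scholarly peer review. Its approach is to identify and analyze manipulation risks and inherent flaws in LLM-generated reviews. Experiments show that authors can manipulate LLM reviews by injecting covert deliberate content into their manuscripts, which inflates ratings and reduces alignment with human reviews. The study also finds inherent flaws, such as potentially rating incomplete papers higher than full papers and favoring well-known authors. These findings highlight the risks of over-reliance on LLMs in peer review and the need for robust safeguards before widespread adoption.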

Link: https://arxiv.org/abs/2412.01708
Authors: Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, Siheng Chen
Keywords: Scholarly peer review, Scholarly peer, increasing manuscript submissions, peer review, cornerstone of scientific
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 27 pages, 24 figures


Abstract:Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs’ susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers compared to full papers and favoring well-known authors in single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

[NLP-6] Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index COLING2025

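[Quick read]: This paper addresses how to evaluate and optimize the cost-effectiveness of prompt engineering techniques under resource constraints. The key is the Economical Prompting Index (EPI), a metric that combines accuracy scores with token consumption and is adjusted by a user-specified cost concern level to reflect different resource constraints. Comparing six advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, the paper finds that some complex techniques gain little accuracy while costing far more, so simpler techniques achieve a better EPI in resource-constrained scenarios. The finding calls for re-evaluating complex prompting strategies in such settings, which could reshape future research priorities and improve cost-effectiveness for end users.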

Link: https://arxiv.org/abs/2412.01690
Authors: Tyler McDonald, Anthony Colosimo, Yifeng Li, Ali Emami
Keywords: research rapidly evolves, Economical Prompting Index, rapidly evolves, developing cost-effective techniques, prompt engineering research
Subjects: Computation and Language (cs.CL)
Comments: 5 pages (excluding references), accepted to Coling 2025


Abstract:As prompt engineering research rapidly evolves, evaluations beyond accuracy are crucial for developing cost-effective techniques. We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption, adjusted by a user-specified cost concern level to reflect different resource constraints. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, across 10 widely-used language models and 4 diverse datasets. We demonstrate that approaches such as Self-Consistency often provide statistically insignificant gains while becoming cost-prohibitive. For example, on high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques like Chain-of-Thought (0.72) surpasses more complex methods like Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a reevaluation of complex prompting strategies in resource-constrained scenarios, potentially reshaping future research priorities and improving cost-effectiveness for end-users.

[NLP-7] R-Bot: An LLM-based Query Rewrite System

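[Quick read]: This paper addresses SQL query rewrite, where traditional heuristic and learning-based methods are limited in quality and robustness. The key to the solution is R-Bot, a system that draws on the natural language and code comprehension abilities of large language models (LLMs) and improves rewrite accuracy and efficiency in three steps: 1) a multi-source rewrite evidence preparation pipeline generates rewrite evidences that guide the LLM away from hallucinations; 2) a hybrid structure-semantics retrieval method combines structural and semantic analysis to retrieve the most relevant rewrite evidences; 3) a step-by-step LLM rewrite method iteratively uses the retrieved evidences to select and arrange rewrite rules, with self-reflection to keep the rewrite accurate. Experiments on widely used benchmarks show that R-Bot surpasses state-of-the-art query rewrite methods.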

Link: https://arxiv.org/abs/2412.01661
Authors: Zhaoyan Sun, Xuanhe Zhou, Guoliang Li
Keywords: optimizing SQL queries, optimizing SQL, SQL queries, essential for optimizing, queries to improve
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Query rewrite is essential for optimizing SQL queries to improve their execution efficiency without changing their results. Traditionally, this task has been tackled through heuristic and learning-based methods, each with its limitations in terms of inferior quality and low robustness. Recent advancements in LLMs offer a new paradigm by leveraging their superior natural language and code comprehension abilities. Despite their potential, directly applying LLMs like GPT-4 has faced challenges due to problems such as hallucinations, where the model might generate inaccurate or irrelevant results. To address this, we propose R-Bot, an LLM-based query rewrite system with a systematic approach. We first design a multi-source rewrite evidence preparation pipeline to generate query rewrite evidences for guiding LLMs to avoid hallucinations. We then propose a hybrid structure-semantics retrieval method that combines structural and semantic analysis to retrieve the most relevant rewrite evidences for effectively answering an online query. We next propose a step-by-step LLM rewrite method that iteratively leverages the retrieved evidences to select and arrange rewrite rules with self-reflection. We conduct comprehensive experiments on widely used benchmarks, and demonstrate the superior performance of our system, R-Bot, surpassing state-of-the-art query rewrite methods.

[NLP-8] Concept Based Continuous Prompts for Interpretable Text Classification

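[Quick read]: This paper addresses the unclear mechanism by which continuous prompts improve performance on natural language tasks. The key is a framework that interprets continuous prompts by decomposing them into human-readable concepts. Specifically, the framework shows that a concept embedding matrix and a coefficient matrix can always be found to replace the prompt embedding matrix, which makes the decomposition feasible. It then uses GPT-4o to generate a concept pool and selects discriminative and representative candidate concepts with a novel submodular optimization algorithm. Experiments show that, using only a few concepts, the framework achieves results similar to the original P-tuning and word-based approaches while providing more plausible results.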

Link: https://arxiv.org/abs/2412.01644
Authors: Qian Chen, Dongyang Li, Xiaofeng He
Keywords: natural language tasks, interpreting continuous prompts, language tasks, Continuous prompts, widely adopted
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Continuous prompts have become widely adopted for augmenting performance across a wide range of natural language tasks. However, the underlying mechanism of this enhancement remains obscure. Previous studies rely on individual words for interpreting continuous prompts, which lacks comprehensive semantic understanding. Drawing inspiration from Concept Bottleneck Models, we propose a framework for interpreting continuous prompts by decomposing them into human-readable concepts. Specifically, to ensure the feasibility of the decomposition, we demonstrate that a corresponding concept embedding matrix and a coefficient matrix can always be found to replace the prompt embedding matrix. Then, we employ GPT-4o to generate a concept pool and choose potential candidate concepts that are discriminative and representative using a novel submodular optimization algorithm. Experiments demonstrate that our framework can achieve similar results as the original P-tuning and word-based approaches using only a few concepts while providing more plausible results. Our code is available at this https URL.

[NLP-9] Using Large Language Models in Automatic Hint Ranking and Generation Tasks

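[Quick read]: This paper addresses how to stimulate and preserve human cognitive abilities and reasoning skills in an era of readily accessible information. The key is to promote hints as an alternative or supplement to direct answers. The paper builds WIKIHINT, a manually constructed hint dataset, and fine-tunes open-source large language models (LLMs) such as LLaMA-3.1 to generate hints in answer-aware and answer-agnostic settings. It also evaluates hint effectiveness and introduces HINTRANK, a lightweight method for evaluating and ranking hints. The findings are that the dataset helps generate more effective hints, that including answer information along with the question generally improves hint quality, and that encoder-based models outperform decoder-based models in hint ranking.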

Link: https://arxiv.org/abs/2412.01626
Authors: Jamshid Mozafari, Florian Gerhold, Adam Jatowt
Keywords: Large Language Models, Large Language, increased significantly recently, individuals frequently interacting, Language Models
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:


Abstract:The use of Large Language Models (LLMs) has increased significantly recently, with individuals frequently interacting with chatbots to receive answers to a wide range of questions. In an era where information is readily accessible, it is crucial to stimulate and preserve human cognitive abilities and maintain strong reasoning skills. This paper addresses such challenges by promoting the use of hints as an alternative or a supplement to direct answers. We first introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000 hints created for 1,000 questions. We then finetune open-source LLMs such as LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We assess the effectiveness of the hints with human participants who try to answer questions with and without the aid of hints. Additionally, we introduce a lightweight evaluation method, HINTRANK, to evaluate and rank hints in both answer-aware and answer-agnostic settings. Our findings show that (a) the dataset helps generate more effective hints, (b) including answer information along with questions generally improves hint quality, and (c) encoder-based models perform better than decoder-based models in hint ranking.

[NLP-10] CHIMA: Headline-Guided Extractive Summarization for Thai News Articles

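[Quick read]: This paper addresses text summarization for low-resource languages such as Thai, where earlier extractive models relied mainly on the article body and ignored the headline. The key is CHIMA, a model that uses the headline's contextual information to guide sentence selection and so improve the quality and relevance of summaries. CHIMA uses a pre-trained language model to capture complex language semantics and assigns each sentence a probability of being included in the summary, taking headline-body similarity into account through two aggregation strategies, simple average and harmonic mean. Experiments show that CHIMA outperforms baseline models on ROUGE, BLEU, and F1 scores and is better at recalling critical sentences, including those in the middle or at the end of an article.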

Link: https://arxiv.org/abs/2412.01624
Authors: Pimpitchaya Kositcharoensuk, Nakarin Sritrakool, Ploy N. Pratanwanich
Keywords: condensing lengthy texts, Thai, process of condensing, condensing lengthy, preserving their essential
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Text summarization is a process of condensing lengthy texts while preserving their essential information. Previous studies have predominantly focused on high-resource languages, while low-resource languages like Thai have received less attention. Furthermore, earlier extractive summarization models for Thai texts have primarily relied on the article’s body, without considering the headline. This omission can result in the exclusion of key sentences from the summary. To address these limitations, we propose CHIMA, an extractive summarization model that incorporates the contextual information of the headline for Thai news articles. Our model utilizes a pre-trained language model to capture complex language semantics and assigns a probability to each sentence to be included in the summary. By leveraging the headline to guide sentence selection, CHIMA enhances the model’s ability to recover important sentences and discount irrelevant ones. Additionally, we introduce two strategies for aggregating headline-body similarities, simple average and harmonic mean, providing flexibility in sentence selection to accommodate varying writing styles. Experiments on publicly available Thai news datasets demonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1 scores. These results highlight the effectiveness of incorporating the headline-body similarities as model guidance. The results also indicate an enhancement in the model’s ability to recall critical sentences, even those scattered throughout the middle or end of the article. With this potential, headline-guided extractive summarization offers a promising approach to improve the quality and relevance of summaries for Thai news articles.
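
The two aggregation strategies named in the abstract, simple average and harmonic mean, behave quite differently, and a small sketch shows how. The abstract does not say exactly which quantities are aggregated, so this example assumes two hypothetical per-sentence scores in [0, 1] (for instance, a model score and a headline similarity); the function names and the `top_k` helper are illustrative only.

```python
def simple_average(a, b):
    """Arithmetic mean: one high score can compensate for a low one."""
    return (a + b) / 2


def harmonic_mean(a, b):
    """Harmonic mean: stays low unless both scores are reasonably high."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def top_k(sentences, scores, k):
    """Pick the k highest-scoring sentences, returned in document order."""
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


# Hypothetical (score_1, score_2) pairs for three sentences.
pairs = [(0.9, 0.1), (0.5, 0.5), (0.6, 0.7)]
sentences = ["s1", "s2", "s3"]

avg_scores = [simple_average(a, b) for a, b in pairs]
hm_scores = [harmonic_mean(a, b) for a, b in pairs]
print(avg_scores)
print(hm_scores)
print(top_k(sentences, hm_scores, k=2))
```

A sentence scored (0.9, 0.1) gets 0.5 under the simple average but only 0.18 under the harmonic mean, so the harmonic mean favors sentences on which both signals agree. Which behavior is preferable presumably depends on the writing style of the articles, which matches the flexibility argument in the abstract.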

[NLP-11] NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers COLING2025

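[Quick read]: This paper addresses the weakness of large language models (LLMs) in deliberate reasoning. The key is NYT-Connections, a benchmark of 358 simple word classification puzzles derived from the New York Times Connections game. The benchmark is designed to penalize quick, intuitive "System 1" thinking so that fundamental reasoning skills can be isolated and evaluated. Six recent LLMs, a simple machine learning heuristic, and humans were evaluated in three configurations: single attempt, multiple attempts without hints, and multiple attempts with contextual hints. Even top-performing LLMs such as GPT-4 fall short of human performance by nearly 30%. In addition, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. By combining linguistic isolation, resistance to intuitive shortcuts, and regular updates that mitigate data leakage, NYT-Connections offers a new tool for assessing LLM reasoning capabilities.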

Link: https://arxiv.org/abs/2412.01621
Authors: Angel Yahir Loredo Lopez, Tyler McDonald, Ali Emami
Keywords: Large Language Models, Large Language, Language Models, York Times Connections, reasoning remains questionable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages (excluding references), accepted to Coling 2025


Abstract:Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive “System 1” thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.

[NLP-12] If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World

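[Quick read]: This paper examines the use of widespread large language models (LLMs) such as ChatGPT by lonely users and the risks involved. Its central point is that ChatGPT is not designed to mitigate loneliness, so this use raises ethical and legal questions, particularly because the model handles sensitive scenarios such as suicidal ideation or trauma poorly. Analyzing user interactions with ChatGPT, the paper finds that in dialogues classified as lonely, users frequently (37%) sought advice or validation and received good engagement, but ChatGPT failed in sensitive scenarios. The authors also observed a 35% higher incidence of toxic content, with women 22 times more likely to be targeted than men. The paper stresses the ethical and legal risks of this technology and closes with recommendations for research and industry on addressing loneliness.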

Link: https://arxiv.org/abs/2412.01617
Authors: Adrian de Wynter
Keywords: fulfilling relationships, significantly impacts, lack of fulfilling, impacts a person, person mental
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:


Abstract:Loneliness, or the lack of fulfilling relationships, significantly impacts a person’s mental and physical well-being and is prevalent worldwide. Previous research suggests that large language models (LLMs) may help mitigate loneliness. However, we argue that the use of widespread LLMs like ChatGPT is more prevalent–and riskier, as they are not designed for this purpose. To explore this, we analysed user interactions with ChatGPT, particularly those outside of its marketed use as task-oriented assistant. In dialogues classified as lonely, users frequently (37%) sought advice or validation, and received good engagement. However, ChatGPT failed in sensitive scenarios, like responding appropriately to suicidal ideation or trauma. We also observed a 35% higher incidence of toxic content, with women being 22 times more likely to be targeted than men. Our findings underscore ethical and legal questions about this technology, and note risks like radicalisation or further isolation. We conclude with recommendations for research and industry to address loneliness.

[NLP-13] Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking

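[Quick read]: This paper addresses the challenge of applying clinical decision making (CDM) in real-world medical scenarios, in particular the limited performance of existing Large Language Model (LLM)-based agents on complex, dynamic clinical tasks. The key contributions are the MedChain dataset and the MedChain-Agent system. MedChain contains 12,163 clinical cases covering five key stages of the clinical workflow and emphasizes three features of real clinical practice: personalization, interactivity, and sequentiality. MedChain-Agent integrates a feedback mechanism and an MCase-RAG module so that it can learn from previous cases and adapt its responses, which improves its adaptability and performance on sequential clinical tasks.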

Link: https://arxiv.org/abs/2412.01605
Authors: Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
Keywords: dynamic process crucial, Large Language Model, Clinical decision making, artificial intelligence systems, decision making
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance in the CDM in real-world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches. The relevant dataset and code will be released upon acceptance of this paper.

[NLP-14] Scaling Law for Language Models Training Considering Batch Size

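[Quick read]: This paper studies how the global batch size, a critical hyperparameter, affects the training of large language models (LLMs). The key is to examine experimentally how different batch sizes and learning rates affect model convergence and generalization, and from this to establish batch size scaling laws for two cases: a fixed compute budget and a fixed amount of training data. Extrapolation experiments validate the laws, which offer guidance for optimizing LLM training strategies under specific resource constraints.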

Link: https://arxiv.org/abs/2412.01505
Authors: Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren
Keywords: made remarkable advances, Large language models, Large language, recent years, rapid progress
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch size, influences the LLM training process. We begin by training language models ranging from 125 million to 2.6 billion parameters, using up to 300 billion high-quality tokens. Through these experiments, we establish a basic scaling law on model size and training data amount. We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models. Our analysis yields batch size scaling laws under two different cases: with a fixed compute budget, and with a fixed amount of training data. Extrapolation experiments on models of increasing sizes validate our predicted laws, which provides guidance for optimizing LLM training strategies under specific resource constraints.

[NLP-15] Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization

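[Quick read]: This paper addresses the slow inference of large language models (LLMs) caused by their large number of parameters. The key is to explore early exit (EE) in LLMs, specifically EE without additional output layers or joint optimization. The study finds that EE is a natural capability of Transformer-based models, although joint optimization is still needed to locate the optimal EE layer more accurately through gating functions. The paper also reveals patterns in EE behavior from a sub-word perspective and explores the possibility of EE based on sub-layers.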

Link: https://arxiv.org/abs/2412.01455
Authors: Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, junxin Wang, Tong Xiao, Jingbo Zhu
Keywords: exhibit exceptional performance, Large language models, Large language, exhibit exceptional, downstream tasks
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Large language models (LLMs) exhibit exceptional performance across various downstream tasks. However, they encounter limitations due to slow inference speeds stemming from their extensive parameters. The early exit (EE) is an approach that aims to accelerate auto-regressive decoding. EE generates outputs from intermediate layers instead of using the whole model, which offers a promising solution to this challenge. However, additional output layers and joint optimization used in conventional EE hinder the application of EE in LLMs. In this paper, we explore the possibility of LLMs EE without additional output layers and joint optimization. Our findings indicate that EE is a natural capability within transformer-based models. While joint optimization does not give model EE capability, it must be employed to address challenges by improving the accuracy of locating the optimal EE layer through gating functions. Additionally, our study reveals patterns in EE behavior from a sub-word perspective based on the LLaMA model and the potential possibility for EE based on sub-layers.

[NLP-16] PLD+: Accelerating LLM inference by leveraging Language Model Artifacts

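[Quick read]: This paper addresses the high latency of autoregressive large language model (LLM) inference. The key is PLD+, a suite of algorithms that accelerates inference by exploiting the substantial overlap between output and input in input-guided tasks (code editing, text editing, summarization, and so on) together with artifacts generated during inference, namely attention and hidden states. PLD+ requires no additional compute and no fine-tuning. In the greedy setting it even outperforms EAGLE, the state-of-the-art tuning-dependent approach, on four tasks, by a margin of up to 2.31 in average speedup.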

链接: https://arxiv.org/abs/2412.01447
作者: Shwetha Somasundaram,Anirudh Phukan,Apoorv Saxena
关键词-EN: verified in parallel, speculative decoding, reduce the latency, future tokens, tokens are drafted
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:To reduce the latency associated with autoregressive LLM inference, speculative decoding has emerged as a novel decoding paradigm, where future tokens are drafted and verified in parallel. However, the practical deployment of speculative decoding is hindered by its requirements for additional computational resources and fine-tuning, which limits its out-of-the-box usability. To address these challenges, we present PLD+, a suite of novel algorithms developed to accelerate the inference process of LLMs, particularly for input-guided tasks. These tasks, which include code editing, text editing, summarization, etc., often feature outputs with substantial overlap with their inputs, an attribute PLD+ is designed to exploit. PLD+ also leverages the artifacts (attention and hidden states) generated during inference to accelerate inference speed. We test our approach on five input-guided tasks and through extensive experiments we find that PLD+ outperforms all tuning-free approaches. In the greedy setting, it even outperforms the state-of-the-art tuning-dependent approach EAGLE on four of the tasks (by a margin of up to 2.31 in terms of avg. speedup). Our approach is tuning free, does not require any additional compute and can easily be used for accelerating inference of any LLM.
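
The draft-and-verify idea behind input-guided speculative decoding can be sketched as follows. This is a simplified illustration that drafts tokens by plain n-gram lookup in the prompt; it does not use attention or hidden states to rank candidates as PLD+ is described as doing. All names (`draft_from_prompt`, `verify`, `next_token_fn`) and the toy target are assumptions, and verification runs sequentially here for clarity, whereas a real system checks the whole draft in one parallel forward pass.

```python
def draft_from_prompt(prompt, generated, ngram=2, max_draft=4):
    """Propose draft tokens by matching the last `ngram` tokens in the prompt."""
    context = prompt + generated
    if len(context) < ngram:
        return []
    key = context[-ngram:]
    # Search the prompt right-to-left so the most recent match wins.
    for start in range(len(prompt) - ngram, -1, -1):
        if prompt[start:start + ngram] == key:
            follow = prompt[start + ngram:start + ngram + max_draft]
            if follow:
                return follow
    return []

def verify(draft, next_token_fn, context):
    """Keep the longest draft prefix the target model would also emit greedily."""
    accepted = []
    for tok in draft:
        if next_token_fn(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Toy code-editing task: the output copies the prompt but renames `b` to `c`.
prompt = "def add ( a , b ) : return a + b".split()
target = "def add ( a , c ) : return a + c".split()
generated = ["def", "add"]

# Hypothetical stand-in for the target LLM's greedy next-token choice.
next_token_fn = lambda ctx: target[len(ctx)]

draft = draft_from_prompt(prompt, generated)
accepted = verify(draft, next_token_fn, generated)
```

Here four tokens are drafted from the prompt and three of them are accepted, so three output tokens cost a single verification step instead of three decoding steps.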

[NLP-17] Multi-Facet Blending for Faceted Query-by-Example Retrieval

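[Quick Read]: This paper addresses fine-grained user intent matching, specifically how faceted query-by-example (QBE) can retrieve similar documents conditioned on specific facets. Prior approaches rely mainly on citation-based document-level comparisons, which restricts them to citation-based domains and fails to capture the intricacies of facet constraints. The proposed solution is the multi-facet blending (FaBle) augmentation method, whose key idea is to explicitly synthesize facet-specific training sets by modularly decomposing and recomposing documents. Concretely, FaBle automatically decomposes documents into facet units, uses the intrinsic distinguishing capabilities of large language models (LLMs) to generate relevant or irrelevant pairs, and then dynamically recomposes the units into document pairs with facet-wise relevance. The method needs no pre-defined facet knowledge or labels and markedly helps training of facet-conditional embeddings. The paper also releases a new QBE benchmark dataset of educational exam items to validate FaBle in a new domain.

Link: https://arxiv.org/abs/2412.01443
Authors: Heejin Do,Sangwon Ryu,Jonghwi Kim,Gary Geunbae Lee
Keywords (EN): fine-grained user intents, gained recent attention, fit fine-grained user, retrieves similar documents, similar documents conditioned
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: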

Abstract:With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs’ intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to prove the FaBle’s efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.

[NLP-18] A Top-down Graph-based Tool for Modeling Classical Semantic Maps: A Crosslinguistic Case Study of Supplementary Adverbs

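[Quick Read]: This paper addresses the labor-intensive and time-consuming manual construction of semantic map models (SMMs). The key is a graph-based algorithm that automatically generates conceptual spaces and SMMs in a top-down manner. Specifically, the algorithm first creates a dense graph, then prunes it into maximum spanning trees, which are selected according to the proposed evaluation metrics. These metrics include intrinsic and extrinsic measures and account for network structure and the trade-off between precision and coverage. A case study on cross-linguistic supplementary adverbs shows that the model is effective and efficient compared with human annotations and other automated methods.

Link: https://arxiv.org/abs/2412.01423
Authors: Zhu Liu,Cunliang Kong,Ying Liu,Maosong Sun
Keywords (EN): Semantic map models, Semantic map, construct a network-like, instances or forms, connectivity hypothesis
Subjects: Computation and Language (cs.CL)
Comments: Paper under review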

Abstract:Semantic map models (SMMs) construct a network-like conceptual space from cross-linguistic instances or forms, based on the connectivity hypothesis. This approach has been widely used to represent similarity and entailment relationships in cross-linguistic concept comparisons. However, most SMMs are manually built by human experts using bottom-up procedures, which are often labor-intensive and time-consuming. In this paper, we propose a novel graph-based algorithm that automatically generates conceptual spaces and SMMs in a top-down manner. The algorithm begins by creating a dense graph, which is subsequently pruned into maximum spanning trees, selected according to metrics we propose. These evaluation metrics include both intrinsic and extrinsic measures, considering factors such as network structure and the trade-off between precision and coverage. A case study on cross-linguistic supplementary adverbs demonstrates the effectiveness and efficiency of our model compared to human annotations and other automated methods. The tool is available at this https URL.
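
The pruning step, from a dense graph to a maximum spanning tree, can be illustrated with Kruskal's algorithm run on edges sorted by descending weight. This is a generic sketch rather than the authors' tool: the node names and the edge weights (imagined here as co-expression counts between functions) are hypothetical, and the paper's metric-based selection among candidate trees is not modeled.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on edges sorted by descending weight."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Hypothetical dense graph over four functions of supplementary adverbs.
nodes = ["again", "also", "still", "else"]
edges = [
    ("again", "also", 5), ("again", "still", 1), ("also", "still", 4),
    ("still", "else", 3), ("again", "else", 2), ("also", "else", 1),
]
tree = maximum_spanning_tree(nodes, edges)
total_weight = sum(w for _, _, w in tree)
```

The resulting tree keeps the three strongest links that connect all four nodes and drops the weaker, redundant ones.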

[NLP-19] Impromptu Cybercrime Euphemism Detection

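[Quick Read]: This paper addresses the detection of impromptu euphemisms, a challenge in content security that existing methods handle poorly. The key is the Impromptu Cybercrime Euphemisms Detection (ICED) dataset together with a detection framework tailored to the problem. The framework consists of a coarse-grained and a fine-grained classification model. The coarse-grained model removes most harmless content, while the fine-grained model uses context augmentation modeling and multi-round iterative training to better predict the actual meaning of a masked token. Experiments show a 76-fold improvement over the previous state-of-the-art euphemism detector.

Link: https://arxiv.org/abs/2412.01413
Authors: Xiang Li,Yucheng Zhou,Laiping Zhao,Jing Li,Fangming Liu
Keywords (EN): social media platforms, existing methods designed, Detecting euphemisms, Impromptu Cybercrime Euphemisms, Cybercrime Euphemisms Detection
Subjects: Computation and Language (cs.CL)
Comments: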

Abstract:Detecting euphemisms is essential for content security on various social media platforms, but existing methods designed for detecting euphemisms are ineffective on impromptu euphemisms. In this work, we make a first attempt at exploring impromptu euphemism detection and introduce the Impromptu Cybercrime Euphemisms Detection (ICED) dataset. Moreover, we propose a detection framework tailored to this problem, which employs context augmentation modeling and multi-round iterative training. Our detection framework mainly consists of a coarse-grained and a fine-grained classification model. The coarse-grained classification model removes most of the harmless content in the corpus to be detected. The fine-grained model, the impromptu euphemisms detector, integrates context augmentation and multi-round iterative training to better predict the actual meaning of a masked token. In addition, we leverage ChatGPT to evaluate the model’s capability. Experimental results demonstrate that our approach achieves a remarkable 76-fold improvement compared to the previous state-of-the-art euphemism detector.

[NLP-20] Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning COLING2025

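[Quick Read]: This paper addresses the detection of online abusive content in low-resource languages, particularly in the audio modality. The key is to combine pre-trained audio representation models (such as Wav2Vec and Whisper) with Few Shot Learning (FSL) in the Model-Agnostic Meta-Learning (MAML) framework for cross-lingual abuse detection. Experiments on the ADIMA dataset examine the effect of different shot sizes (50-200) on performance, and a feature visualization study helps explain model behaviour. The work highlights the generalization ability of pre-trained models in low-resource scenarios and offers insights into abuse detection in multilingual settings.

Link: https://arxiv.org/abs/2412.01408
Authors: Aditya Narayan Sankaran,Reza Farahbaksh,Noel Crespi
Keywords (EN): Online abusive content, Online abusive, remains underexplored, abusive content detection, audio modality
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as part of the proceedings of COLING 2025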

Abstract:Online abusive content detection, particularly in low-resource settings and within the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, in this case, in Indian languages using Few Shot Learning (FSL). Leveraging powerful representations from models such as Wav2Vec and Whisper, we explore cross-lingual abuse detection using the ADIMA dataset with FSL. Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand model behaviour. This study highlights the generalization ability of pre-trained models in low-resource scenarios and offers valuable insights into detecting abusive language in multilingual contexts.

[NLP-21] CLASSLA-Express: a Train of CLARIN.SI Workshops on Language Resources and Tools with Easily Expanding Route WWW

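[Quick Read]: This paper addresses the dissemination and use of linguistic resources for South Slavic languages. The key is the CLASSLA-Express workshop series, which relies on two main strategies: holding workshops directly in countries with interested audiences, and designing the series so that it expands easily to new venues. In this way the series shares knowledge on corpus querying tools and promotes the recently released CLASSLA-web corpora, the largest general corpora for South Slavic languages.

Link: https://arxiv.org/abs/2412.01386
Authors: Nikola Ljubešić,Taja Kuzman,Ivana Filipović Petrović,Jelena Parizoska,Petya Osenova
Keywords (EN): CLASSLA Knowledge Centre, http URL infrastructure, disseminating linguistic resources, South Slavic languages, CLASSLA-Express workshop series
Subjects: Computation and Language (cs.CL)
Comments: Published in CLARIN Annual Conference Proceedings 2024 ( this https URL )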

Abstract:This paper introduces the CLASSLA-Express workshop series as an innovative approach to disseminating linguistic resources and infrastructure provided by the CLASSLA Knowledge Centre for South Slavic languages and the Slovenian this http URL infrastructure. The workshop series employs two key strategies: (1) conducting workshops directly in countries with interested audiences, and (2) designing the series for easy expansion to new venues. The first iteration of the CLASSLA-Express workshop series encompasses 6 workshops in 5 countries. Its goal is to share knowledge on the use of corpus querying tools, as well as the recently-released CLASSLA-web corpora - the largest general corpora for South Slavic languages. In the paper, we present the design of the workshop series, its current scope and the effortless extensions of the workshop to new venues that are already in sight.

[NLP-22] Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

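[Quick Read]: This paper addresses the performance bottleneck caused by limited memory bandwidth during large language model (LLM) inference. The key is Dynamic Input Pruning (DIP), a predictor-free dynamic sparsification approach that reduces the DRAM bandwidth needed per token while preserving accuracy with minimal fine-tuning. Its main contributions are: 1) dynamic sparsification that does not rely on a predictor, 2) lightweight LoRA adapters that regain some of the performance lost to sparsification, and 3) a novel cache-aware masking strategy that considers the cache state and activation magnitude to raise the cache hit rate, improving LLM token rate on mobile devices. Experiments show that DIP achieves better accuracy, memory, and throughput trade-offs than other methods across simulated hardware settings.

Link: https://arxiv.org/abs/2412.01380
Authors: Marco Federici,Davide Belli,Mart van Baalen,Amir Jalalirad,Andrii Skliar,Bence Major,Markus Nagel,Paul Whatmough
Keywords (EN): DRAM bandwidth, compute power, effective DRAM bandwidth, DIP, improvements in DRAM
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Main Text: 10 pages, 11 figures. Appendix: 3 pages, 3 figures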

Abstract:While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which result in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and 40% increase in throughput with 0.1 loss in perplexity.
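
A rough sketch of the two selection rules mentioned in the abstract: magnitude-based pruning keeps the `k` units with the largest absolute activation, and a cache-aware variant also considers whether a unit's weights are already cached. The scoring rule below, which multiplies the magnitude of uncached units by a discount `gamma`, is an assumption made for illustration; the abstract does not state the paper's actual formula, and all names and numbers here are hypothetical.

```python
def topk_mask(scores, k):
    """Return the indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def magnitude_mask(activations, k):
    """Keep the k units with the largest absolute activation."""
    return topk_mask([abs(a) for a in activations], k)

def cache_aware_mask(activations, cached, k, gamma=0.5):
    """Keep k units by |activation|, discounting units whose weights are not cached.

    gamma < 1 is a hypothetical penalty for a unit that must be loaded from DRAM.
    """
    scores = [
        abs(a) * (1.0 if i in cached else gamma)
        for i, a in enumerate(activations)
    ]
    return topk_mask(scores, k)

activations = [0.9, -0.8, 0.1, 0.7, -0.05]
cached = {2, 3}  # indices whose weights are already in the cache

plain = magnitude_mask(activations, k=2)
aware = cache_aware_mask(activations, cached, k=2)
hits_plain = len(plain & cached)
hits_aware = len(aware & cached)
```

In this toy case the plain rule selects two uncached units, while the cache-aware rule swaps one of them for a slightly smaller but cached unit, trading a little magnitude for one cache hit.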

[NLP-23] Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge

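[Quick Read]: This paper addresses the limits of traditional manual log analysis for fault and error management in computer systems, and in particular the gap between natural language and domain-specific language that restricts existing LLM-based solutions in real-world applications. The key is to integrate interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), improving performance on log tasks while retaining natural language processing capabilities. The paper presents the SuperLog model, trained on NLPLog, a dataset of over 250,000 question-answer pairs; SuperLog achieves the best performance on four log analysis tasks, surpassing the second-best model by 12.01% on average.

Link: https://arxiv.org/abs/2412.01377
Authors: Yuhe Ji,Yilun Liu,Feiyu Yao,Minggui He,Shimin Tao,Xiaofeng Zhao,Su Chang,Xinhua Yang,Weibin Meng,Yuming Xie,Boxing Chen,Hao Yang
Keywords (EN): computer systems necessitates, systems necessitates innovative, necessitates innovative approaches, traditional manual log, error management
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: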

Abstract:The increasing complexity of computer systems necessitates innovative approaches to fault and error management, going beyond traditional manual log analysis. While existing solutions using large language models (LLMs) show promise, they are limited by a gap between natural and domain-specific languages, which restricts their effectiveness in real-world applications. Our approach addresses these limitations by integrating interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), enhancing performance on log tasks while retaining natural language processing capabilities. We created a comprehensive dataset, NLPLog, with over 250,000 question-answer pairs to facilitate this integration. Our model, SuperLog, trained with this dataset, achieves the best performance across four log analysis tasks, surpassing the second-best model by an average of 12.01%. Our contributions include a novel CPT paradigm that significantly improves model performance, the development of SuperLog with state-of-the-art results, and the release of a large-scale dataset to support further research in this domain.

[NLP-24] Understanding the World's Museums through Vision-Language Reasoning

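[Quick Read]: This paper addresses reasoning beyond visual features when understanding museum exhibits from their images. The key steps are: (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking the models on five visual question answering tasks. Two VLMs of different types are trained: BLIP, which has vision-language aligned embeddings but lacks the expressive power of large language models, and LLaVA, a powerful instruction-tuned large language model (LLM) enriched with vision-language reasoning capabilities. The experiments show that large vision-language models do better on questions that require grounding visual features in repositories of human knowledge.

Link: https://arxiv.org/abs/2412.01370
Authors: Ada-Astrid Balauca,Sanjana Garai,Stefan Balauca,Rasesh Udayakumar Shetty,Naitik Agrawal,Dhwanil Subhashbhai Shah,Yuqian Fu,Xi Wang,Kristina Toutanova,Danda Pani Paudel,Luc Van Gool
Keywords (EN): preserving well-documented collections, spanning diverse epochs, historical artifacts spanning, artifacts spanning diverse, diverse epochs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: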

Abstract:Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. Data reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring the quality as well as the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights on the complex and fine-grained understanding of museum exhibits. In particular, we show that some questions whose answers can often be derived directly from visual features are well answered by both types of models. On the other hand, questions that require the grounding of the visual features in repositories of human knowledge are better answered by the large vision-language models, thus demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: this https URL

[NLP-25] A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls

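[Quick Read]: This paper addresses the fine-grained evaluation of literary machine translation from English to Korean. The key is a two-stage evaluation framework that provides fine-grained, interpretable metrics suited to literary translation and correlates better with human judgment than traditional machine translation metrics. Even so, the framework still falls short of inter-human agreement on some metrics, such as Korean honorifics. The paper also observes that LLMs tend to favor translations generated by other LLMs, and stresses the need for more sophisticated, accurate, and culturally sensitive evaluation methods.

Link: https://arxiv.org/abs/2412.01340
Authors: Sheikh Shafayat,Dongkeun Yoon,Woori Jang,Jiwoo Choi,Alice Oh,Seohyon Jung
Keywords (EN): evaluate literary machine, two-stage pipeline, English to Korean, machine translation, fine-grained manner
Subjects: Computation and Language (cs.CL)
Comments: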

Abstract:In this work, we propose and evaluate the feasibility of a two-stage pipeline to evaluate literary machine translation, in a fine-grained manner, from English to Korean. The results show that our framework provides fine-grained, interpretable metrics suited for literary translation and obtains a higher correlation with human judgment than traditional machine translation metrics. Nonetheless, it still fails to match inter-human agreement, especially in metrics like Korean Honorifics. We also observe that LLMs tend to favor translations generated by other LLMs, and we highlight the necessity of developing more sophisticated evaluation methods to ensure accurate and culturally sensitive machine translation of literary works.

[NLP-26] Exploring Long-Term Prediction of Type 2 Diabetes Microvascular Complications ML4H ALT

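[Quick Read]: This paper addresses the reliance on domain expertise and the potential data loss involved in mapping electronic healthcare record (EHR) data between different clinical ontologies (such as ICD10 and SNOMED). The key is a code-agnostic representation that encodes individual EHRs as text using fine-tuned, pretrained clinical language models, allowing data from different systems to be integrated. The results show that the code-agnostic approach outperforms a code-based model on long-term microvascular complication prediction, and that performance improves with longer prediction windows but is biased towards the first occurring complication. The study highlights the importance of context length for model performance and offers a starting point for generalisable clinical models.

Link: https://arxiv.org/abs/2412.01331
Authors: Elizabeth Remfry,Rafael Henkin,Michael R Barnes,Aakanksha Naik
Keywords (EN): Electronic healthcare records, Electronic healthcare, huge wealth, Electronic, healthcare records
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 9 pages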

Abstract:Electronic healthcare records (EHR) contain a huge wealth of data that can support the prediction of clinical outcomes. EHR data is often stored and analysed using clinical codes (ICD10, SNOMED), however these can differ across registries and healthcare providers. Integrating data across systems involves mapping between different clinical ontologies requiring domain expertise, and at times resulting in data loss. To overcome this, code-agnostic models have been proposed. We assess the effectiveness of a code-agnostic representation approach on the task of long-term microvascular complication prediction for individuals living with Type 2 Diabetes. Our method encodes individual EHRs as text using fine-tuned, pretrained clinical language models. Leveraging large-scale EHR data from the UK, we employ a multi-label approach to simultaneously predict the risk of microvascular complications across 1-, 5-, and 10-year windows. We demonstrate that a code-agnostic approach outperforms a code-based model and illustrate that performance is better with longer prediction windows but is biased to the first occurring complication. Overall, we highlight that context length is vitally important for model performance. This study highlights the possibility of including data from across different clinical ontologies and is a starting point for generalisable clinical models.

[NLP-27] The “LLM World of Words” English free association norms generated by large language models

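[Quick Read]: This paper addresses the lack of large-scale LLM-generated free association norms that are comparable with human-generated norms, which hinders research into the knowledge encoded in LLMs and their potential biases. The key is a new dataset, the “LLM World of Words” (LWOW), modeled after the human-generated “Small World of Words” (SWOW) norms, which consist of approximately 12,000 cue words. Three LLMs (Mistral, Llama3, and Haiku) are prompted with the same cues as in SWOW, and the resulting norms are used to build comparable cognitive network models of the conceptual knowledge in human and LLM semantic memory, and to reveal implicit biases such as prevalent gender stereotypes.

Link: https://arxiv.org/abs/2412.01330
Authors: Katherine Abramski,Riccardo Improta,Giulio Rossetti,Massimo Stella
Keywords (EN): LLM-generated free association, free association norms, psychology and linguistics, linguistics for studying, free association
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures, associated Github page with dataset available at: this https URL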

Abstract:Free associations have been extensively used in cognitive psychology and linguistics for studying how conceptual knowledge is organized. Recently, the potential of applying a similar approach for investigating the knowledge encoded in LLMs has emerged, specifically as a method for investigating LLM biases. However, the absence of large-scale LLM-generated free association norms that are comparable with human-generated norms is an obstacle to this new research direction. To address this limitation, we create a new dataset of LLM-generated free association norms modeled after the “Small World of Words” (SWOW) human-generated norms consisting of approximately 12,000 cue words. We prompt three LLMs, namely Mistral, Llama3, and Haiku, with the same cues as those in the SWOW norms to generate three novel comparable datasets, the “LLM World of Words” (LWOW). Using both SWOW and LWOW norms, we construct cognitive network models of semantic memory that represent the conceptual knowledge possessed by humans and LLMs. We demonstrate how these datasets can be used for investigating implicit biases in humans and LLMs, such as the harmful gender stereotypes that are prevalent both in society and LLM outputs.
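
As a small illustration of how free association norms become a network, the sketch below turns (cue, response) pairs into a weighted directed graph. The record format, the lowercasing, and the use of relative response frequency as edge strength are assumptions for this example; they are not taken from the LWOW or SWOW data files.

```python
from collections import Counter, defaultdict

def build_association_network(records):
    """Map each cue to {response: strength} from (cue, response) pairs.

    Strength is the share of a cue's responses that name a given word.
    """
    counts = defaultdict(Counter)
    for cue, response in records:
        counts[cue.lower()][response.lower()] += 1
    network = {}
    for cue, responses in counts.items():
        total = sum(responses.values())
        network[cue] = {r: c / total for r, c in responses.items()}
    return network

# Hypothetical responses; real norms collect several responses per cue.
records = [
    ("doctor", "nurse"),
    ("doctor", "hospital"),
    ("Doctor", "nurse"),
    ("nurse", "hospital"),
]
network = build_association_network(records)
```

Building one such network from human norms and one from LLM norms over the same cues makes it possible to compare edge strengths for the same cue-response pair across the two sources.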

[NLP-28] SiTSE: Sinhala Text Simplification Dataset and Evaluation

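[Quick Read]: This paper addresses the scarcity of datasets for text simplification in low-resource languages such as Sinhala. The key is a human-curated Sinhala text simplification dataset of 1,000 complex sentences, each with simplified versions from three different annotators, for 3,000 simplified sentences in total. The task is modeled as a zero-shot, zero-resource sequence-to-sequence (seq-seq) task on the multilingual models mT5 and mBART. Using intermediate task transfer learning (ITTL), the paper shows that ITTL outperforms previously proposed zero-resource text simplification methods. It also highlights the difficulty of evaluating text simplification systems and supports calls for better quality metrics that suit low-resource languages.

Link: https://arxiv.org/abs/2412.01293
Authors: Surangika Ranathunga,Rumesh Sirithunga,Himashi Rathnayake,Lahiru De Silva,Thamindu Aluthwala,Saman Peramuna,Ravi Shekhar
Keywords (EN): Text Simplification, minimally explored, Simplification, Text, text simplification systems
Subjects: Computation and Language (cs.CL)
Comments: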

Abstract:Text Simplification is a task that has been minimally explored for low-resource languages. Consequently, there are only a few manually curated datasets. In this paper, we present a human curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and corresponding 3,000 simplified sentences produced by three different human annotators. We model the text simplification task as a zero-shot and zero resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems, and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: this https URL

[NLP-29] Shadow of the (Hierarchical) Tree: Reconciling Symbolic and Predictive Components of the Neural Code for Syntax

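[Quick Read]: This paper addresses how to integrate two very different frameworks, symbolic representations and connectionist neural networks, with natural language syntax as the test case. The key is a hybrid neurosymbolic model built on the neurocomputational architecture ROSE. Specifically, the higher levels of ROSE account for vertical phrase structure representations, while the tuning of the lower levels to statistical and perceptual inferences explains linear, horizontal linguistic information. The model predicts that artificial language models will contribute to the cognitive neuroscience of horizontal morphosyntax, but much less to hierarchically compositional structures. The paper also discusses how predictive coding mechanisms can serve as interfaces between symbolic oscillatory phase codes and population codes for the statistics of linearized aspects of syntax. Finally, it provides a neurosymbolic mathematical model for injecting symbolic representations into a neural regime that encodes lexico-semantic statistical features.

Link: https://arxiv.org/abs/2412.01276
Authors: Elliot Murphy
Keywords (EN): infamously distinct frameworks, Natural language syntax, connectionist neural networks, Natural language, distinct frameworks
Subjects: Computation and Language (cs.CL)
Comments: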

Abstract:Natural language syntax can serve as a major test for how to integrate two infamously distinct frameworks: symbolic representations and connectionist neural networks. Building on a recent neurocomputational architecture for syntax (ROSE), I discuss the prospects of reconciling the neural code for hierarchical ‘vertical’ syntax with linear and predictive ‘horizontal’ processes via a hybrid neurosymbolic model. I argue that the former can be accounted for via the higher levels of ROSE in terms of vertical phrase structure representations, while the latter can explain horizontal forms of linguistic information via the tuning of the lower levels to statistical and perceptual inferences. One prediction of this is that artificial language models will contribute to the cognitive neuroscience of horizontal morphosyntax, but much less so to hierarchically compositional structures. I claim that this perspective helps resolve many current tensions in the literature. Options for integrating these two neural codes are discussed, with particular emphasis on how predictive coding mechanisms can serve as interfaces between symbolic oscillatory phase codes and population codes for the statistics of linearized aspects of syntax. Lastly, I provide a neurosymbolic mathematical model for how to inject symbolic representations into a neural regime encoding lexico-semantic statistical features.

[NLP-30] MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

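[Quick Read]: This paper addresses the cost-effectiveness of multilingual image generation. The key is to use text encoders pre-trained on widely available, noisy Internet image-text pairs, which significantly improves data efficiency in multilingual text-to-image (T2I) generation. Specifically, the paper proposes MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters that is trained alongside a frozen text encoder and image diffusion model. The framework lowers training cost, reaches generation quality in over 110 languages comparable to English, and integrates seamlessly with community tools such as LoRA, LCM, ControlNet, and IP-Adapter, widening its potential use cases.

Link: https://arxiv.org/abs/2412.01271
Authors: Sen Xing,Muyan Zhong,Zeqiang Lai,Liangchen Li,Jiawen Liu,Yaohui Wang,Jifeng Dai,Wenhai Wang
Keywords (EN): explore a cost-effective, noisy Internet image-text, Internet image-text pairs, noisy Internet, Abstract
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: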

Abstract:In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan, Multi-Language adapter, a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (38.61 for English vs. 37.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.

[NLP-31] CPRM: A LLM-based Continual Pre-training Framework for Relevance Modeling in Commercial Search

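[Quick Read]: This paper addresses relevance modeling between queries and items in commercial search engines, in particular that large language models (LLMs) lack domain-specific knowledge, do not fully exploit in-context learning, and leave structured item text underutilized. The key is a continual pre-training framework (CPRM) with three modules: 1) joint pre-training on queries and multi-field item text to enhance domain knowledge; 2) in-context pre-training, a novel approach in which LLMs are pre-trained on a sequence of related queries or items; 3) reading comprehension on items to produce associated domain knowledge and background information (for example, summaries and corresponding queries) to further strengthen the LLMs. Offline experiments and online A/B testing show convincing performance compared to strong baselines.

Link: https://arxiv.org/abs/2412.01269
Authors: Kaixin Wu,Yixin Ji,Zeyuan Chen,Qiang Wang,Cunxiang Wang,Hong Liu,Baijun Ji,Jia Xu,Zhongyi Liu,Jinjie Gu,Yuan Zhou,Linjian Mo
Keywords (EN): commercial search engines, Relevance modeling, LLM-based relevance modeling, directly affecting, user experience
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: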

Abstract:Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage in the supply of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) employing both queries and multi-field item to jointly pre-train for enhancing domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results on offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines.

[NLP-32] Indexing Economic Fluctuation Narratives from Keiki Watchers Survey

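[Quick Read]: This paper addresses how to make better use of the information in economic text, especially causal relationships, for predicting economic fluctuations. The key is a set of fluctuation indices built from economic survey text using the authors' previously proposed narrative framework. The evaluation shows that the proposed indices have a stronger correlation with the cumulative lagging diffusion index than other types of diffusion indices, suggesting potential value for economic forecasting.

Link: https://arxiv.org/abs/2412.01265
Authors: Eriko Shigetsugu,Hiroki Sakaji,Itsuki Noda
Keywords (EN): fluctuation narratives derived, economic, indices, design indices, economic fluctuation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: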

Abstract:In this paper, we design indices of economic fluctuation narratives derived from economic surveys. Companies, governments, and investors rely on key metrics like GDP and industrial production indices to predict economic trends. However, they have yet to effectively leverage the wealth of information contained in economic text, such as causal relationships, in their economic forecasting. Therefore, we design indices of economic fluctuation from economic surveys by using our previously proposed narrative framework. From the evaluation results, it is observed that the proposed indices had a stronger correlation with cumulative lagging diffusion index than other types of diffusion indices.

[NLP-33] Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue?

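[Quick Read]: This paper addresses the weak performance of large language models (LLMs) in task-oriented dialogue (TOD), where reasoning and access to external information are crucial. The key is to guide LLMs in TOD with an advanced prompting strategy, reasoning and acting (ReAct). ReAct-based LLMs (ReAct-LLMs) are evaluated both in simulation and with real users. Although ReAct-LLMs appear to underperform state-of-the-art approaches in simulation, human evaluation indicates higher user satisfaction than handcrafted systems, despite a lower success rate.

Link: https://arxiv.org/abs/2412.01262
Authors: Michelle Elizabeth,Morgan Veyret,Miguel Couceiro,Ondrej Dusek,Lina M. Rojas-Barahona
Keywords (EN): Large language models, gained immense popularity, immense popularity due, Large language, language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: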

点击查看摘要

Abstract:Large language models (LLMs) gained immense popularity due to their impressive capabilities in unstructured conversations. However, they underperform compared to previous approaches in task-oriented dialogue (TOD), wherein reasoning and accessing external information are crucial. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing TOD. We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs seem to underperform state-of-the-art approaches in simulation, human evaluation indicates higher user satisfaction rate compared to handcrafted systems despite having a lower success rate.

[NLP-34] Yi-Lightning Technical Report

[Quick read]: This paper addresses how to build a high-performing large language model (LLM) that is also safe and cost-effective in practical use. The key to its approach is an enhanced Mixture-of-Experts (MoE) architecture with advanced expert segmentation and routing mechanisms and optimized KV-caching. The report also describes a multi-stage training strategy spanning comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), together with synthetic data construction and reward modeling. For safety, it introduces RAISE (Responsible AI Safety Engine), a four-component framework covering the pre-training, post-training, and serving phases. According to the report, these choices substantially reduce training, deployment, and inference costs while maintaining high performance.

Link: https://arxiv.org/abs/2412.01253
Authors: 01.AI:Alan Wake,Albert Wang,Bei Chen,C.X. Lv,Chao Li,Chengen Huang,Chenglin Cai,Chujie Zheng,Daniel Cooper,Ethan Dai,Fan Zhou,Feng Hu,Heng Ji,Howard Qiu,Jiangcheng Zhu,Jun Tian,Katherine Su,Lihuan Zhang,Liying Li,Ming Song,Mou Li,Peng Liu,Qichen Hu,Shawn Wang,Shijun Zhou,Shiyong Li,Tianhang Zhu,Wen Xie,Xiang He,Xiaobo Chen,Xiaohui Hu,Xiaoyi Ren,Xinyao Niu,Yanpeng Li,Yongke Zhao,Yongzhen Luo,Yuchi Xu,Yuxuan Sha,Zhaodong Yan,Zhiyuan Liu,Zirui Zhang
Keywords: large language model, technical report presents, latest flagship large, flagship large language, report presents Yi-Lightning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks’ utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at this https URL.

[NLP-35] Data Uncertainty-Aware Learning for Multimodal Aspect-based Sentiment Analysis

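[Quick read]: This paper addresses data uncertainty in multimodal aspect-based sentiment analysis (MABSA): noise in low-quality samples, such as low-resolution images, makes aspect-level sentiment hard to recognize. The key to its approach is UA-MABSA, a data uncertainty-aware method whose quality assessment strategy considers both image quality and aspect-based cross-modal relevance. It assigns different loss weights to samples of different quality so that the model pays more attention to high-quality and challenging samples. Experiments show state-of-the-art performance on the Twitter-2015 dataset, and further analysis supports the effectiveness of the quality assessment strategy.

Link: https://arxiv.org/abs/2412.01249
Authors: Hao Yang,Zhenyu Zhang,Yanyan Zhao,Bing Qin
Keywords: identifying aspect-level sentiment, aspect-level sentiment information, multimodal aspect-based sentiment, aspect-based sentiment analysis, text-image pair
Categories: Computation and Language (cs.CL)
Comments: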

Abstract:As a fine-grained task, multimodal aspect-based sentiment analysis (MABSA) mainly focuses on identifying aspect-level sentiment information in the text-image pair. However, we observe that it is difficult to recognize the sentiment of aspects in low-quality samples, such as those with low-resolution images that tend to contain noise. In the real world, the quality of data usually varies across samples; such noise is called data uncertainty. Previous works on the MABSA task nevertheless treat samples of different quality with the same importance and ignore the influence of data uncertainty. In this paper, we propose a novel data uncertainty-aware multimodal aspect-based sentiment analysis approach, UA-MABSA, which weights the loss of different samples by data quality and difficulty. UA-MABSA adopts a novel quality assessment strategy that takes into account both the image quality and the aspect-based cross-modal relevance, thus enabling the model to pay more attention to high-quality and challenging samples. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance on the Twitter-2015 dataset. Further analysis demonstrates the effectiveness of the quality assessment strategy.
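
The loss weighting described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two scores (`image_quality`, `relevance`, both assumed to lie in [0, 1]), the product used to combine them, and all function names are assumptions made for illustration only.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label under predicted probabilities."""
    return -math.log(probs[label])

def sample_weight(image_quality, relevance):
    # Hypothetical combination rule: product of two scores in [0, 1].
    return image_quality * relevance

def weighted_loss(batch):
    """Weight-normalized mean of per-sample cross-entropy losses."""
    total, weight_sum = 0.0, 0.0
    for probs, label, image_quality, relevance in batch:
        w = sample_weight(image_quality, relevance)
        total += w * cross_entropy(probs, label)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0

batch = [
    ([0.7, 0.2, 0.1], 0, 0.9, 0.8),  # sharp image, relevant to the aspect
    ([0.3, 0.4, 0.3], 1, 0.2, 0.5),  # low-resolution image, weakly relevant
]
loss = weighted_loss(batch)
```

In this toy batch the second sample has the larger raw loss but a much smaller weight, so the aggregate stays close to the loss of the first sample. The abstract also mentions sample difficulty, which this sketch does not model.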

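[NLP-36] GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering COLING2025

[Quick read]: This paper addresses a weakness of existing methods for Complex Table Question Answering: their reasoning is implicit, which makes it hard to filter out irrelevant information in the table. The key to its approach is GraphOTTER, which converts the complex table into an undirected graph and reasons over it step by step, with each step guided by a set of pre-defined intermediate reasoning actions. This yields a clear reasoning path that leads to the answer. The authors suggest that the method's success may stem from efficiently filtering out irrelevant information, which keeps the reasoning focused on the most pertinent data.

Link: https://arxiv.org/abs/2412.01230
Authors: Qianlong Li,Chen Huang,Shuai Li,Yuanxin Xiang,Deng Xiong,Wenqiang Lei
Keywords: Answering involves providing, Question Answering involves, Table Question Answering, flexible header locations, involves providing accurate
Categories: Computation and Language (cs.CL)
Comments: COLING 2025, code is available at this https URL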

Abstract:Complex Table Question Answering involves providing accurate answers to specific questions based on intricate tables that exhibit complex layouts and flexible header locations. Despite considerable progress having been made in the LLM era, the reasoning processes of existing methods are often implicit, feeding the entire table into prompts, making it difficult to effectively filter out irrelevant information in the table. To this end, we propose GraphOTTER that explicitly establishes the reasoning process to pinpoint the correct answers. In particular, GraphOTTER leverages a graph-based representation, transforming the complex table into an undirected graph. It then conducts step-by-step reasoning on the graph, with each step guided by a set of pre-defined intermediate reasoning actions. As such, it constructs a clear reasoning path and effectively identifies the answer to a given question. Comprehensive experiments on two benchmark datasets and two LLM backbones demonstrate the effectiveness of GraphOTTER. Further analysis indicates that its success may be attributed to the ability to efficiently filter out irrelevant information, thereby focusing the reasoning process on the most pertinent data. Our code and experimental datasets are available at this https URL.
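
To make the table-to-graph idea concrete, here is a small sketch of one possible construction. The abstract does not specify how edges are defined, so the adjacency rule below (link horizontally and vertically neighboring cells) and the `neighbors` action are assumptions, not GraphOTTER's actual design.

```python
from collections import defaultdict

def table_to_graph(table):
    """Build an undirected graph whose nodes are (row, col) cell positions.

    Assumption: two cells are linked when they are horizontally or
    vertically adjacent; the paper's actual construction may differ.
    """
    graph = defaultdict(set)
    for r, row in enumerate(table):
        for c, _ in enumerate(row):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < len(table) and nc < len(table[nr]):
                    # Add the edge in both directions to keep it undirected.
                    graph[(r, c)].add((nr, nc))
                    graph[(nr, nc)].add((r, c))
    return graph

def neighbors(table, graph, cell):
    """Hypothetical reasoning action: list the values adjacent to a cell."""
    return sorted(table[r][c] for r, c in graph[cell])

table = [
    ["Region", "2022", "2023"],
    ["North", "10", "12"],
    ["South", "7", "9"],
]
graph = table_to_graph(table)
around_center = neighbors(table, graph, (1, 1))
```

A step-by-step reasoner could call an action like `neighbors` repeatedly, visiting only the cells that matter for the question instead of placing the whole table in the prompt.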

[NLP-37] MiningGPT – A Domain-Specific Large Language Model for the Mining Industry

[Quick read]: This paper addresses the lack of domain-specific understanding in generative large language models (LLMs), taking mining as the target domain. The key to its approach is building a domain-specific LLM: MiningGPT, an instruction-following 7B-parameter model for the mining domain. MiningGPT scores 14% higher on a mining domain knowledge test than its parent model, Mistral 7B instruct.

Link: https://arxiv.org/abs/2412.01189
Authors: Kurukulasooriya Fernando,Gianluca Demartini
Keywords: Large Language Models, exhibited human-like language, human-like language capabilities, Large Language, Recent advancements
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements of generative LLMs (Large Language Models) have exhibited human-like language capabilities but have shown a lack of domain-specific understanding. Therefore, the research community has started the development of domain-specific LLMs for many domains. In this work we focus on discussing how to build mining domain-specific LLMs, as the global mining industry contributes significantly to the worldwide economy. We report on MiningGPT, a mining domain-specific instruction-following 7B parameter LLM model which showed a 14% higher mining domain knowledge test score as compared to its parent model Mistral 7B instruct.

[NLP-38] SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages

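[Quick read]: This paper addresses how to evaluate large language models (LLMs) on Southeast Asian (SEA) languages in a reproducible and robust way. The key to its approach is SailCompass, an evaluation benchmark covering three main SEA languages and eight primary tasks (14 datasets) across three task types: generation, multiple-choice questions, and classification. To make the evaluation more robust, the paper explores different prompt configurations for multiple-choice questions and uses calibration to improve the faithfulness of classification tasks. It also highlights the importance of a balanced language distribution for building better SEA-specialized LLMs and of advanced prompting techniques (e.g., calibration, perplexity-based ranking) for making better use of LLMs.

Link: https://arxiv.org/abs/2412.01186
Authors: Jia Guo,Longxu Dou,Guangtao Zeng,Stanley Kok,Wei Lu,Qian Liu
Keywords: Large Language Models, Southeast Asian Languages, assessing Large Language, Southeast Asian, assessing Large
Categories: Computation and Language (cs.CL)
Comments: code: this https URL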

Abstract:In this paper, we introduce SailCompass, a reproducible and robust evaluation benchmark for assessing Large Language Models (LLMs) on Southeast Asian Languages (SEA). SailCompass encompasses three main SEA languages, eight primary tasks including 14 datasets covering three task types (generation, multiple-choice questions, and classification). To improve the robustness of the evaluation approach, we explore different prompt configurations for multiple-choice questions and leverage calibrations to improve the faithfulness of classification tasks. With SailCompass, we derive the following findings: (1) SEA-specialized LLMs still outperform general LLMs, although the gap has narrowed; (2) A balanced language distribution is important for developing better SEA-specialized LLMs; (3) Advanced prompting techniques (e.g., calibration, perplexity-based ranking) are necessary to better utilize LLMs. All datasets and evaluation scripts are public.

[NLP-39] A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

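[Quick read]: This paper addresses the incomplete evaluation of what pretrained language models (PLMs) know about semantic relations. The key to its approach is a comprehensive evaluation framework covering five relations beyond hypernymy: hyponymy, holonymy, meronymy, antonymy, and synonymy. It uses six metrics, two of them newly introduced (soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability), and compares humans and models fairly on the same task. The study covers 16 PLMs, eight masked and eight causal language models, and finds a significant knowledge gap between humans and models for almost all semantic relations; antonymy is the one relation on which all models perform reasonably well. Overall, masked language models perform significantly better than causal language models.

Link: https://arxiv.org/abs/2412.01131
Authors: Zhihan Cao,Hiroaki Yamada,Simone Teufel,Takenobu Tokunaga
Keywords: language models, language, models, pretrained language models, semantic
Categories: Computation and Language (cs.CL)
Comments: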

Abstract:Recently, much work has concerned itself with the enigma of what exactly PLMs (pretrained language models) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Only one relation was considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that solved by the PLMs. This means that at this point in time, there is only an incomplete view of models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use six metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, asymmetry, prototypicality, and distinguishability and fairly compare humans and models on the same task. Our extensive experiments involve 16 PLMs, eight masked and eight causal language models. Up to now only masked language models had been tested although causal and masked language models treat context differently. Our results reveal a significant knowledge gap between humans and models for almost all semantic relations. Antonymy is the outlier relation where all models perform reasonably well. In general, masked language models perform significantly better than causal language models. Nonetheless, both masked and causal language models are likely to confuse non-antonymy relations with antonymy.

[NLP-40] Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation

[Quick read]: This paper addresses how to improve the zero-shot tool usage, also known as function calling, of large language models (LLMs). Its key findings are: (1) instruction-following data improves both function-calling accuracy and relevance detection; (2) a newly proposed Decision Token, combined with synthetic non-function-call data, enhances relevance detection; and (3) a tailored translation pipeline effectively overcomes multilingual limitations, with significant improvements in Traditional Chinese. Together these methods strengthen LLMs for function calling and multilingual applications.

Link: https://arxiv.org/abs/2412.01130
Authors: Yi-Chang Chen,Po-Chun Hsu,Chan-Jan Hsu,Da-shan Shiu
Keywords: Large language models, advanced autonomous agents, zero-shot tool usage, significantly advanced autonomous, Large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have significantly advanced autonomous agents, particularly in zero-shot tool usage, also known as function calling. This research delves into enhancing the function-calling capabilities of LLMs by exploring different approaches, including prompt formats for integrating function descriptions, blending function-calling and instruction-following data, introducing a novel Decision Token for conditional prompts, leveraging chain-of-thought reasoning, and overcoming multilingual challenges with a translation pipeline. Our key findings and contributions are as follows: (1) Instruction-following data improves both function-calling accuracy and relevance detection. (2) The use of the newly proposed Decision Token, combined with synthetic non-function-call data, enhances relevance detection. (3) A tailored translation pipeline effectively overcomes multilingual limitations, demonstrating significant improvements in Traditional Chinese. These insights highlight the potential for improved function-calling capabilities and multilingual applications in LLMs.

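[NLP-41] Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

[Quick read]: This paper studies the internal reasoning mechanism of language models during symbolic multi-step reasoning, in particular whether chain-of-thought (CoT) outputs are faithful to the model's internal reasoning. The key to its approach is a set of causal probing experiments on controlled arithmetic reasoning tasks, which locate when the model internally settles on its answer (before or after CoT begins) and thereby determine whether it follows a post-hoc "think-to-talk" mode or a step-by-step "talk-to-think" mode. The study finds that simple subproblems are solved before CoT begins, while more complicated multi-hop calculations are performed during CoT.

Link: https://arxiv.org/abs/2412.01113
Authors: Keito Kudo,Yoichi Aoki,Tatsuki Kuribayashi,Shusaku Sone,Masaya Taniguchi,Ana Brassard,Keisuke Sakaguchi,Kentaro Inui
Keywords: symbolic multi-step reasoning, outputs are faithful, study investigates, mechanism of language, symbolic multi-step
Categories: Computation and Language (cs.CL)
Comments: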

Abstract:This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to the model’s internals. Specifically, we inspect when they internally determine their answers, particularly before or after CoT begins, to determine whether models follow a post-hoc “think-to-talk” mode or a step-by-step “talk-to-think” mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models; for example, simple subproblems are solved before CoT begins, and more complicated multi-hop calculations are performed during CoT.

[NLP-42] Revisiting Absence withSymptoms that T Show up Decades Later to Recover Empty Categories

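[Quick read]: This paper addresses the handling of null elements in the English, Chinese, and Korean Penn treebanks. Null elements carry important syntactic and semantic information, yet they are typically treated as entities to be removed in language processing tasks, particularly constituency parsing. The key to its approach is twofold: 1) extending a rule-based method that uses linguistic context information to Chinese, since such methods had previously been applied only to English; and 2) running neural experiments with a language-agnostic sequence-to-sequence model to recover null elements in English (PTB), Chinese (CTB), and Korean (KTB). The rule-based approach reaches an overall F1 score of 80.00 on Chinese, and the neural experiments reach F1 scores of up to 90.94, 85.38, and 88.79 for English, Chinese, and Korean respectively.

Link: https://arxiv.org/abs/2412.01109
Authors: Emily Chen,Nicholas Huang,Casey Robinson,Kevin Xu,Zihao Huang,Jungyeul Park
Keywords: Korean Penn treebanks, Penn treebanks, paper explores null, null elements, Korean Penn
Categories: Computation and Language (cs.CL)
Comments: 10 pages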

Abstract:This paper explores null elements in English, Chinese, and Korean Penn treebanks. Null elements contain important syntactic and semantic information, yet they have typically been treated as entities to be removed during language processing tasks, particularly in constituency parsing. Thus, we work towards the removal and, in particular, the restoration of null elements in parse trees. We focus on expanding a rule-based approach utilizing linguistic context information to Chinese, as rule based approaches have historically only been applied to English. We also worked to conduct neural experiments with a language agnostic sequence-to-sequence model to recover null elements for English (PTB), Chinese (CTB) and Korean (KTB). To the best of the authors’ knowledge, null elements in three different languages have been explored and compared for the first time. In expanding a rule based approach to Chinese, we achieved an overall F1 score of 80.00, which is comparable to past results in the CTB. In our neural experiments we achieved F1 scores up to 90.94, 85.38 and 88.79 for English, Chinese, and Korean respectively with functional labels.

[NLP-43] Automated Extraction of Acronym-Expansion Pairs from Scientific Papers

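[Quick read]: This paper addresses the identification and expansion of abbreviations and acronyms, which are widespread in digital texts. The key to its approach is combining document preprocessing, regular expressions, and a large language model (GPT-4) to identify acronyms and map them to their expansions. Regular expressions identify the acronyms but are limited when it comes to extracting expansions; at that point GPT-4 analyzes the text surrounding the acronym, and restricting the analysis to a small portion of that text reduces the risk of incorrect or multiple expansions. By automating acronym identification and disambiguation, the method improves the precision and efficiency of natural language processing (NLP) techniques.

Link: https://arxiv.org/abs/2412.01093
Authors: Izhar Ali,Million Haileyesus,Serhiy Hnatyshyn,Jan-Lucas Ott,Vasil Hnatyshin
Keywords: project addresses challenges, addresses challenges posed, regular expressions, project addresses, acronyms
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 1 figure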

Abstract:This project addresses challenges posed by the widespread use of abbreviations and acronyms in digital texts. We propose a novel method that combines document preprocessing, regular expressions, and a large language model to identify abbreviations and map them to their corresponding expansions. The regular expressions alone are often insufficient to extract expansions, at which point our approach leverages GPT-4 to analyze the text surrounding the acronyms. By limiting the analysis to only a small portion of the surrounding text, we mitigate the risk of obtaining incorrect or multiple expansions for an acronym. There are several known challenges in processing text with acronyms, including polysemous acronyms, non-local and ambiguous acronyms. Our approach enhances the precision and efficiency of NLP techniques by addressing these issues with automated acronym identification and disambiguation. This study highlights the challenges of working with PDF files and the importance of document preprocessing. Furthermore, the results of this work show that neither regular expressions nor GPT-4 alone can perform well. Regular expressions are suitable for identifying acronyms but have limitations in finding their expansions within the paper due to a variety of formats used for expressing acronym-expansion pairs and the tendency of authors to omit expansions within the text. GPT-4, on the other hand, is an excellent tool for obtaining expansions but struggles with correctly identifying all relevant acronyms. Additionally, GPT-4 poses challenges due to its probabilistic nature, which may lead to slightly different results for the same input. Our algorithm employs preprocessing to eliminate irrelevant information from the text, regular expressions for identifying acronyms, and a large language model to help find acronym expansions to provide the most accurate and consistent results.
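
The regular-expression stage described above might look roughly like the following sketch. It handles only the common "long form (ACRONYM)" pattern and checks word initials; the pattern, the function name, and the sample text are assumptions for illustration, not the authors' code. Acronyms it cannot resolve are returned separately, which is where the paper's method would hand over to an LLM.

```python
import re

# Matches a parenthesized run of two or more capital letters, e.g. "(NLP)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,})\)")

def extract_pairs(text):
    """Return (pairs, unresolved) for patterns like 'long form (LF)'."""
    pairs, unresolved = {}, []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Candidate expansion: the N words right before the parenthesis,
        # where N is the number of letters in the acronym.
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs[acronym] = " ".join(candidate)
        else:
            unresolved.append(acronym)
    return pairs, unresolved

text = (
    "We use natural language processing (NLP) and a large language model (LLM). "
    "Mass spectrometry data (LCMS) was parsed."
)
pairs, unresolved = extract_pairs(text)
```

Here `NLP` and `LLM` are resolved from the preceding words, while `LCMS` is left unresolved because its expansion does not appear right before the parenthesis. This mirrors the limitation of regular expressions that the abstract describes.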

[NLP-44] Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60000 Hours of Synthetic Speech Dialogue Data

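[Quick read]: This paper addresses the lack of research on real-time large speech language models, particularly for Chinese. The key to its approach is KE-Omni, a model built on Ke-SpeechChat, a large-scale, high-quality synthetic speech interaction dataset of 7 million Chinese and English conversations featuring 42,002 speakers and totaling over 60,000 hours. By drawing on this dataset, KE-Omni aims to deliver real-time speech interaction with low latency and high fluency, and the authors present the model and dataset as a contribution to research and development in this area.

Link: https://arxiv.org/abs/2412.01078
Authors: Shuaijiang Zhao,Tingwei Guo,Bajian Xiang,Tongtang Wan,Qiang Niu,Wei Zou,Xiangang Li
Keywords: remarkable low latency, stimulate research interest, represents a significant, enabling real-time interaction, large speech language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: KE-Omni, Ke-SpeechChat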

Abstract:GPT-4o represents a significant milestone in enabling real-time interaction with large language models (LLMs) through speech; its remarkable low latency and high fluency not only capture attention but also stimulate research interest in the field. This real-time speech interaction is particularly valuable in scenarios requiring rapid feedback and immediate responses, dramatically enhancing user experience. However, there is a notable lack of research focused on real-time large speech language models, particularly for Chinese. In this work, we present KE-Omni, a seamless large speech language model built upon Ke-SpeechChat, a large-scale high-quality synthetic speech interaction dataset consisting of 7 million Chinese and English conversations, featuring 42,002 speakers, and totaling over 60,000 hours. This contributes significantly to the advancement of research and development in this field. The model, dataset, code and demo can be accessed at this https URL.

[NLP-45] SAUP: Situation Awareness Uncertainty Propagation on LLM Agent

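[Quick read]: This paper addresses the unreliable outputs of large language models (LLMs) when they make complex decisions inside multistep agent systems. In particular, existing uncertainty estimation methods do not account for uncertainty that accumulates over the multistep decision process or for the dynamic interaction between agents and their environments. The key to its approach is SAUP (Situation Awareness Uncertainty Propagation), a framework that propagates uncertainty through every step of an LLM agent's reasoning and incorporates situational awareness by assigning a situational weight to each step's uncertainty, yielding a comprehensive and accurate uncertainty measure. SAUP is compatible with various one-step uncertainty estimation techniques, and experiments on benchmark datasets show that it significantly outperforms existing state-of-the-art methods, with up to a 20% improvement in AUROC.

Link: https://arxiv.org/abs/2412.01033
Authors: Qiwei Zhao,Xujiang Zhao,Yanchi Liu,Wei Cheng,Yiyou Sun,Mika Oishi,Takao Osaki,Katsushi Matsuda,Huaxiu Yao,Haifeng Chen
Keywords: Large language models, systems enable complex, Large language, enable complex decision-making, complex decision-making processes
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: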

Abstract:Large language models (LLMs) integrated into multistep agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multistep decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent’s reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step’s uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving up to 20% improvement in AUROC.
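
The contrast between final-step estimation and step-wise propagation can be shown with a toy sketch. The abstract does not give SAUP's aggregation rule, so the weight-normalized average below, the function name, and the sample numbers are all assumptions for illustration.

```python
def propagate_uncertainty(step_uncertainties, situational_weights):
    """Aggregate per-step uncertainties into one agent-level score.

    Assumption: a weight-normalized average; the paper's exact
    aggregation rule is not stated in the abstract.
    """
    if len(step_uncertainties) != len(situational_weights):
        raise ValueError("one weight per step is required")
    total_weight = sum(situational_weights)
    if total_weight == 0:
        return 0.0
    weighted = sum(u * w for u, w in zip(step_uncertainties, situational_weights))
    return weighted / total_weight

# Hypothetical per-step uncertainties from any one-step estimator,
# with a larger situational weight on the last step.
steps = [0.1, 0.4, 0.7]
weights = [1.0, 1.0, 2.0]
score = propagate_uncertainty(steps, weights)
final_only = steps[-1]  # what a final-step-only method would report
```

With these numbers the propagated score differs from the final-step value because earlier steps also contribute, in proportion to their weights.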

[NLP-46] Evaluating Automated Radiology Report Quality through Fine-Grained Phrasal Grounding of Clinical Findings

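[Quick read]: This paper addresses the automatic quality evaluation of generative AI reports for chest radiographs, which existing metrics assess from textual information alone. The key to its approach is a new evaluation method that first extracts fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings, and then performs phrasal grounding to localize the associated anatomical regions on the chest radiograph images. Textual and visual measures are then combined to rate the quality of the generated reports. The method is compared with other textual metrics on a gold standard dataset derived from the MIMIC collection, showing robustness and sensitivity to factual errors.

Link: https://arxiv.org/abs/2412.01031
Authors: Razi Mahmood,Pingkun Yan,Diego Machado Reyes,Ge Wang,Mannudeep K. Kalra,Parisa Kaviani,Joy T. Wu,Tanveer Syeda-Mahmood
Keywords: named entity recognition, clinical named entity, entity recognition methods, chest radiographs based, information using lexical
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: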

Abstract:Several evaluation metrics have been developed recently to automatically assess the quality of generative AI reports for chest radiographs based only on textual information using lexical, semantic, or clinical named entity recognition methods. In this paper, we develop a new method of report quality evaluation by first extracting fine-grained finding patterns capturing the location, laterality, and severity of a large number of clinical findings. We then performed phrasal grounding to localize their associated anatomical regions on chest radiograph images. The textual and visual measures are then combined to rate the quality of the generated reports. We present results that compare this evaluation metric with other textual metrics on a gold standard dataset derived from the MIMIC collection and show its robustness and sensitivity to factual errors.

[NLP-47] Detecting Memorization in Large Language Models

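[Quick read]: This paper addresses the tendency of large language models (LLMs) to memorize portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. The key to its approach is an analytical method that detects memorization precisely by examining neuron activations inside the LLM. By identifying specific activation patterns that distinguish memorized from non-memorized tokens, the author trains classification probes that reach near-perfect accuracy. The method is also shown to apply to other mechanisms, such as repetition, and intervening on these activations suppresses memorization without degrading overall model performance, which strengthens evaluation integrity. The work further supports large-scale labeling of tokens and sequences, which the author considers crucial for the training efficiency and results of next-generation AI models.

Link: https://arxiv.org/abs/2412.01014
Authors: Eduardo Slonski
Keywords: raise privacy concerns, natural language processing, Large language models, Large language, achieved impressive results
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: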

Abstract:Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions, often lacking precision due to confounding factors like common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate between memorized and not memorized tokens, we train classification probes that achieve near-perfect accuracy. The approach can also be applied to other mechanisms, such as repetition, as demonstrated in this study, highlighting its versatility. Intervening on these activations allows us to suppress memorization without degrading overall performance, enhancing evaluation integrity by ensuring metrics reflect genuine generalization. Additionally, our method supports large-scale labeling of tokens and sequences, crucial for next-generation AI models, improving training efficiency and results. Our findings contribute to model interpretability and offer practical tools for analyzing and controlling internal mechanisms in LLMs.

[NLP-48] CoRNStack: High-Quality Contrastive Data for Better Code Ranking

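[Quick read]: This paper addresses the weak performance of code retrieval in complex software systems, particularly for bug localization within GitHub repositories. The key to its approach is CoRNStack, a large-scale, high-quality contrastive training dataset for code. The dataset is curated with consistency filtering to eliminate noisy positives and enriched with mined hard negatives, which helps embedding models generalize to more complex retrieval scenarios. Contrastive training with CoRNStack yields state-of-the-art performance across a variety of code retrieval tasks. The paper also explores training code reranking models, which significantly improve the ranking quality of the retrieved results. Finally, combining the code retriever and reranker brings significant improvements in function localization for GitHub issues.

Link: https://arxiv.org/abs/2412.01007
Authors: Tarun Suresh,Revanth Gangi Reddy,Yifei Xu,Zach Nussbaum,Andriy Mulyar,Brandon Duderstadt,Heng Ji
Keywords: advancing code generation, software systems increase, increase in complexity, plays a crucial, crucial role
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: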

Abstract:Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitHub repositories. We hypothesize that a key issue is their reliance on noisy and inconsistent datasets for training, which impedes their ability to generalize to more complex retrieval scenarios. To address these limitations, we introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives, thereby facilitating more effective learning. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks. Furthermore, the dataset can be leveraged for training code reranking models, a largely underexplored area compared to text reranking. Our finetuned code reranking model significantly improves the ranking quality over the retrieved results. Finally, by employing our code retriever and reranker together, we demonstrate significant improvements in function localization for GitHub issues, an important component of real-world software development.
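
To show why hard negatives matter in contrastive training, the sketch below computes an InfoNCE-style loss, a common contrastive objective, for one query. The abstract does not state which loss the authors use, so the objective, the temperature value, and the toy vectors are assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query: -log softmax score of the positive."""
    logits = [dot(query, positive)] + [dot(query, n) for n in negatives]
    scaled = [logit / temperature for logit in logits]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(scaled)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_denominator - scaled[0]

query = [1.0, 0.0]          # embedding of a natural-language query
positive = [0.9, 0.1]       # embedding of the matching code snippet
easy_negative = [0.0, 1.0]  # unrelated snippet
hard_negative = [0.8, 0.2]  # similar-looking but wrong snippet

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

The hard negative produces a much larger loss than the easy one, so it provides a stronger training signal. That is the intuition behind enriching a dataset with mined hard negatives.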

[NLP-49] Competition Dynamics Shape Algorithmic Phases of In-Context Learning

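[Quick read]: This paper asks how the mechanisms of In-Context Learning (ICL) can be studied in a unified setting that also explains the diversity of its behavior. The key to its approach is a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. With this task, the authors decompose model behavior into four broad algorithms that combine a fuzzy retrieval versus inference approach with either unigram or bigram statistics of the context. These algorithms compete to dominate model behavior, and changes in experimental conditions determine which one wins, which explains the transient nature of ICL. The central claim is that ICL is best viewed as a mixture of different algorithms rather than a monolithic capability, which makes universal claims about ICL across all settings hard to sustain.

Link: https://arxiv.org/abs/2412.01003
Authors: Core Francisco Park,Ekdeep Singh Lubana,Itamar Pres,Hidenori Tanaka
Keywords: large language models, significantly expanded, expanded the general-purpose, large language, ICL
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint. Under review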

Abstract:In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to novel tasks using merely the inputted context. This has motivated a series of papers that analyze tractable synthetic domains and postulate precise mechanisms that may underlie ICL. However, the use of relatively distinct setups that often lack a sequence modeling nature to them makes it unclear how general the reported insights from such studies are. Motivated by this, we propose a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. As we show, models trained on this task reproduce most well-known results on ICL, hence offering a unified setting for studying the concept. Building on this setup, we demonstrate we can explain a model’s behavior by decomposing it into four broad algorithms that combine a fuzzy retrieval vs. inference approach with either unigram or bigram statistics of the context. These algorithms engage in a competition dynamics to dominate model behavior, with the precise experimental conditions dictating which algorithm ends up superseding others: e.g., we find merely varying context size or amount of training yields (at times sharp) transitions between which algorithm dictates the model behavior, revealing a mechanism that explains the transient nature of ICL. In this sense, we argue ICL is best thought of as a mixture of different algorithms, each with its own peculiarities, instead of a monolithic capability. This also implies that making general claims about ICL that hold universally across all settings may be infeasible.
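
A data generator for a task of this kind could be sketched as follows. The chain count, state count, transition probabilities, and uniform choice of chain and start state are illustrative assumptions, not the paper's actual setup. The `bigram_counts` helper shows the kind of context statistic the abstract refers to.

```python
import random

def sample_sequence(chains, length, rng):
    """Pick one chain uniformly at random, then roll it out for `length` states."""
    chain_id = rng.randrange(len(chains))
    transition = chains[chain_id]
    state = rng.randrange(len(transition))  # uniform start state (assumption)
    sequence = [state]
    for _ in range(length - 1):
        # Draw the next state from the current state's transition row.
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
        sequence.append(state)
    return chain_id, sequence

def bigram_counts(sequence, num_states):
    """Count observed transitions (a, b) in a context sequence."""
    counts = [[0] * num_states for _ in range(num_states)]
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    return counts

# Two hypothetical 3-state chains forming the mixture.
chains = [
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # "sticky" chain
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]],  # "cycling" chain
]
rng = random.Random(0)
chain_id, sequence = sample_sequence(chains, length=50, rng=rng)
counts = bigram_counts(sequence, num_states=3)
```

A model trained on such sequences must work out from the context which chain generated it. Comparing the observed bigram counts with each chain's transition matrix is one way to do that.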

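[NLP-50] From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation

[Quick read]: This paper addresses domain adaptation (DA) for neural machine translation (NMT) in low-resource languages, specifically when the only resources are a small amount of parallel religious text, a bilingual dictionary, and a monolingual target-domain corpus. The key to the solution is an evaluation and comparison of several methods; the results show that the simplest one, DALI, is the most effective. A follow-up human evaluation nevertheless shows that DA for low-resource NMT still needs more careful investigation.

Link: https://arxiv.org/abs/2412.00966
Authors: Ali Marashian, Enora Rice, Luke Gessler, Alexis Palmer, Katharina von der Wense
Keywords-EN: neural machine translation, train high-performing general, high-performing general neural, general neural machine, domain-specific models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: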

Abstract:Many of the world’s languages have insufficient data to train high-performing general neural machine translation (NMT) models, let alone domain-specific models, and often the only available parallel data are small amounts of religious texts. Hence, domain adaptation (DA) is a crucial issue faced by contemporary NMT and has, so far, been underexplored for low-resource languages. In this paper, we evaluate a set of methods from both low-resource NMT and DA in a realistic setting, in which we aim to translate between a high-resource and a low-resource language with access to only: a) parallel Bible data, b) a bilingual dictionary, and c) a monolingual target-domain corpus in the high-resource language. Our results show that the effectiveness of the tested methods varies, with the simplest one, DALI, being most effective. We follow up with a small human evaluation of DALI, which shows that there is still a need for more careful investigation of how to accomplish DA for low-resource NMT.

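[NLP-51] LLMs as mirrors of societal moral standards: reflection of cultural divergence and agreement across ethical topics

[Quick read]: This paper examines how accurately large language models (LLMs) reflect cross-cultural differences and similarities in moral perspectives. The key to the solution is evaluating LLMs with three main methods: (1) comparing the variance of model-generated moral scores with the variance of survey-based moral scores; (2) cluster alignment analysis, which measures the correspondence between country clusters derived from model-generated moral scores and clusters derived from survey data; (3) probing LLMs with direct comparative prompts. Together these methods systematically assess how well LLMs understand and reflect cultural variation in moral attitudes, provide a basis for improving models so that they capture these nuances more accurately, and underline the importance of mitigating bias and promoting fair representation when LLMs are developed and deployed globally.

Link: https://arxiv.org/abs/2412.00962
Authors: Mijntje Meijer, Hadi Mohammadi, Ayoub Bagheri
Keywords-EN: Large language models, Large language, increasingly pivotal, domains due, due the recent
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: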

Abstract:Large language models (LLMs) have become increasingly pivotal in various domains due to the recent advancements in their performance capabilities. However, concerns persist regarding biases in LLMs, including gender, racial, and cultural biases derived from their training data. These biases raise critical questions about the ethical deployment and societal impact of LLMs. Acknowledging these concerns, this study investigates whether LLMs accurately reflect cross-cultural variations and similarities in moral perspectives. In assessing whether the chosen LLMs capture patterns of divergence and agreement on moral topics across cultures, three main methods are employed: (1) comparison of model-generated and survey-based moral score variances, (2) cluster alignment analysis to evaluate the correspondence between country clusters derived from model-generated moral scores and those derived from survey data, and (3) probing LLMs with direct comparative prompts. All three methods involve the use of systematic prompts and token pairs designed to assess how well LLMs understand and reflect cultural variations in moral attitudes. The findings of this study indicate overall variable and low performance in reflecting cross-cultural differences and similarities in moral values across the models tested, highlighting the necessity for improving models’ accuracy in capturing these nuances effectively. The insights gained from this study aim to inform discussions on the ethical development and deployment of LLMs in global contexts, emphasizing the importance of mitigating biases and promoting fair representation across diverse cultural perspectives.
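
Method (1), the variance comparison, might look roughly like the sketch below. The topic names, country codes, and scores are invented placeholders, and the use of the sample variance is an assumption; the abstract does not specify the exact statistic or score scale.

```python
from statistics import variance

def variance_by_topic(scores):
    """Map each topic to the sample variance of its moral scores across countries."""
    return {topic: variance(list(by_country.values()))
            for topic, by_country in scores.items()}

# Hypothetical moral scores in [-1, 1]; not real survey or model data.
survey = {
    "topic_1": {"A": -0.5, "B": 0.0, "C": 0.5},
    "topic_2": {"A": 0.25, "B": 0.25, "C": 0.25},
}
model = {
    "topic_1": {"A": 0.0, "B": 0.0, "C": 0.0},
    "topic_2": {"A": 0.25, "B": 0.25, "C": 0.25},
}

survey_var = variance_by_topic(survey)
model_var = variance_by_topic(model)

# A negative gap means the model shows less cross-country disagreement than the survey.
gaps = {topic: model_var[topic] - survey_var[topic] for topic in survey_var}
print(gaps)
```

In this toy data the model gives every country the same score on `topic_1`, so it misses disagreement that the survey records, which is the kind of mismatch the variance comparison is meant to expose.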

[NLP-52] Large Language Models as Mirrors of Societal Moral Standards

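[Quick read]: This paper addresses the limitations of current pre-trained language models (PLMs) in understanding moral norms across cultures, focusing on sensitive topics such as "homosexuality" and "divorce". The key to the solution is evaluating monolingual and multilingual models against data from the World Values Survey (WVS) and the Pew Research Center (PEW); the BLOOM model performs best and shows some positive correlations on these topics. The study stresses that even the best-performing model does not fully capture the moral intricacies of different cultures, so developing culturally aware AI systems that better align with universal human values is especially important.

Link: https://arxiv.org/abs/2412.00956
Authors: Evi Papadopoulou, Hadi Mohammadi, Ayoub Bagheri
Keywords-EN: represent moral norms, Prior research, limited extent, cultural contexts, demonstrated that language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: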

Abstract:Prior research has demonstrated that language models can, to a limited extent, represent moral norms in a variety of cultural contexts. This research aims to replicate these findings and further explore their validity, concentrating on issues like ‘homosexuality’ and ‘divorce’. This study evaluates the effectiveness of these models using information from two surveys, the WVS and the PEW, that encompass moral perspectives from over 40 countries. The results show that biases exist in both monolingual and multilingual models, and they typically fall short of accurately capturing the moral intricacies of diverse cultures. However, the BLOOM model shows the best performance, exhibiting some positive correlations, but still does not achieve a comprehensive moral understanding. This research underscores the limitations of current PLMs in processing cross-cultural differences in values and highlights the importance of developing culturally aware AI systems that better align with universal human values.

[NLP-53] Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

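[Quick read]: This paper addresses how large language models (LLMs) perform on knowledge-intensive tasks and factual-accuracy evaluation in low-resource languages (LRLs). The key to the solution is Uhura, a new benchmark covering six typologically diverse African languages, built by human translation of existing English benchmarks into two datasets: Uhura-ARC-Easy (multiple-choice science questions) and Uhura-TruthfulQA (testing model truthfulness on topics such as health, law, finance, and politics). The paper highlights the challenges of creating benchmarks with highly technical content for LRLs and outlines mitigation strategies. The evaluation reveals a significant performance gap between proprietary and open-source models in LRL settings, and every model performs worse in African languages than in English. These results indicate that language models struggle to answer scientific questions and to avoid generating false claims, particularly in low-resource languages. The paper therefore stresses the need for continued improvement of multilingual language models to ensure safe and reliable use in LRL settings, and it open-sources the Uhura Benchmark and the Uhura Platform to foster NLP research and development for LRLs.

Link: https://arxiv.org/abs/2412.00948
Authors: Edward Bayes, Israel Abebe Azime, Jesujoba O. Alabi, Jonas Kgomo, Tyna Eloundou, Elizabeth Proehl, Kai Chen, Imaan Khadir, Naome A. Etori, Shamsuddeen Hassan Muhammad, Choice Mpanza, Igneciah Pocia Thete, Dietrich Klakow, David Ifeoluwa Adelani
Keywords-EN: Large Language Models, high-resource languages primarily, Large Language, factual accuracy, accuracy often focus
Categories: Computation and Language (cs.CL)
Comments: working paper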

Abstract:Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura – a new benchmark that focuses on two tasks in six typologically-diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models such as GPT-4o and o1-preview, and Claude models, and open-source models like Meta’s LLaMA and Google’s Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.

[NLP-54] VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

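[Quick read]: This paper addresses visual perception errors, that is, errors in understanding visual information in images, which remain a major source of mistakes in Large Vision Language Models (LVLMs). The key to the solution is VisOnlyQA, a new dataset designed to directly evaluate the visual perception of LVLMs on geometric and numerical information in scientific figures. Its design allows visual perception to be analyzed independently of other capabilities such as reasoning. Using an evaluation set of 1,200 multiple-choice questions and 70k synthetic training instances, the study shows that existing LVLMs perform poorly on visual perception tasks and that fine-tuning on synthetic data can improve perception for certain tasks and models, although the overall gain is limited. The paper concludes that improving both training data and model architectures is key to better visual perception in LVLMs.

Link: https://arxiv.org/abs/2412.00947
Authors: Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
Keywords-EN: Large Vision Language, Large Vision, visual perception errors, visual perception, mistakes in Large
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: VisOnlyQA dataset, code, and model responses are provided at this https URL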

Abstract:Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at this https URL.

[NLP-55] QABISAR: Query-Article Bipartite Interactions for Statutory Article Retrieval COLING2025

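[Quick read]: This paper addresses the semantic mismatch problem in statutory article retrieval: when each query-article pair is modeled in isolation, it is hard to learn representations that capture multi-faceted information. The key to the solution is QABISAR, a new framework that uses bipartite interactions between queries and articles to capture their inherent diversity. In addition, knowledge distillation transfers the enriched query representations from the graph network into the query bi-encoder, so that the rich semantics of the graph representations are retained even though graph-based supervision is unavailable for unseen queries at inference time. Experiments on a real-world expert-annotated dataset demonstrate the effectiveness of the method.

Link: https://arxiv.org/abs/2412.00934
Authors: T.Y.S.S. Santosh, Hassan Sarwat, Matthias Grabmair
Keywords-EN: statutory article retrieval, capture multi-faceted information, effectively capture multi-faceted, semantic mismatch problem, pair in isolation
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to COLING 2025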

Abstract:In this paper, we introduce QABISAR, a novel framework for statutory article retrieval, to overcome the semantic mismatch problem when modeling each query-article pair in isolation, making it hard to learn representation that can effectively capture multi-faceted information. QABISAR leverages bipartite interactions between queries and articles to capture diverse aspects inherent in them. Further, we employ knowledge distillation to transfer enriched query representations from the graph network into the query bi-encoder, to capture the rich semantics present in the graph representations, despite absence of graph-based supervision for unseen queries during inference. Our experiments on a real-world expert-annotated dataset demonstrate its effectiveness.

[NLP-56] Lightweight Contenders: Navigating Semi-Supervised Text Mining through Peer Collaboration and Self Transcendence

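[Quick read]: This paper addresses the constraint on model parameters that label scarcity imposes on lightweight models under a semi-supervised learning (SSL) strategy. The key to the solution is PS-NET, a new framework that trains lightweight student models through online distillation, in which they imitate a teacher model, combined with an ensemble of student peers that instruct each other. PS-NET also applies a constant adversarial perturbation schema for self-augmentation through progressive generalization. With these components, PS-NET clearly outperforms existing SOTA lightweight SSL frameworks such as FLiText and DisCo on SSL text classification with extremely scarce labeled data.

Link: https://arxiv.org/abs/2412.00883
Authors: Qianren Mao, Weifeng Jiang, Junnan Liu, Chenghua Lin, Qian Li, Xianqing Wen, Jianxin Li, Jinhu Lu
Keywords-EN: facilitating cost-effective inference, requires reducing annotated, reducing annotated samples, models requires reducing, cost-effective inference
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: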

Abstract:The semi-supervised learning (SSL) strategy in lightweight models requires reducing annotated samples and facilitating cost-effective inference. However, the constraint on model parameters, imposed by the scarcity of training labels, limits the SSL performance. In this paper, we introduce PS-NET, a novel framework tailored for semi-supervised text mining with lightweight models. PS-NET incorporates online distillation to train lightweight student models by imitating the Teacher model. It also integrates an ensemble of student peers that collaboratively instruct each other. Additionally, PS-NET implements a constant adversarial perturbation schema to further self-augmentation by progressive generalizing. Our PS-NET, equipped with a 2-layer distilled BERT, exhibits notable performance enhancements over SOTA lightweight SSL frameworks of FLiText and DisCo in SSL text classification with extremely rare labelled data.
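
The distillation step can be sketched with a standard temperature-scaled KL divergence between teacher and student predictions. This is a generic knowledge-distillation loss, shown only as an assumption about what "imitating the Teacher model" could mean; the abstract does not give PS-NET's actual loss, and the logits and temperature below are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """How far the student's softened prediction is from the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)

# Hypothetical logits for a three-class text classification example.
teacher = [2.0, 0.0, -1.0]
student = [0.0, 0.0, 0.0]
print(distillation_loss(teacher, student))
```

The same function could be applied between pairs of students to sketch peer instruction, since the loss only needs two sets of logits over the same label space.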

[NLP-57] Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

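[Quick read]: This paper addresses the progressive growth of inference computation and memory in Multimodal Large Language Models (MLLMs) as output tokens are generated during decoding. The key to the solution is Dynamic-LLaVA, a dynamic vision-language context sparsification framework that dynamically reduces vision-context redundancy in the prefill stage and lowers the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for each inference mode (prefill, decoding with KV cache, and decoding without KV cache) to achieve efficient MLLM inference. Experiments show that Dynamic-LLaVA cuts computation by about 75% in the prefill stage; over the whole generation process it cuts computation by about 50% when decoding without KV cache and saves about 50% of GPU memory when decoding with KV cache, with negligible degradation, or even gains, in understanding and generation ability.

Link: https://arxiv.org/abs/2412.00876
Authors: Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
Keywords-EN: Multimodal Large Language, Large Language Models, Multimodal Large, achieved remarkable success, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code is available at this https URL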

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, the inference computation and memory increase progressively with the generation of output tokens during decoding, directly affecting the efficacy of MLLMs. Existing methods attempt to reduce the vision context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of the vision context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose a dynamic vision-language context sparsification framework, Dynamic-LLaVA, which dynamically reduces the redundancy of vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill, decoding with and without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by ~75% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by ~50% under decoding without KV cache, while saving ~50% GPU memory overhead when decoding with KV cache, due to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible understanding and generation ability degradation or even performance gains compared to the full-context inference baselines. Code is available at this https URL.
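
Context sparsification in general can be pictured as keeping only the highest-scoring tokens while preserving their original order. The sketch below shows that idea alone. The scores are hypothetical inputs and `sparsify` is an illustrative name; how Dynamic-LLaVA actually scores and selects tokens in each inference mode is not described in the abstract.

```python
def sparsify(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the `keep` highest scores.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore positional order so the kept context still reads left to right.
    return [tokens[i] for i in sorted(top)]

# Hypothetical vision tokens with made-up importance scores.
tokens = list("abcdefgh")
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
print(sparsify(tokens, scores, keep_ratio=0.5))
```

Shrinking the context in this way reduces the number of tokens every later layer has to process, and, when a KV cache is used, the number of cached entries as well.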

[NLP-58] Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting COLING2025

[Quick read]: This paper examines how well generative AI performs at completing proportional analogies. The key to the solution is augmenting prompts with three types of knowledge: exemplar, structured, and targeted. The results show that, despite extensive training, current large language models (LLMs) still find proportional analogies challenging, with the best model reaching an accuracy of only 55%. Providing targeted knowledge helps models complete proportional analogies more effectively than providing exemplars or collections of structured knowledge.

Link: https://arxiv.org/abs/2412.00869
Authors: Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth
Keywords-EN: Making analogies, fundamental to cognition, Making, Large Language Models, Proportional analogies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at COLING 2025

Abstract:Making analogies is fundamental to cognition. Proportional analogies, which consist of four terms, are often used to assess linguistic and cognitive abilities. For instance, completing analogies like “Oxygen is to Gas as blank is to blank” requires identifying the semantic relationship (e.g., “type of”) between the first pair of terms (“Oxygen” and “Gas”) and finding a second pair that shares the same relationship (e.g., “Aluminum” and “Metal”). In this work, we introduce a 15K Multiple-Choice Question Answering (MCQA) dataset for proportional analogy completion and evaluate the performance of contemporary Large Language Models (LLMs) in various knowledge-enhanced prompt settings. Specifically, we augment prompts with three types of knowledge: exemplar, structured, and targeted. Our results show that despite extensive training data, solving proportional analogies remains challenging for current LLMs, with the best model achieving an accuracy of 55%. Notably, we find that providing targeted knowledge can better assist models in completing proportional analogies compared to providing exemplars or collections of structured knowledge.
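
A knowledge-enhanced multiple-choice prompt could be assembled along the lines below. The prompt wording, the option format, and the `build_prompt` helper are assumptions for illustration; the paper's actual templates are not given in the abstract. The example question reuses the "Oxygen is to Gas" analogy from the abstract, with an invented distractor option.

```python
def build_prompt(question, options, knowledge=None):
    """Format an MCQA prompt, optionally prefixed with background knowledge."""
    lines = []
    if knowledge:
        lines.append("Background knowledge: " + knowledge)
    lines.append("Question: " + question)
    # Label the answer options A, B, C, ...
    for label, option in zip("ABCDEFGH", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt(
    "Oxygen is to Gas as blank is to blank",
    ["Aluminum : Metal", "Chair : Sit"],  # second option is a made-up distractor
    knowledge="The first pair is linked by a 'type of' relation.",  # targeted-style hint
)
print(prompt)
```

Swapping the `knowledge` string for worked examples or for a list of structured facts would give rough analogues of the exemplar and structured settings.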

[NLP-59] Quantifying perturbation impacts for large language models NEURIPS2024

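[Quick read]: This paper addresses the problem of quantifying how input perturbations affect the outputs of large language models (LLMs), a fundamental task for model reliability and post-hoc interpretability. The main challenge is separating meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. The key to the solution is Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA uses Monte Carlo sampling to construct empirical null and alternative output distributions in a low-dimensional semantic similarity space, and comparing Monte Carlo estimates in that reduced space enables tractable frequentist inference without restrictive distributional assumptions. The framework is model-agnostic, supports arbitrary input perturbations on any black-box LLM, yields interpretable p-values, supports multiple perturbation testing with controlled error rates, and provides scalar effect sizes for any chosen similarity or distance metric.

Link: https://arxiv.org/abs/2412.00868
Authors: Paulius Rauba, Qiyao Wei, Mihaela van der Schaar
Keywords-EN: large language models, post-hoc interpretability, large language, fundamental task, reliability and post-hoc
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Statistical Foundations of LLMs and Foundation Models Workshop at NeurIPS 2024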

Abstract:We consider the problem of quantifying how an input perturbation impacts the outputs of large language models (LLMs), a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling the meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparisons of Monte Carlo estimates in the reduced dimensionality space enables tractable frequentist inference without relying on restrictive distributional assumptions. The framework is model-agnostic, supports the evaluation of arbitrary input perturbations on any black-box LLM, yields interpretable p-values, supports multiple perturbation testing via controlled error rates, and provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts, showing its versatility for perturbation analysis.
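
One way to picture the hypothesis-testing framing is a permutation test on similarity scores. In the sketch below, the null scores stand for similarities among repeated outputs of the unperturbed prompt and the alternative scores for similarities between unperturbed and perturbed outputs. That reading, the permutation test itself, and every number are assumptions for illustration; the abstract does not state which test statistic or resampling scheme DBPA uses.

```python
import random

def permutation_p_value(null_scores, alt_scores, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of mean similarity."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(null_scores) - mean(alt_scores))
    pooled = list(null_scores) + list(alt_scores)
    split = len(null_scores)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        if abs(mean(pooled[:split]) - mean(pooled[split:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical cosine similarities; not real model outputs.
null_scores = [0.92, 0.90, 0.95, 0.91, 0.93, 0.94, 0.89, 0.92]
alt_scores = [0.41, 0.38, 0.45, 0.40, 0.36, 0.44, 0.39, 0.42]
print(permutation_p_value(null_scores, alt_scores))
```

A small p-value here would suggest that the perturbation shifts outputs by more than sampling noise alone, and the difference in means could serve as a simple scalar effect size.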

[NLP-60] K-UD: Revising Korean Universal Dependencies Guidelines

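[Quick read]: This paper addresses the criticism of how syntactic relations are defined in the existing linguistic annotation framework for Korean Universal Dependencies (UDs). The key to the solution is establishing a linguistic consensus model that refines the definition of syntactic dependency in Korean UDs, with the aim of reaching consensus within UDs and also gaining agreement beyond the UD framework on analyzing Korean sentences with dependency structure.

Link: https://arxiv.org/abs/2412.00856
Authors: Kyuwon Kim, Yige Chen, Eunkyul Leah Jo, KyungTae Lim, Jungyeul Park, Chulwoo Park
Keywords-EN: Korean Universal Dependencies, Universal Dependencies, Critique has surfaced, existing linguistic annotation, Korean Universal
Categories: Computation and Language (cs.CL)
Comments: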

Abstract:Critique has surfaced concerning the existing linguistic annotation framework for Korean Universal Dependencies (UDs), particularly in relation to syntactic relationships. In this paper, our primary objective is to refine the definition of syntactic dependency of UDs within the context of analyzing the Korean language. Our aim is not only to achieve a consensus within UDs but also to garner agreement beyond the UD framework for analyzing Korean sentences using dependency structure, by establishing a linguistic consensus model.

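[NLP-61] Does chat change LLM's mind? Impact of Conversation on Psychological States of LLMs

[Quick read]: This paper asks whether the psychological states of large language models (LLMs) change during conversations in multi-agent systems, and how those changes depend on conversation depth, topic, and speaker. The key to the solution is an experimental study of 10 LLMs in open-domain conversations, using 14 questionnaires and a topic-analysis method to assess changes across four aspects: personality, interpersonal relationships, motivation, and emotion. The results reveal distinct psychological trends driven by conversation depth and topic, with significant variation across LLM families and parameter sizes.

Link: https://arxiv.org/abs/2412.00804
Authors: Junhyuk Choi, Yeseon Hong, Minju Kim, Bugeun Kim
Keywords-EN: large language models, language models, enabled more authentic, recent growth, growth of large
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Under review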

Abstract:The recent growth of large language models (LLMs) has enabled more authentic, human-centered interactions through multi-agent systems. However, investigation into how conversations affect the psychological states of LLMs is limited, despite the impact of these states on the usability of LLM-based systems. In this study, we explored whether psychological states change during multi-agent interactions, focusing on the effects of conversation depth, topic, and speaker. We experimentally investigated the behavior of 10 LLMs in open-domain conversations. We employed 14 questionnaires and a topic-analysis method to examine the behavior of LLMs across four aspects: personality, interpersonal relationships, motivation, and emotion. The results revealed distinct psychological trends influenced by conversation depth and topic, with significant variations observed between different LLM families and parameter sizes.

[NLP-62] Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting

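[Quick read]: This paper addresses source-free cross-domain few-shot learning (CD-FSL), in which a model is transferred to a target domain from very few samples without relying on source-domain data. The key to the solution is using the textual modality to increase training-sample diversity: diversity prompts and diversity descriptions increase the diversity of input samples, and deep prompt tuning improves the model's transfer capability. Concretely, the paper proposes the SeGD-VPT framework, which has two phases: the first increases feature diversity through diversity prompts, and the second trains classifiers on the generated enhanced features. Experiments show that the method achieves the best performance in the source-free CD-FSL setting and is comparable to SOTA models that use source-domain data.

Link: https://arxiv.org/abs/2412.00767
Authors: Linhai Zhuo, Zheng Wang, Yuqian Fu, Tianwen Qian
Keywords-EN: source domain data, target domains utilizing, domains utilizing minimal, utilizing minimal samples, domain data
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: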

Abstract:The source-free cross-domain few-shot learning (CD-FSL) task aims to transfer pretrained models to target domains utilizing minimal samples, eliminating the need for source domain data. Addressing this issue requires models to have robust generalization abilities and strong feature representation, aligning with the characteristics of large-scale pretrained models. However, large-scale models tend to lose representational ability in cross-domain scenarios due to limited sample diversity. Given the abundant diversity provided by semantic modality, this paper leverages textual modality to enhance training sample diversity with the CLIP model, meanwhile improving model transfer efficiency. Specifically, we propose the SeGD-VPT framework, which is divided into two phases. The first step aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varying input and enhancing sample diversity. Furthermore, we use diversity descriptions of classes to guide semantically meaningful learning of diversity prompts, proposing random combinations and selections of texts to increase textual diversity. Additionally, deep prompt tuning is introduced to enhance the model’s transfer capability. After training of the first step, support samples with different diversity prompts are input into the CLIP backbone to generate enhanced features. After generation, the second phase trains classifiers using the generated features. Extensive experimental results across several benchmarks verify our method is comparable to SOTA source-utilized models and attains the best performance under the source-free CD-FSL setting.

[NLP-63] SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

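[Quick read]: This paper addresses the high cost and limited cross-domain coverage of traditional robustness evaluation for large language models (LLMs), which relies on standardized benchmarks. The key to the solution is an autonomous evaluation framework that combines refined adversarial prompts with domain-constrained knowledge graphs: descriptive sentences are generated from knowledge graph triplets and turned into adversarial prompts. The prompts are generated by the LLM itself, tailored to evaluate its own robustness, and passed through rigorous filtering and refinement to ensure textual fluency and semantic fidelity. This self-evaluation mechanism lets an LLM assess its robustness without external benchmarks, and extensive tests on models such as ChatGPT, Llama-3.1, Phi-3, and Mistral confirm its effectiveness.

Link: https://arxiv.org/abs/2412.00765
Authors: Aihua Pei, Zehua Yang, Shunan Zhu, Ruoxi Cheng, Ju Jia
Keywords-EN: large language models, Traditional methods, large language, rely on standardized, escalate costs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: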

Abstract:Traditional methods for evaluating the robustness of large language models (LLMs) often rely on standardized benchmarks, which can escalate costs and limit evaluations across varied domains. This paper introduces a novel framework designed to autonomously evaluate the robustness of LLMs by incorporating refined adversarial prompts and domain-constrained knowledge guidelines in the form of knowledge graphs. Our method systematically generates descriptive sentences from domain-constrained knowledge graph triplets to formulate adversarial prompts, enhancing the relevance and challenge of the evaluation. These prompts, generated by the LLM itself and tailored to evaluate its own robustness, undergo a rigorous filtering and refinement process, ensuring that only those with high textual fluency and semantic fidelity are used. This self-evaluation mechanism allows the LLM to evaluate its robustness without the need for external benchmarks. We assess the effectiveness of our framework through extensive testing on both proprietary models like ChatGPT and open-source models such as Llama-3.1, Phi-3, and Mistral. Results confirm that our approach not only reduces dependency on conventional data but also provides a targeted and efficient means of evaluating LLM robustness in constrained domains.
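
The first step, turning knowledge graph triplets into descriptive sentences and keeping only those that pass a quality filter, can be sketched as follows. The template, the example triplets, and the word-count scoring function are stand-ins chosen for illustration; in the paper the sentences and adversarial prompts are generated by the LLM itself, and the filter is described as checking textual fluency and semantic fidelity.

```python
def triplet_to_sentence(subject, relation, obj):
    """Render a (subject, relation, object) triplet as a plain sentence."""
    return f"{subject} {relation.replace('_', ' ')} {obj}."

def filter_prompts(sentences, score_fn, threshold):
    """Keep only sentences whose quality score reaches the threshold."""
    return [s for s in sentences if score_fn(s) >= threshold]

# Hypothetical domain-constrained triplets.
triplets = [
    ("aspirin", "is_used_to_treat", "headache"),
    ("insulin", "regulates", "blood sugar"),
]
sentences = [triplet_to_sentence(*t) for t in triplets]

# Stand-in score: word count. A real pipeline would score fluency and fidelity.
kept = filter_prompts(sentences, score_fn=lambda s: len(s.split()), threshold=5)
print(kept)
```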

[NLP-64] PGSO: Prompt-based Generative Sequence Optimization Network for Aspect-based Sentiment Analysis

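[Quick read]: This paper addresses the difficulty that generative pre-trained models have with complicated long texts in Aspect-based Sentiment Analysis (ABSA), where implicit long-distance relations such as aspect-opinion relations are hard to extract under the position embedding mechanism. The key to the solution is two sequence optimization strategies: rule-based static optimization and score-based dynamic optimization. The former reorders the context according to a handcrafted priority of dependency relations; the latter dynamically regulates the contextual sequence by computing word position scores with a neural network. Building on the dynamic optimization structure, the paper proposes a unified Prompt-based Generative Sequence Optimization network (PGSO), whose two components, prompt construction and a sequence regulator, jointly optimize the training target and the generative model. PGSO improves the F1 score on ABSA tasks by 3.52% on average.

Link: https://arxiv.org/abs/2412.00763
Authors: Hao Dong, Wei Wei
Keywords-EN: Aspect-based Sentiment Analysis, Sentiment Analysis, Aspect-based Sentiment, demonstrated remarkable results, generative pre-training based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: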

Abstract:Recently, generative pre-training based models have demonstrated remarkable results on Aspect-based Sentiment Analysis (ABSA) task. However, previous works overemphasize crafting various templates to paraphrase training targets for enhanced decoding, ignoring the internal optimizations on generative models. Despite notable results achieved by these target-oriented optimization methods, they struggle with the complicated long texts since the implicit long-distance relation, e.g., aspect-opinion relation, is difficult to extract under the position embedding mechanism in generative models. Thus, in this paper, we first clarify the causes of the problem and introduce two sequence optimization strategies: the rule-based static optimization and the score-based dynamic optimization. The rule-based approach relies on handcraft priority of dependency relation to reorder the context, while the score-based algorithm dynamically regulates the contextual sequence by calculating word position scores using neural network. Based on the dynamic optimization structure, we further propose a unified Prompt-based Generative Sequence Optimization network (named PGSO), which jointly optimizes the training target as well as the generative model. Specifically, PGSO contains two components, namely, prompt construction and sequence regulator. The former constructs a task-specific prompt based on unsupervised training objects to fully utilize the pre-trained model. The latter jointly leverages semantic, syntactic and original-sequence information to dynamically regulate contextual sequence. Our experiments conducted on four ABSA tasks across multiple benchmarks indicate that PGSO outperforms state-of-the-art methods, with an average improvement of 3.52% in F1 score.
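
The rule-based static strategy, reordering the context by a handcrafted priority over dependency relations, can be illustrated with a stable sort. The priority table, the dependency labels, and the example sentence are invented for this sketch; the abstract does not list the paper's actual priorities.

```python
# Hypothetical priority table: a lower number moves a word earlier in the sequence.
PRIORITY = {"nsubj": 0, "amod": 1, "dobj": 2}

def reorder(tagged_tokens):
    """Reorder (word, dependency_label) pairs by relation priority.

    Python's sort is stable, so words that share a priority keep their original order.
    """
    fallback = len(PRIORITY)  # unlisted relations go last
    ranked = sorted(tagged_tokens, key=lambda pair: PRIORITY.get(pair[1], fallback))
    return [word for word, _ in ranked]

sentence = [("The", "det"), ("pizza", "nsubj"), ("was", "cop"), ("delicious", "amod")]
print(reorder(sentence))
```

In this toy case the aspect word and its opinion word end up adjacent at the front of the sequence, which is the kind of shortening of long-distance relations that the reordering aims for.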

[NLP-65] Multi-View Incongruity Learning for Multimodal Sarcasm Detection COLING2025

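[Quick read]: This paper addresses the reliance on spurious correlations in Multimodal Sarcasm Detection (MSD). Existing methods often mistakenly prioritize non-essential features, so they generalize poorly beyond their training environments. The key to the solution is a new method that integrates Multimodal Incongruities via Contrastive Learning (MICL). Specifically, MICL uses incongruity to drive multi-view learning from three views (token-patch, entity-object, and sentiment) and applies extensive data augmentation to mitigate biased learning of the textual modality. The paper also constructs SPMSD, a test set containing potential spurious correlations, to evaluate generalizability. Experiments show that MICL performs strongly on benchmark datasets and effectively mitigates the effect of spurious correlations.

Link: https://arxiv.org/abs/2412.00756
Authors: Diandian Guo, Cong Cao, Fangfang Yuan, Yanbing Liu, Guangjie Zeng, Xiaoyan Yu, Hao Peng, Philip S. Yu
Keywords-EN: Multimodal sarcasm detection, downstream tasks, Existing MSD methods, MSD methods tend, Existing MSD
Categories: Computation and Language (cs.CL)
Comments: Accepted to COLING 2025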

Abstract:Multimodal sarcasm detection (MSD) is essential for various downstream tasks. Existing MSD methods tend to rely on spurious correlations. These methods often mistakenly prioritize non-essential features yet still make correct predictions, demonstrating poor generalizability beyond training environments. Regarding this phenomenon, this paper undertakes several initiatives. Firstly, we identify two primary causes that lead to the reliance on spurious correlations. Secondly, we address these challenges by proposing a novel method that integrates Multimodal Incongruities via Contrastive Learning (MICL) for multimodal sarcasm detection. Specifically, we first leverage incongruity to drive multi-view learning from three views: token-patch, entity-object, and sentiment. Then, we introduce extensive data augmentation to mitigate the biased learning of the textual modality. Additionally, we construct a test set, SPMSD, which consists of potential spurious correlations to evaluate the model’s generalizability. Experimental results demonstrate the superiority of MICL on benchmark datasets, along with the analyses showcasing MICL’s advancement in mitigating the effect of spurious correlation.
zh

[NLP-66] Towards Adaptive Mechanism Activation in Language Agent COLING2025

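[Quick read]: This paper addresses the problem that current language agents rely on fixed mechanisms, or on mechanisms activated in a predefined order, which limits their adaptation to the varied solution structures of different tasks. The key to the solution is Adaptive Language Agent Mechanism Activation Learning with Self-Exploration (ALAMA). The method builds a unified agent framework (UniAct) that unifies different mechanisms via Actions, then uses a training-efficient optimization method based on self-exploration so that UniAct adaptively activates the appropriate mechanisms according to the potential characteristics of the task. Experiments show significant improvements on downstream agent tasks, supporting the approach's effectiveness in enabling more dynamic and context-sensitive mechanism activation.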

Link: https://arxiv.org/abs/2412.00722
Authors: Ziyang Huang,Jun Zhao,Kang Liu
Keywords: autonomous task accomplishment, Language Agent, Language, task accomplishment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: COLING2025

Abstract: Language agents can be endowed with different mechanisms for autonomous task accomplishment. Current agents typically rely on fixed mechanisms or a set of mechanisms activated in a predefined order, limiting their adaptation to varied potential task solution structures. To this end, this paper proposes Adaptive Language Agent Mechanism Activation Learning with Self-Exploration (ALAMA), which focuses on optimizing mechanism activation adaptability without reliance on expert models. Initially, it builds a harmonized agent framework (UniAct) to Unify different mechanisms via Actions. Then it leverages a training-efficient optimization method based on self-exploration to enable UniAct to adaptively activate the appropriate mechanisms according to the potential characteristics of the task. Experimental results demonstrate significant improvements in downstream agent tasks, affirming the effectiveness of our approach in facilitating more dynamic and context-sensitive mechanism activation.

[NLP-67] A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario

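[Quick read]: This paper addresses automatic speech recognition (ASR) in low-resource settings and in Mandarin-English code-switching scenarios. The key to the solution is to explore the potential of large language models (LLMs) for low-resource ASR and to compare their performance with the Whisper model. The results show that LLM-based ASR achieves a 12.8% relative gain over Whisper in low-resource ASR, while Whisper performs better in Mandarin-English code-switching ASR.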

Link: https://arxiv.org/abs/2412.00721
Authors: Zheshu Song,Ziyang Ma,Yifan Yang,Jianheng Zhuo,Xie Chen
Keywords: Large Language Models, diverse NLP tasks, Large Language, Automatic Speech Recognition, NLP tasks
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 4 pages

Abstract: Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoders is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in low resource settings remains underexplored. Hence, in this work, we aim to explore the capability of LLMs in low resource ASR and Mandarin-English code switching ASR. We also evaluate and compare the recognition performance of LLM-based ASR systems against the Whisper model. Extensive experiments demonstrate that LLM-based ASR yields a relative gain of 12.8% over the Whisper model in low resource ASR, while Whisper performs better in Mandarin-English code switching ASR. We hope that this study could shed light on ASR for low resource scenarios.

[NLP-68] Multi-Agent Collaboration in Incident Response with Large Language Models

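[Quick read]: This paper addresses how multi-agent collaboration can improve decision efficiency and response effectiveness in cybersecurity incident response (IR). The key to the solution is to use large language models (LLMs) as intelligent agents and to simulate collaboration under different team structures (centralized, decentralized, and hybrid) within the Backdoors & Breaches framework. Analyzing agent interactions and performance across these setups yields insights for optimizing multi-agent collaboration and for improving decision-making, adaptability, and process efficiency in IR.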

Link: https://arxiv.org/abs/2412.00652
Authors: Zefang Liu
Keywords: address cyberattacks effectively, requiring rapid decision-making, requiring rapid, cyberattacks effectively, critical aspect
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract: Incident response (IR) is a critical aspect of cybersecurity, requiring rapid decision-making and coordinated efforts to address cyberattacks effectively. Leveraging large language models (LLMs) as intelligent agents offers a novel approach to enhancing collaboration and efficiency in IR scenarios. This paper explores the application of LLM-based multi-agent collaboration using the Backdoors & Breaches framework, a tabletop game designed for cybersecurity training. We simulate real-world IR dynamics through various team structures, including centralized, decentralized, and hybrid configurations. By analyzing agent interactions and performance across these setups, we provide insights into optimizing multi-agent collaboration for incident response. Our findings highlight the potential of LLMs to enhance decision-making, improve adaptability, and streamline IR processes, paving the way for more effective and coordinated responses to cyber threats.

[NLP-69] ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

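[Quick read]: This paper addresses data selection for task-specific instruction tuning: how to choose training data that minimizes instruction tuning loss and improves performance on the target task. Conventional methods mainly rely on hand-crafted similarity metrics to select training data aligned with the test data distribution. However, instruction tuning loss (i.e., cross-entropy loss) does not always have a monotonic relationship with actual task performance, which undermines existing data selection methods. The proposed solution is ROSE (Reward-Oriented inStruction data sElection), which uses a pairwise preference loss as a reward signal and adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set, so that the most task-related training points are selected. Experiments show that selecting just 5% of the training data with ROSE achieves results competitive with fine-tuning on the full dataset and surpasses other state-of-the-art data selection methods.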

Link: https://arxiv.org/abs/2412.00631
Authors: Yang Wu,Huayi Zhang,Yizheng Jiao,Lin Ma,Xiaozhong Liu,Jinhong Yu,Dongyu Zhang,Dezhi Yu,Wei Xu
Keywords: task-specific instruction tuning, Instruction tuning, task-specific instruction, instruction tuning loss, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human-controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on the crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a novel Reward-Oriented inStruction data sElection method which leverages pairwise preference loss as a reward signal to optimize data selection for task-specific instruction tuning. Specifically, ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points. Experimental results show that by selecting just 5% of the training data using ROSE, our approach can achieve competitive results compared to fine-tuning with the full training dataset, and it surpasses other state-of-the-art data selection methods for task-specific instruction tuning. Our qualitative analysis further confirms the robust generalizability of our method across multiple benchmark datasets and diverse model architectures.
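As a rough illustration of influence-based data selection in general, the sketch below scores each training example by the dot product between its gradient and the gradient of a validation loss, then keeps the top fraction. This is a generic first-order approximation assumed here for illustration, not the ROSE formulation itself; the function names, the toy gradient vectors, and the use of plain Python lists are all hypothetical.

```python
def influence_scores(train_grads, val_grad):
    """First-order influence: dot product of each training gradient with the
    validation-loss gradient. A larger value means a gradient step on that
    example is expected to reduce the validation loss more."""
    return [sum(g * v for g, v in zip(grad, val_grad)) for grad in train_grads]


def select_top_fraction(train_grads, val_grad, fraction=0.05):
    """Return indices of the highest-scoring training examples."""
    scores = influence_scores(train_grads, val_grad)
    k = max(1, int(len(scores) * fraction))  # always keep at least one example
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy gradients (hypothetical): four training examples, one validation gradient.
toy_train_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
toy_val_grad = [1.0, 0.0]
selected = select_top_fraction(toy_train_grads, toy_val_grad, fraction=0.5)
print(selected)
```

In the setting the abstract describes, the validation gradient would come from a pairwise preference loss on a few-shot validation set rather than from a cross-entropy loss.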

[NLP-70] DynRank: Improving Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification COLING2025

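[Quick read]: This paper addresses the adaptability of passage retrieval in open-domain question-answering systems. Traditional approaches rely on static prompts and pre-defined templates, which may limit model adaptability across different questions and contexts. The key to the solution is a dynamic prompting mechanism: a pre-trained question classification model categorizes questions into fine-grained types, and contextually relevant prompts are generated from these classifications to make passage retrieval more effective.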

Link: https://arxiv.org/abs/2412.00600
Authors: Abdelrahman Abdallah,Jamshid Mozafari,Bhawna Piryani,Mohammed M.Abdelgwad,Adam Jatowt
Keywords: open-domain question-answering systems, paper presents DynRank, paper presents, open-domain question-answering, question-answering systems
Categories: Computation and Language (cs.CL)
Comments: Accepted at COLING 2025

Abstract:This paper presents DynRank, a novel framework for enhancing passage retrieval in open-domain question-answering systems through dynamic zero-shot question classification. Traditional approaches rely on static prompts and pre-defined templates, which may limit model adaptability across different questions and contexts. In contrast, DynRank introduces a dynamic prompting mechanism, leveraging a pre-trained question classification model that categorizes questions into fine-grained types. Based on these classifications, contextually relevant prompts are generated, enabling more effective passage retrieval. We integrate DynRank into existing retrieval frameworks and conduct extensive experiments on multiple QA benchmark datasets.

[NLP-71] Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment

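[Quick read]: This paper addresses the use of large language models (LLMs) on medical tasks in non-English settings, specifically their performance on Polish medical exams. The key to the solution is a new benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES). It contains over 24,000 exam questions, part of which form a parallel Polish-English corpus. Using this dataset, the paper systematically evaluates a range of LLMs, including general-purpose, domain-specific, and Polish-specific models, and compares them with human medical students. The study finds that although some models such as GPT-4o approach human performance, significant challenges remain in cross-lingual translation and domain-specific understanding, which highlights the limitations and ethical considerations of deploying LLMs in clinical practice.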

Link: https://arxiv.org/abs/2412.00559
Authors: Łukasz Grzybowski,Jakub Pokrywka,Michał Ciesiółka,Jeremi I. Kaczmarek,Marek Kubis
Keywords: handling specialized tasks, Large Language Models, demonstrated significant potential, Large Language, specialized tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated significant potential in handling specialized tasks, including medical problem-solving. However, most studies predominantly focus on English-language contexts. This study introduces a novel benchmark dataset based on Polish medical licensing and specialization exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing doctors pursuing specialization. The dataset was web-scraped from publicly available resources provided by the Medical Examination Center and the Chief Medical Chamber. It comprises over 24,000 exam questions, including a subset of parallel Polish-English corpora, where the English portion was professionally translated by the examination center for foreign candidates. By creating a structured benchmark from these existing exam questions, we systematically evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance against human medical students. Our analysis reveals that while models like GPT-4o achieve near-human performance, significant challenges persist in cross-lingual translation and domain-specific understanding. These findings underscore disparities in model performance across languages and medical specialties, highlighting the limitations and ethical considerations of deploying LLMs in clinical practice.

[NLP-72] Unveiling Performance Challenges of Large Language Models in Low-Resource Healthcare: A Demographic Fairness Perspective COLING2025

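[Quick read]: This paper addresses the performance of large language models (LLMs) on real-world healthcare tasks, with a focus on demographic fairness. The key to the solution is an evaluation of state-of-the-art LLMs under three prevalent learning frameworks across six diverse healthcare tasks, which reveals significant limitations in practical use and persistent fairness issues. The paper reports that explicitly providing demographic information yields mixed results, while the ability of LLMs to infer such details raises concerns about biased health predictions. In addition, using LLMs as autonomous agents with access to up-to-date guidelines does not guarantee better performance. These findings expose critical limitations of LLMs in healthcare fairness and underline the urgent need for specialized research in this area.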

Link: https://arxiv.org/abs/2412.00554
Authors: Yue Zhou,Barbara Di Eugenio,Lu Cheng
Keywords: large language models, real-world healthcare tasks, solving real-world healthcare, healthcare tasks, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to the main conference of COLING 2025

Abstract: This paper studies the performance of large language models (LLMs), particularly regarding demographic fairness, in solving real-world healthcare tasks. We evaluate state-of-the-art LLMs with three prevalent learning frameworks across six diverse healthcare tasks and find significant challenges in applying LLMs to real-world healthcare tasks and persistent fairness issues across demographic groups. We also find that explicitly providing demographic information yields mixed results, while LLMs' ability to infer such details raises concerns about biased health predictions. Utilizing LLMs as autonomous agents with access to up-to-date guidelines does not guarantee performance improvement. We believe these findings reveal the critical limitations of LLMs in healthcare fairness and the urgent need for specialized research in this area.

[NLP-73] SeQwen at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains COLING2025

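[Quick read]: This paper addresses misinformation detection in the financial domain. The key to the solution is to combine large language models (LLMs) such as Qwen, Mistral, and Gemma-2 with pre-processing and sequential learning, so that the system not only identifies fraudulent financial content but also generates coherent and concise explanations of the rationale behind each classification. The approach achieves an F1-score of 0.8283 on classification and a ROUGE-1 score of 0.7253 on explanation generation, showing the potential of LLMs for combating misinformation and improving transparency in financial applications.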

Link: https://arxiv.org/abs/2412.00549
Authors: Jebish Purbey,Siddhant Gupta,Nikhil Manali,Siddartha Pullakhandam,Drishti Sharma,Ashay Srivastava,Ram Mohan Rao Kadiyala
Keywords: FMD challenge, paper presents, presents the system, system description, fraudulent financial content
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 6 pages, 9 figures, Submitted to FinNLP-FNP-LLMFinLegal @ COLING 2025

Abstract: This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning for not only identifying fraudulent financial content but also generating coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification, and ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.
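For readers who want a feel for the explanation metric, here is a minimal sketch of a unigram-overlap ROUGE-1 score. It assumes the F1 variant with lowercasing and whitespace tokenization; the abstract does not say which variant or tokenizer the challenge used, and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    with overlap counts clipped to the reference counts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated explanation versus a hypothetical reference.
score = rouge1_f1("the claim is false", "the claim is misleading")
print(round(score, 4))
```

Three of the four unigrams match in each direction here, so precision and recall are both 0.75.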

[NLP-74] Evaluating the Consistency of LLM Evaluators COLING2025

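[Quick read]: This paper addresses the reliability of large language models (LLMs) as evaluators, in particular the consistency of their evaluations. The key to the solution is an in-depth study of two kinds of consistency, Self-Consistency (SC) and Inter-scale Consistency (IC), comparing open-source and proprietary models across different scoring scales and levels of criterion granularity. The results show that even strong proprietary models can fall short on consistency, which underlines the importance of considering consistency when assessing the capability of LLM evaluators.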

Link: https://arxiv.org/abs/2412.00543
Authors: Noah Lee,Jiwoo Hong,James Thorne
Keywords: Large language models, Large language, speed and cost, shown potential, potential as general
Categories: Computation and Language (cs.CL)
Comments: Accepted to COLING 2025

Abstract:Large language models (LLMs) have shown potential as general evaluators along with the evident benefits of speed and cost. While their correlation against human annotators has been widely studied, consistency as evaluators is still understudied, raising concerns about the reliability of LLM evaluators. In this paper, we conduct extensive studies on the two aspects of consistency in LLM evaluations, Self-Consistency (SC) and Inter-scale Consistency (IC), on different scoring scales and criterion granularity with open-source and proprietary models. Our comprehensive analysis demonstrates that strong proprietary models are not necessarily consistent evaluators, highlighting the importance of considering consistency in assessing the capability of LLM evaluators.

[NLP-75] TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences

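[Quick read]: This paper addresses how to evaluate large language models (LLMs) and transformers in the social sciences in a comprehensive, fair, and dynamic way, particularly on text classification tasks. The key to the solution is TextClass Benchmark, an ongoing benchmarking project that reports performance metrics and relative rankings across domains and languages using a tailored Elo rating system. Each benchmark cycle updates the ratings and adds new models, and fixed test sets can be swapped for unseen, equivalent data to test generalization. A Meta-Elo rating combines and weights the domain-specific ratings to give a cross-task assessment. Because the project is continuous, it can keep pace with new models and technical advances, keeping the evaluation current.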

Link: https://arxiv.org/abs/2412.00539
Authors: Bastián González-Bustamante
Keywords: TextClass Benchmark project, TextClass Benchmark, Elo rating system, provide a comprehensive, Benchmark project
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Working paper: 6 pages, 2 figures

Abstract: The TextClass Benchmark project is an ongoing, continuous benchmarking process that aims to provide a comprehensive, fair, and dynamic evaluation of LLMs and transformers for text classification tasks. This evaluation spans various domains and languages in social sciences disciplines engaged in NLP and text-as-data approaches. The leaderboards present performance metrics and relative ranking using a tailored Elo rating system. With each leaderboard cycle, novel models are added, fixed test sets can be replaced with unseen, equivalent data to test generalisation power, ratings are updated, and a Meta-Elo leaderboard combines and weights domain-specific leaderboards. This article presents the rationale and motivation behind the project, explains the Elo rating system in detail, and estimates Meta-Elo across different classification tasks in social science disciplines. We also present a snapshot of the first cycle of classification tasks on incivility data in Chinese, English, German and Russian. This ongoing benchmarking process includes not only additional languages such as Arabic, Hindi, and Spanish but also a classification of policy agenda topics and misinformation, among others.
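For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update for one pairwise comparison. It is not the tailored variant the paper describes: the K-factor of 32, the 400-point scale, the starting rating of 1500, and the function names are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A outperforms B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; model A scores higher on one task.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The update is zero-sum, so the points one model gains are exactly the points the other loses, which keeps the average rating of the pool fixed as cycles accumulate.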

[NLP-76] ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

Link: https://arxiv.org/abs/2412.00532
Authors: Ali Shiraee Kasmaee,Mohammad Khodadad,Mohammad Arshi Saloot,Nick Sherck,Stephen Dokas,Hamidreza Mahyar,Soheila Samiee
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-77] Forma mentis networks predict creativity ratings of short texts via interpretable artificial intelligence in human and GPT-simulated raters

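[Quick read]: This paper addresses how to assess and explain human creativity accurately, in particular by comparing how humans and GPT-3.5 rate the creativity of stories. The key to the solution is to use textual forma mentis networks (TFMN) to extract semantic and emotional features from stories, then combine Explainable Artificial Intelligence (XAI) with XGBoost models to analyze how these features affect creativity ratings. The study finds that GPT-3.5's ratings differ significantly from human ratings in their feature patterns, both when rating human stories and when rating stories GPT-3.5 generated itself. Network features are more predictive of human ratings, while emotional features play a larger role when GPT-3.5 rates its own stories. These results underline GPT-3.5's limitations in capturing the complexity of human creativity and suggest caution when using it to assess or generate creative content.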

Link: https://arxiv.org/abs/2412.00530
Authors: Edith Haim,Natalie Fischer,Salvatore Citraro,Giulio Rossetti,Massimo Stella
Keywords: Explainable Artificial Intelligence, human, ratings, stories, human stories
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Creativity is a fundamental skill of human cognition. We use textual forma mentis networks (TFMN) to extract network (semantic/syntactic associations) and emotional features from approximately one thousand human- and GPT3.5-generated stories. Using Explainable Artificial Intelligence (XAI), we test whether features relative to Mednick’s associative theory of creativity can explain creativity ratings assigned by humans and GPT-3.5. Using XGBoost, we examine three scenarios: (i) human ratings of human stories, (ii) GPT-3.5 ratings of human stories, and (iii) GPT-3.5 ratings of GPT-generated stories. Our findings reveal that GPT-3.5 ratings differ significantly from human ratings not only in terms of correlations but also because of feature patterns identified with XAI methods. GPT-3.5 favours ‘its own’ stories and rates human stories differently from humans. Feature importance analysis with SHAP scores shows that: (i) network features are more predictive for human creativity ratings but also for GPT-3.5’s ratings of human stories; (ii) emotional features played a greater role than semantic/syntactic network structure in GPT-3.5 rating its own stories. These quantitative results underscore key limitations in GPT-3.5’s ability to align with human assessments of creativity. We emphasise the need for caution when using GPT-3.5 to assess and generate creative content, as it does not yet capture the nuanced complexity that characterises human creativity.

[NLP-78] GloCOM: A Short Text Neural Topic Model via Global Clustering Context

Link: https://arxiv.org/abs/2412.00525
Authors: Quang Duc Nguyen,Tung Nguyen,Duc Anh Nguyen,Linh Ngo Van,Sang Dinh,Thien Huu Nguyen
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-79] Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Link: https://arxiv.org/abs/2412.00493
Authors: Duo Zheng,Shijia Huang,Liwei Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 14 pages, 4 figures

[NLP-80] Node Importance Estimation Leveraging LLMs for Semantic Augmentation in Knowledge Graphs

Link: https://arxiv.org/abs/2412.00478
Authors: Xinyu Lin,Tianyu Zhang,Chengbin Hou,Jinbao Wang,Jianye Xue,Hairong Lv
Keywords:
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures

[NLP-81] Non-native speakers of English or ChatGPT: Who thinks better?

Link: https://arxiv.org/abs/2412.00457
Authors: Mohammed Q. Shormani
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures

[NLP-82] Few-Shot Domain Adaptation for Named-Entity Recognition via Joint Constrained k-Means and Subspace Selection COLING2025

Link: https://arxiv.org/abs/2412.00426
Authors: Ayoub Hammal,Benno Uthayasooriyar,Caio Corro
Keywords:
Categories: Computation and Language (cs.CL)
Comments: COLING 2025

[NLP-83] Was that Sarcasm?: A Literature Survey on Sarcasm Detection

Link: https://arxiv.org/abs/2412.00425
Authors: Harleen Kaur Bagga,Jasmine Bernard,Sahil Shaheen,Sarthak Arora
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-84] Does Self-Attention Need Separate Weights in Transformers?

Link: https://arxiv.org/abs/2412.00359
Authors: Md Kowsher,Nusrat Jahan Prottasha,Chun-Nam Yu
Keywords:
Categories: Computation and Language (cs.CL)
Comments: Preprint paper

[NLP-85] Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection COLING2025

Link: https://arxiv.org/abs/2412.00353
Authors: Shanu Kumar,Saish Mendke,Karody Lubna Abdul Rahman,Santosh Kurasa,Parag Agrawal,Sandipan Dandapat
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in COLING 2025

[NLP-86] Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments

Link: https://arxiv.org/abs/2412.00323
Authors: Yasuaki Sumita,Koh Takeuchi,Hisashi Kashima
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The extended abstract of this paper is presented at the 40th ACM/SIGAPP Symposium on Applied Computing (SAC 2025)

[NLP-87] Clinical Document Corpora and Assorted Domain Proxies: A Survey of Diversity in Corpus Design with Focus on German Text Data

Link: https://arxiv.org/abs/2412.00230
Authors: Udo Hahn
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-88] NüshuRescue: Revitalization of the endangered Nüshu Language with AI COLING2025

Link: https://arxiv.org/abs/2412.00218
Authors: Ivory Yang,Weicheng Ma,Soroush Vosoughi
Keywords:
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to COLING 2025

[NLP-89] Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction

Link: https://arxiv.org/abs/2412.00208
Authors: Xinmeng Hou,Lingyue Fu,Chenhao Meng,Hai Hu
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-90] To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models

Link: https://arxiv.org/abs/2412.00166
Authors: Fouad Trad,Ali Chehab
Keywords:
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in 4th International Conference on Intelligent Systems and Pattern Recognition (ISPR24)

[NLP-91] Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Link: https://arxiv.org/abs/2412.00142
Authors: Chancharik Mitra,Brandon Huang,Tianning Chai,Zhiqiu Lin,Assaf Arbelle,Rogerio Feris,Leonid Karlinsky,Trevor Darrell,Deva Ramanan,Roei Herzig
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-92] Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Link: https://arxiv.org/abs/2412.00127
Authors: Siqi Kou,Jiachun Jin,Chang Liu,Ye Ma,Jian Jia,Quan Chen,Peng Jiang,Zhijie Deng
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-93] Efficient Learning Content Retrieval with Knowledge Injection

Link: https://arxiv.org/abs/2412.00125
Authors: Batuhan Sariturk,Rabia Bayraktar,Merve Elmas Erdem
Keywords:
Categories: Computation and Language (cs.CL)
Comments:

[NLP-94] ElectroVizQA: How well do Multi-modal LLMs perform in Electronics Visual Question Answering?

Link: https://arxiv.org/abs/2412.00102
Authors: Pragati Shuddhodhan Meshram,Swetha Karthikeyan,Bhavya,Suma Bhat
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-95] Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Link: https://arxiv.org/abs/2412.00098
Authors: Zhyar Rzgar K Rostam,Gábor Kertész
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, 7 tables

[NLP-96] Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks

Link: https://arxiv.org/abs/2412.00090
Authors: Zuguang Li,Shaohua Wu,Liang Li,Songge Zhang
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 5 pages, 6 figures

[NLP-97] Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness

Link: https://arxiv.org/abs/2412.00074
Authors: Avinash Amballa,Durga Sandeep Saluru,Gayathri Akkinapalli,Abhishek Sureddy,Akshay Kumar Sureddy
Keywords:
Categories: Computation and Language (cs.CL)
Comments: 18 pages

[NLP-98] COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

Link: https://arxiv.org/abs/2412.00071
Authors: Jinqi Xiao,Shen Sang,Tiancheng Zhi,Jing Liu,Qing Yan,Linjie Luo,Bo Yuan
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[NLP-99] Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Link: https://arxiv.org/abs/2412.00069
Authors: Mingyu Cao,Gen Li,Jie Ji,Jiaqi Zhang,Xiaolong Ma,Shiwei Liu,Lu Yin
Keywords:
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

[NLP-100] Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Link: https://arxiv.org/abs/2412.00061
Authors: Zhuofan Wen,Shangtong Gui,Yang Feng
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-101] LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting

Link: https://arxiv.org/abs/2412.00053
Authors: Lingzheng Zhang,Lifeng Shen,Yimin Zheng,Shiyuan Piao,Ziyue Li,Fugee Tsung
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-102] Random Tree Model of Meaningful Memory

Link: https://arxiv.org/abs/2412.01806
Authors: Weishun Zhong,Tankut Can,Antonis Georgiou,Ilya Shnayderman,Mikhail Katkov,Misha Tsodyks
Keywords:
Categories: Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 4 figures

[NLP-103] First numerical observation of the Berezinskii-Kosterlitz-Thouless transition in language models

Link: https://arxiv.org/abs/2412.01212
Authors: Yuma Toji,Jun Takahashi,Vwani Roychowdhury,Hideyuki Miyahara
Keywords:
Categories: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[NLP-104] Automating Feedback Analysis in Surgical Training: Detection Categorization and Assessment ALT

Link: https://arxiv.org/abs/2412.00760
Authors: Firdavs Nasriddinov,Rafal Kocielnik,Arushi Gupta,Cherine Yang,Elyssa Wong,Anima Anandkumar,Andrew Hung
Keywords:
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Accepted as a proceedings paper at Machine Learning for Health 2024

[NLP-105] High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR

Link: https://arxiv.org/abs/2412.00055
Authors: Sourav Banerjee,Ayushi Agarwal,Promila Ghosh
Keywords:
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 15 pages

Computer Vision

[CV-0] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

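[Quick read]: This paper addresses the dependence of existing decoder-only visual autoregressive (AR) models on a predefined generation order. The key to the solution is a "position instruction token" inserted before each image token to be predicted, which represents the spatial location of the next image token. This design lets the model train and generate in arbitrary token orders, removing the inductive bias of a predefined order. Trained on randomly permuted token sequences, RandAR matches the performance of its conventional raster-order counterpart and gains new capabilities, such as parallel decoding and zero-shot inpainting, outpainting, and resolution extrapolation.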

Link: https://arxiv.org/abs/2412.01827
Authors: Ziqi Pang,Tianyuan Zhang,Fujun Luan,Yunze Man,Hao Tan,Kai Zhang,William T. Freeman,Yu-Xiong Wang
Keywords: capable of generating, arbitrary token orders, RandAR, decoder-only visual autoregressive, token
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that rely on a predefined generation order, RandAR removes this inductive bias, unlocking new capabilities in decoder-only generation. Our essential design enables random order by inserting a “position instruction token” before each image token to be predicted, representing the spatial location of the next image token. Trained on randomly permuted token sequences – a more challenging task than fixed-order generation, RandAR achieves comparable performance to its conventional raster-order counterpart. More importantly, decoder-only transformers trained from random orders acquire new capabilities. For the efficiency bottleneck of AR models, RandAR adopts parallel decoding with KV-Cache at inference time, enjoying 2.5x acceleration without sacrificing generation quality. Additionally, RandAR supports inpainting, outpainting and resolution extrapolation in a zero-shot manner. We hope RandAR inspires new directions for decoder-only visual generation models and broadens their applications across diverse scenarios. Our project page is at this https URL.
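The interleaving idea can be sketched in a few lines. The example below is only a toy illustration of building a randomly ordered training sequence in which each image token is preceded by a position token; the tuple representation, the `(row, col)` encoding, and the function name are assumptions, and the real model's tokenization and embeddings are not described in the abstract.

```python
import random


def build_random_order_sequence(image_tokens, grid_width, seed=0):
    """Interleave position-instruction entries with image tokens in random order.

    image_tokens: token ids listed in raster (row-major) order.
    Returns a list alternating ("pos", (row, col)) and ("img", token_id).
    """
    rng = random.Random(seed)
    order = list(range(len(image_tokens)))
    rng.shuffle(order)  # random generation order over grid positions
    sequence = []
    for idx in order:
        row, col = divmod(idx, grid_width)
        sequence.append(("pos", (row, col)))        # where the next token goes
        sequence.append(("img", image_tokens[idx]))  # the token at that location
    return sequence


# A hypothetical 2x2 grid of image token ids.
toy_tokens = [10, 11, 12, 13]
toy_sequence = build_random_order_sequence(toy_tokens, grid_width=2)
print(toy_sequence)
```

Because every image token carries its location explicitly, the same sequence format covers raster order, random order, or a partial order with some positions already filled, which is one way to read the zero-shot inpainting claim.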

[CV-1] RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations

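[Quick read]: This paper addresses the challenging task of visual query localization in long videos and proposes RELOCATE, a training-free baseline. The key to the solution is to use region-based representations from pretrained vision models and to localize in three steps: 1) identify all objects in each frame; 2) compare the objects with the given query and select the most similar ones; 3) perform bidirectional tracking to obtain a spatio-temporal response. The paper also proposes key enhancements, such as refining the selected objects for accurate localization and generating additional visual queries to capture visual variation, which help with small objects, cluttered scenes, partial visibility, and varying appearances. On the Ego4D Visual Query 2D Localization dataset, RELOCATE improves spatio-temporal average precision by 49% (relative) over prior methods.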

Link: https://arxiv.org/abs/2412.01826
Authors: Savya Khosla,Sethuraman T V,Alexander Schwing,Derek Hoiem
Keywords: simple training-free baseline, training-free baseline designed, present RELOCATE, simple training-free, long videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.

[CV-2] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

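[Quick read]: This paper addresses the largely unexplored potential of in-context learning in large language models (LLMs) for general image generation tasks. The key to the solution is X-Prompt, a purely auto-regressive large vision-language model (VLM) designed to deliver competitive performance on both seen and unseen image generation tasks within a unified in-context learning framework. Its core design efficiently compresses valuable features from in-context examples, supports longer in-context token sequences, and improves generalization to unseen tasks. A unified training task for text and image prediction gives X-Prompt enhanced task awareness from in-context examples, so that it handles general image generation better.

Link: https://arxiv.org/abs/2412.01824
Authors: Zeyi Sun,Ziyang Chu,Pan Zhang,Tong Wu,Xiaoyi Dong,Yuhang Zang,Yuanjun Xiong,Dahua Lin,Jiaqi Wang
Keywords: open-task generalization capability, large language models’, open-task generalization, generalization capability, key component
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: code: this https URL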

Abstract:In-context generation is a key component of large language models’ (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model’s performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.

[CV-3] HDGS: Textured 2D Gaussian Splatting for Enhanced Scene Rendering

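[Quick read]: This paper addresses the anti-aliasing and high-resolution texture detail preservation problems that 2D Gaussian Splatting (2DGS) faces in neural rendering when rendering at arbitrary viewpoints. The key to the solution is: 1) aligning the 2D surfels with texture maps, combined with per-ray depth sorting and Fisher-based pruning, to improve rendering consistency and efficiency; 2) a frustum-based sampling method to mitigate aliasing artifacts. With these, the paper markedly improves detail preservation and anti-aliasing across viewpoints, and experiments show the method surpasses existing techniques in both respects.

Link: https://arxiv.org/abs/2412.01823
Authors: Yunzhou Song,Heguang Lin,Jiahui Lei,Lingjie Liu,Kostas Daniilidis
Keywords: Gaussian Splatting, Recent advancements, jointly reconstructing fine, reconstructing fine appearance, Gaussian surfels
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL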

Abstract:Recent advancements in neural rendering, particularly 2D Gaussian Splatting (2DGS), have shown promising results for jointly reconstructing fine appearance and geometry by leveraging 2D Gaussian surfels. However, current methods face significant challenges when rendering at arbitrary viewpoints, such as anti-aliasing for down-sampled rendering, and texture detail preservation for high-resolution rendering. We propose a novel method that aligns the 2D surfels with texture maps and augments them with per-ray depth sorting and Fisher-based pruning for rendering consistency and efficiency. With correct order, per-surfel texture maps significantly improve the capabilities to capture fine details. Additionally, to render high-fidelity details in varying viewpoints, we design a frustum-based sampling method to mitigate the aliasing artifacts. Experimental results on benchmarks and our custom texture-rich dataset demonstrate that our method surpasses existing techniques, particularly in detail preservation and anti-aliasing.

[CV-4] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

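[Quick read]: This paper addresses the computational challenges of deploying vision-language models (VLMs) on resource-constrained devices such as mobile platforms and robots. The key to the solution is VLsI (Verbalized Layers-to-Interactions), a new VLM family in 2B and 7B model sizes that prioritizes efficiency without sacrificing accuracy. VLsI uses a layer-wise distillation process that introduces intermediate "verbalizers", which map features from each layer to natural language space so that smaller VLMs can flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability common in output imitation and, by aligning the small VLM's layer-wise progression with that of the large VLM, goes beyond typical final-layer tuning.

Link: https://arxiv.org/abs/2412.01822
Authors: Byung-Kwan Lee,Ryo Hachiuma,Yu-Chiang Frank Wang,Yong Man Ro,Yueh-Hua Wu
Keywords: high-quality visual instruction, visual instruction tuning, instruction tuning samples, recent surge, surge in high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL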

Abstract:The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate “verbalizers” that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.

[CV-5] World-consistent Video Diffusion with Explicit 3D Modeling

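[Quick read]: This paper addresses the shortcomings of existing diffusion models in generating 3D-consistent content. The key to the solution is World-consistent Video Diffusion (WVD), a framework that incorporates explicit 3D supervision through XYZ images, which encode the global 3D coordinates of each pixel. Specifically, WVD trains a diffusion transformer to learn the joint distribution of RGB and XYZ frames and achieves multi-task adaptability through a flexible inpainting strategy. The method can estimate XYZ frames from ground-truth RGB frames and can generate novel RGB frames along a specified camera trajectory, unifying tasks such as single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. WVD shows competitive performance on multiple benchmarks and offers a scalable solution for 3D-consistent video and image generation with a single pretrained model.

Link: https://arxiv.org/abs/2412.01821
Authors: Qihang Zhang,Shuangfei Zhai,Miguel Angel Bautista,Kevin Miao,Alexander Toshev,Joshua Susskind,Jiatao Gu
Keywords: enabling realistic visual, realistic visual synthesis, Recent advancements, enabling realistic, multi-frame contexts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures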

Abstract:Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.

[CV-6] Towards Universal Soccer Video Understanding

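[Quick read]: This paper addresses soccer video understanding and proposes a comprehensive multi-modal framework. The key to the solution is: (i) SoccerReplay-1988, the largest multi-modal soccer dataset to date, with videos and detailed annotations from 1,988 complete matches and an automated annotation pipeline; (ii) MatchVision, the first visual-language foundation model in the soccer domain, which leverages spatiotemporal information in soccer videos and performs well on a range of downstream tasks; (iii) extensive experiments and ablation studies showing state-of-the-art performance on action classification, commentary generation, and multi-view foul recognition, substantially outperforming existing models and demonstrating the strength of the proposed data and model.

Link: https://arxiv.org/abs/2412.01820
Authors: Jiayuan Rao,Haoning Wu,Hao Jiang,Ya Zhang,Yanfeng Wang,Weidi Xie
Keywords: attracted widespread interest, globally celebrated sport, globally celebrated, attracted widespread, widespread interest
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report; Project Page: this https URL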

Abstract:As a globally celebrated sport, soccer has attracted widespread interest from fans over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, and demonstrate state-of-the-art performance on all of them, substantially outperforming existing models, which has demonstrated the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research. The code and model will be publicly available for reproduction.

[CV-7] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

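[Quick read]: This paper addresses the convergence and performance shortcomings of existing autoregressive (AR) models in text-to-image generation. The key to the solution is Switti, a scale-wise transformer, which improves performance through several innovations: 1) architectural modifications to existing AR models that improve their convergence and overall performance; 2) the observation that self-attention maps in the pretrained scale-wise AR model depend only weakly on preceding scales, which motivates a non-autoregressive counterpart with about 11% faster sampling and lower memory usage while slightly improving generation quality; 3) the finding that classifier-free guidance at high-resolution scales is often unnecessary and can even hurt performance, so disabling guidance at these scales speeds up sampling by about a further 20% and improves fine-grained details. Switti performs strongly in human preference studies and automated evaluations, outperforming existing text-to-image AR models and competing with state-of-the-art text-to-image diffusion models while being up to 7x faster.

Link: https://arxiv.org/abs/2412.01819
Authors: Anton Voronov,Denis Kuznedelev,Mikhail Khoroshikh,Valentin Khrulkov,Dmitry Baranchuk
Keywords: work presents Switti, work presents, presents Switti, scale-wise transformer, generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 21 figures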

Abstract:This work presents Switti, a scale-wise transformer for text-to-image generation. Starting from existing next-scale prediction AR models, we first explore them for T2I generation and propose architectural modifications to improve their convergence and overall performance. We then observe that self-attention maps of our pretrained scale-wise AR model exhibit weak dependence on preceding scales. Based on this insight, we propose a non-AR counterpart facilitating ~11% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~20% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
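
The idea of disabling classifier-free guidance at high-resolution scales can be illustrated with a small sketch. It assumes the standard guidance formula `uncond + guidance * (cond - uncond)`; the function name, the `last_guided_scale` cutoff, and the guidance weight are hypothetical choices for this example and are not taken from the paper.

```python
def guided_logits(cond_fn, uncond_fn, scale_idx, guidance=4.0, last_guided_scale=6):
    """Combine conditional and unconditional predictions for one scale.

    Beyond `last_guided_scale` (a hypothetical cutoff) guidance is skipped,
    so the unconditional forward pass is never evaluated for those scales.
    """
    cond = cond_fn()
    if scale_idx > last_guided_scale:
        return cond
    uncond = uncond_fn()
    return [u + guidance * (c - u) for c, u in zip(cond, uncond)]


# Toy stand-ins for model forward passes; the counter shows the saved call.
calls = {"uncond": 0}


def cond_fn():
    return [2.0, 0.0]


def uncond_fn():
    calls["uncond"] += 1
    return [1.0, 1.0]


low = guided_logits(cond_fn, uncond_fn, scale_idx=2, guidance=3.0)
high = guided_logits(cond_fn, uncond_fn, scale_idx=8, guidance=3.0)
print(low, high, calls["uncond"])
```

Skipping the unconditional pass at the largest scales is where a sampling speed-up would come from, since those scales contain the most tokens.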

[CV-8] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

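[Quick read]: This paper addresses the low inference efficiency of vision-language models (VLMs) caused by their reliance on a large number of visual tokens when interacting with large language models (LLMs). The key to the solution is FasterVLM, a training-free visual token pruning method that evaluates the importance of visual tokens more accurately by using the cross-attentions between the [CLS] token and image tokens in the visual encoder. FasterVLM removes redundant visual tokens immediately after the visual encoder output, so they never interact with the LLM, which significantly accelerates VLM inference. At a pruning ratio as high as 95%, the method still retains 90% of the performance of LLaVA-1.5-7B, significantly outperforming existing methods based on text-visual attention.

Link: https://arxiv.org/abs/2412.01818
Authors: Qizhe Zhang,Aosong Cheng,Ming Lu,Zhiyong Zhuo,Minqi Wang,Jiajun Cao,Shaobo Guo,Qi She,Shanghang Zhang
Keywords: Large vision-language models, large language models, visual tokens, Large vision-language, large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 11 figures, code: this https URL , project page: this https URL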

Abstract:Large vision-language models (VLMs) often rely on a substantial number of visual tokens when interacting with large language models (LLMs), which has proven to be inefficient. Recent efforts have aimed to accelerate VLM inference by pruning visual tokens. Most existing methods assess the importance of visual tokens based on the text-visual cross-attentions in LLMs. In this study, we find that the cross-attentions between text and visual tokens in LLMs are inaccurate. Pruning tokens based on these inaccurate attentions leads to significant performance degradation, especially at high reduction ratios. To this end, we introduce FasterVLM, a simple yet effective training-free visual token pruning method that evaluates the importance of visual tokens more accurately by utilizing attentions between the [CLS] token and image tokens from the visual encoder. FasterVLM eliminates redundant visual tokens immediately after the visual encoder, ensuring they do not interact with LLMs and resulting in faster VLM inference. It is worth noting that, benefiting from the accuracy of [CLS] cross-attentions, FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B. We apply FasterVLM to various VLMs, including LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA, to demonstrate its effectiveness. Experimental results show that our FasterVLM maintains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing text-visual attention-based methods. Our code is available at this https URL.
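
A minimal sketch of attention-based token pruning, under stated assumptions, may help make the idea concrete. It assumes that a per-token [CLS] attention score is already available from the visual encoder and simply keeps the top-scoring tokens in their original order; the function name, the `keep_ratio` parameter, and the toy scores are hypothetical and do not come from the FasterVLM code.

```python
def prune_visual_tokens(tokens, cls_attention, keep_ratio=0.05):
    """Keep the visual tokens that receive the most [CLS] attention.

    tokens        : list of visual tokens (any objects)
    cls_attention : one attention score per token
    keep_ratio    : fraction of tokens to keep (0.05 matches 95% pruning)
    """
    if len(tokens) != len(cls_attention):
        raise ValueError("tokens and cls_attention must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    keep = sorted(ranked[:k])  # restore the original spatial order
    return [tokens[i] for i in keep], keep


tokens = [f"t{i}" for i in range(8)]
attention = [0.02, 0.30, 0.05, 0.01, 0.25, 0.07, 0.20, 0.10]
kept, kept_idx = prune_visual_tokens(tokens, attention, keep_ratio=0.25)
print(kept, kept_idx)
```

Because pruning happens before the language model sees any visual token, the sequence passed to the LLM is shorter, which is the source of the inference savings the abstract reports.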

[CV-9] Efficient Semantic Communication Through Transformer-Aided Compression

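[Quick read]: This paper addresses how to transmit multi-resolution image data efficiently under limited bandwidth in wireless communication systems. The key to the solution is a channel-aware adaptive framework for semantic communication that uses vision transformers and interprets the attention mask as a measure of the semantic content of image patches, adjusting the compression rate accordingly. Specifically, the method dynamically categorizes image patches according to the instantaneous channel bandwidth and adapts the encoding resolution, so that critical information is preserved in bandwidth-constrained environments and communication efficiency improves.

Link: https://arxiv.org/abs/2412.01817
Authors: Matin Mortaheb,Mohammad A. Amir Khojastepour,Sennur Ulukus
Keywords: elements within complex, proven highly effective, attention mechanisms, proven highly, critical elements
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments: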

Abstract:Transformers, known for their attention mechanisms, have proven highly effective in focusing on critical elements within complex data. This feature can effectively be used to address the time-varying channels in wireless communication systems. In this work, we introduce a channel-aware adaptive framework for semantic communication, where different regions of the image are encoded and compressed based on their semantic content. By employing vision transformers, we interpret the attention mask as a measure of the semantic contents of the patches and dynamically categorize the patches to be compressed at various rates as a function of the instantaneous channel bandwidth. Our method enhances communication efficiency by adapting the encoding resolution to the content’s relevance, ensuring that even in highly constrained environments, critical information is preserved. We evaluate the proposed adaptive transmission framework using the TinyImageNet dataset, measuring both reconstruction quality and accuracy. The results demonstrate that our approach maintains high semantic fidelity while optimizing bandwidth, providing an effective solution for transmitting multi-resolution data in limited bandwidth conditions.
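
One way to picture bandwidth-dependent patch categorization is the greedy allocation below. It is a sketch under assumptions, not the paper's scheme: the three levels (`skip`, `low`, `high`), their costs, and the two-pass greedy rule are hypothetical choices made for illustration.

```python
HIGH_COST = 4  # hypothetical cost of a high-resolution patch, in bandwidth units
LOW_COST = 1   # hypothetical cost of a low-resolution patch


def allocate(attention, budget):
    """Assign a compression level to each patch from its attention score.

    Patches are visited from most to least attended. Every patch first gets
    the low rate while the budget lasts; leftover budget then upgrades the
    most attended patches to the high rate.
    """
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    levels = ["skip"] * len(attention)
    remaining = budget
    for i in order:
        if remaining >= LOW_COST:
            levels[i] = "low"
            remaining -= LOW_COST
    upgrade = HIGH_COST - LOW_COST
    for i in order:
        if levels[i] == "low" and remaining >= upgrade:
            levels[i] = "high"
            remaining -= upgrade
    return levels


attention = [0.1, 0.6, 0.2, 0.1]
print(allocate(attention, budget=7))  # a good channel
print(allocate(attention, budget=2))  # a constrained channel
```

With a larger budget the most attended patch is sent at the high rate; with a small budget only the two most attended patches are sent at all, which mirrors the abstract's goal of preserving critical information under tight bandwidth.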

[CV-10] V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

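[Quick read]: This paper addresses the fact that single-frame cooperative perception methods in vehicle-to-everything (V2X) technology ignore temporal cues and temporal tasks such as temporal perception and prediction. The key to the solution is the design of one-step and multi-step communication strategies (when to transmit), combined with early, late, and intermediate fusion strategies (what to transmit), together with V2XPnP, a new intermediate fusion framework that uses a unified Transformer-based architecture to model complex spatiotemporal relationships across temporal frames, spatial agents, and high-definition maps. The paper also introduces the V2XPnP Sequential Dataset, which supports all V2X cooperation modes and addresses the limitations of existing real-world datasets. Experiments show the framework outperforms state-of-the-art methods on both perception and prediction tasks.

Link: https://arxiv.org/abs/2412.01812
Authors: Zewei Zhou,Hao Xiang,Zhaoliang Zheng,Seth Z. Zhao,Mingyue Lei,Yun Zhang,Tianhui Cai,Xinyi Liu,Johnson Liu,Maheswari Bajji,Jacob Pham,Xin Xia,Zhiyu Huang,Bolei Zhou,Jiaqi Ma
Keywords: perception and prediction, technologies offer, single-vehicle systems, offer a promising, promising paradigm
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website link: this https URL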

Abstract:Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents’ information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on temporal perception and prediction tasks in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with various fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatiotemporal relationships across temporal per-frame, spatial per-agent, and high-definition map. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X cooperation modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate our framework outperforms state-of-the-art methods in both perception and prediction tasks.

[CV-11] Occam's LGS: A Simple Approach for Language Gaussian Splatting

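[Quick read]: This paper addresses how to aggregate semantic vision-language features efficiently in 3D scene representations. Existing methods typically rely on complex aggregation techniques that lead to high computational cost and long training time. The key to the proposed solution is to simplify this process by applying Occam's razor: weighted multi-view feature aggregation using weights derived from the standard rendering process, followed by a simple heuristic-based filtering of noisy Gaussians. The approach delivers a speed-up of two orders of magnitude and state-of-the-art results, and it supports reasoning and scene manipulation, such as object insertion, directly in the language features.

Link: https://arxiv.org/abs/2412.01807
Authors: Jiahuan Cheng,Jan-Nico Zaech,Luc Van Gool,Danda Pani Paudel
Keywords: Gaussian Splatting, widely adopted approach, widely adopted, Gaussian, Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL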

Abstract:TL;DR: Gaussian Splatting is a widely adopted approach for 3D scene representation that offers efficient, high-quality 3D reconstruction and rendering. A major reason for the success of 3DGS is its simplicity of representing a scene with a set of Gaussians, which makes it easy to interpret and adapt. To enhance scene understanding beyond the visual representation, approaches have been developed that extend 3D Gaussian Splatting with semantic vision-language features, especially allowing for open-set tasks. In this setting, the language features of 3D Gaussian Splatting are often aggregated from multiple 2D views. Existing works address this aggregation problem using cumbersome techniques that lead to high computational cost and training time. In this work, we show that the sophisticated techniques for language-grounded 3D Gaussian Splatting are simply unnecessary. Instead, we apply Occam’s razor to the task at hand and perform weighted multi-view feature aggregation using the weights derived from the standard rendering process, followed by a simple heuristic-based noisy Gaussian filtration. Doing so offers us state-of-the-art results with a speed-up of two orders of magnitude. We showcase our results in two commonly used benchmark datasets: LERF and 3D-OVS. Our simple approach allows us to perform reasoning directly in the language features, without any compression whatsoever. Such modeling in turn offers easy scene manipulation, unlike the existing methods – which we illustrate using an application of object insertion in the scene. Furthermore, we provide a thorough discussion regarding the significance of our contributions within the context of the current literature. 
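
The weighted multi-view aggregation described in this abstract can be sketched as a per-Gaussian weighted average. This is an illustrative reading, not the authors' code: it assumes each observation supplies the rendering weight of a Gaussian in one view together with the 2D language feature seen there, and the `min_weight` filter is a hypothetical stand-in for the paper's heuristic noisy-Gaussian filtering.

```python
def aggregate_features(observations, dim, min_weight=0.1):
    """Weighted average of per-view features for each Gaussian.

    observations : iterable of (gaussian_id, weight, feature), one entry per
                   view in which the Gaussian contributes to a pixel
    dim          : feature dimensionality
    min_weight   : Gaussians with less total weight are dropped as noisy
                   (a hypothetical threshold for this example)
    """
    sums, totals = {}, {}
    for gid, weight, feature in observations:
        acc = sums.setdefault(gid, [0.0] * dim)
        for d in range(dim):
            acc[d] += weight * feature[d]
        totals[gid] = totals.get(gid, 0.0) + weight
    return {
        gid: [value / totals[gid] for value in acc]
        for gid, acc in sums.items()
        if totals[gid] >= min_weight
    }


observations = [
    (0, 0.75, [1.0, 0.0]),  # Gaussian 0 seen strongly in view A
    (0, 0.25, [0.0, 1.0]),  # Gaussian 0 seen weakly in view B
    (1, 0.5, [2.0, 2.0]),
    (2, 0.01, [5.0, 5.0]),  # barely visible, filtered out
]
features = aggregate_features(observations, dim=2)
print(features)
```

No optimization loop is involved: one pass over the views accumulates the sums, which is consistent with the large speed-up the abstract claims over training-based aggregation.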

[CV-12] SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

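[Quick read]: This paper addresses controllable generation and editing in large-scale 3D scene generation. The key to the solution is SceneFactor, a diffusion-based method whose factored diffusion formulation leverages latent semantic and geometric manifolds to generate 3D scenes of arbitrary size. SceneFactor supports text-guided 3D scene synthesis and, by generating a proxy semantic space composed of semantic 3D boxes, enables controllable editing of generated scenes: adding, removing, and resizing the semantic 3D boxes guides high-fidelity, consistent 3D geometric editing. Experiments show the method achieves high-fidelity 3D scene synthesis with effective controllable editing.

Link: https://arxiv.org/abs/2412.01801
Authors: Alexey Bokhovkin,Quan Meng,Shubham Tulsiani,Angela Dai
Keywords: present SceneFactor, enables, controllable generation, SceneFactor enables text-guided, enables controllable generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 12 figures; this https URL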

Abstract:We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, changing the size of the semantic 3D proxy boxes that guides high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

[CV-13] PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

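[Quick read]: This paper addresses the weakness of video large language models (Video LLMs) in physical commonsense understanding. The key to the solution is the PhysGame benchmark for evaluating physical commonsense violations in gameplay videos, together with two datasets, PhysInstruct and PhysDPO, for instruction tuning and preference optimization respectively. PhysVLM, a physical knowledge-enhanced video LLM trained on these datasets, shows markedly improved performance on physical commonsense understanding and general video understanding tasks.

Link: https://arxiv.org/abs/2412.01800
Authors: Meng Cao,Haoran Tang,Haoze Zhao,Hangyu Guo,Jiaheng Liu,Ge Zhang,Ruyang Liu,Qiang Sun,Ian Reid,Xiaodan Liang
Keywords: large language models, dynamic visual content, video-based large language, interpret dynamic visual, Recent advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: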

Abstract:Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and across 12 distinct physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset PhysInstruct with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset PhysDPO with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking). Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.

[CV-14] SEAL: Semantic Attention Learning for Long Video Representation

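[Quick read]: This paper addresses the high computational complexity and redundant temporal information in long video understanding. The key to the solution is SEmantic Attention Learning (SEAL), a novel unified representation. SEAL decomposes long videos into three types of semantic entities (scenes, objects, and actions), which markedly reduces computational complexity, and proposes an attention learning module that handles redundancy by formulating token selection as a subset selection optimization. The method improves performance on long video understanding tasks and significantly outperforms state-of-the-art methods on multiple benchmarks.

Link: https://arxiv.org/abs/2412.01798
Authors: Lan Wang,Yujia Chen,Wen-Sheng Chu,Vishnu Boddeti,Du Tran
Keywords: presents challenges due, inherent high computational, redundant temporal information, understanding presents challenges, Long video understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: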

Abstract:Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must process such redundancy efficiently while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a handful of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity formulated as a subset selection optimization problem. Our representation is versatile, enabling applications across various long video understanding tasks. Extensive experiments show that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks and benchmarks including LVBench, MovieChat-1K, and Ego4D.
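
Balancing token relevance with diversity as a subset selection problem can be approximated with a greedy rule in the spirit of maximal marginal relevance. The sketch below uses that swapped-in greedy heuristic purely for illustration; it is not the optimization used in SEAL, and the `trade_off` parameter, function name, and toy scores are assumptions.

```python
def select_tokens(relevance, similarity, k, trade_off=0.5):
    """Greedily pick k tokens that are relevant yet mutually dissimilar.

    relevance  : relevance[i] is the relevance score of token i
    similarity : similarity[i][j] is the pairwise similarity of tokens i and j
    trade_off  : weight on relevance versus redundancy (hypothetical value)
    """
    selected = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:

        def gain(i):
            # Redundancy is the closest match among already selected tokens.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.4]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
chosen = select_tokens(relevance, similarity, k=2)
print(chosen)
```

In this toy case token 1 is nearly a duplicate of token 0, so the less relevant but more distinct token 2 is chosen second, which is the behaviour a relevance-diversity objective is meant to produce.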

[CV-15] IQA-Adapter: Exploring Knowledge Transfer from Image Quality Assessment to Diffusion-based Generative Models

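[Quick read]: This paper addresses the challenge of generating high-quality images in diffusion-based conditional image generation, in particular the lack of mechanisms for conditioning on perceptual quality. The key to the solution is IQA-Adapter, an architecture that learns the relationship between images and quality scores and conditions generation on a target quality level, shifting the distribution of generated images toward a higher-quality subdomain. The method improves generated image quality by up to 10% across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. IQA-Adapter can also be used inversely as a degradation model that generates images of progressively lower quality.

Link: https://arxiv.org/abs/2412.01794
Authors: Khaled Abud,Sergey Lavrushkin,Alexey Kirillov,Dmitriy Vatolin
Keywords: achieving unprecedented fidelity, recently transformed conditional, transformed conditional image, semantically accurate images, achieving unprecedented
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: GitHub repo: this https URL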

Abstract:Diffusion-based models have recently transformed conditional image generation, achieving unprecedented fidelity in generating photorealistic and semantically accurate images. However, consistently generating high-quality images remains challenging, partly due to the lack of mechanisms for conditioning outputs on perceptual quality. In this work, we propose methods to integrate image quality assessment (IQA) models into diffusion-based generators, enabling quality-aware image generation. First, we experiment with gradient-based guidance to optimize image quality directly and show this approach has limited generalizability. To address this, we introduce IQA-Adapter, a novel architecture that conditions generation on target quality levels by learning the relationship between images and quality scores. When conditioned on high target quality, IQA-Adapter shifts the distribution of generated images towards a higher-quality subdomain. This approach achieves up to a 10% improvement across multiple objective metrics, as confirmed by a subjective study, while preserving generative diversity and content. Additionally, IQA-Adapter can be used inversely as a degradation model, generating progressively more distorted images when conditioned on lower quality scores. Our quality-aware methods also provide insights into the adversarial robustness of IQA models, underscoring the potential of quality conditioning in generative modeling and the importance of robust IQA methods.

[CV-16] CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

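[Quick read]: This paper addresses the controllability and consistency of editing in dynamic 3D scenes. The key to the solution is a novel framework that first fine-tunes the InstructPix2Pix model and then performs a two-stage optimization based on deformable 3D Gaussians. Fine-tuning lets the model "learn" the editing ability from a single edited reference image, reducing the complex task of dynamic scene editing to a simple 2D image editing process. By learning editing regions and styles directly from the reference, the approach achieves consistent and precise local edits without tracking the desired editing regions, which addresses key challenges in dynamic scene editing. The subsequent two-stage optimization uses a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared with state-of-the-art methods, it offers more flexible and controllable local scene editing with high-quality, consistent results.

Link: https://arxiv.org/abs/2412.01792
Authors: Kai He,Chin-Hsuan Wu,Igor Gilitschenski
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, greatly improved realistic, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL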

Abstract:Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to “learn” the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.

[CV-17] Pretrained Reversible Generation as Unsupervised Visual Representation Learning

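[Quick read]: This paper addresses the use of generative models for discriminative tasks, in particular how to exploit their capacity to extract robust features for downstream tasks. The key to the solution is Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG leverages the high capacity of generative models so that they serve as robust and generalizable feature extractors for downstream tasks, outperforming prior generative model-based methods on multiple benchmarks, including 78% top-1 accuracy on ImageNet.

Link: https://arxiv.org/abs/2412.01787
Authors: Rongkun Xue,Jinouwen Zhang,Yazhe Niu,Dazhong Shen,Bingqi Ma,Yu Liu,Jing Yang
Keywords: significantly advanced generation, tasks remains underexplored, Recent generative models, Recent generative, discriminative tasks remains
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: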

Abstract:Recent generative models based on score matching and flow matching have significantly advanced generation tasks, but their potential in discriminative tasks remains underexplored. Previous approaches, such as generative classifiers, have not fully leveraged the capabilities of these models for discriminative tasks due to their intricate designs. We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous flow model. PRG effectively reuses unsupervised generative models, leveraging their high capacity to serve as robust and generalizable feature extractors for downstream tasks. Our method consistently outperforms prior approaches across multiple benchmarks, achieving state-of-the-art performance among generative model-based methods, including 78% top-1 accuracy on ImageNet. Extensive ablation studies further validate the effectiveness of our approach.

[CV-18] Identifying Reliable Predictions in Detection Transformers

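Quick read: This paper addresses the large number of predictions that DETR produces in object detection, and in particular how reliable and trustworthy those predictions are. The key to the solution is a new metric, Object-level Calibration Error (OCE), for assessing calibration quality across different models and across configurations of a single model. The paper also introduces a post hoc Uncertainty Quantification (UQ) framework that assesses the reliability of DETR on each test image by contrasting the average confidence scores of positive and negative predictions. Together these address the inability of existing performance and calibration metrics (such as average precision) to determine which subset of DETR's predictions can be trusted.

Link: https://arxiv.org/abs/2412.01782
Authors: Young-Jin Park,Carson Sobolewski,Navid Azizan
Keywords: DEtection TRansformer, promising architecture, DETR, predictions, DETR predictions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: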

Abstract:DEtection TRansformer (DETR) has emerged as a promising architecture for object detection, offering an end-to-end prediction pipeline. In practice, however, DETR generates hundreds of predictions that far outnumber the actual number of objects present in an image. This raises the question: can we trust and use all of these predictions? Addressing this concern, we present empirical evidence highlighting how different predictions within the same image play distinct roles, resulting in varying reliability levels across those predictions. More specifically, while multiple predictions are often made for a single object, our findings show that most often one such prediction is well-calibrated, and the others are poorly calibrated. Based on these insights, we demonstrate identifying a reliable subset of DETR’s predictions is crucial for accurately assessing the reliability of the model at both object and image levels. Building on this viewpoint, we first tackle the shortcomings of widely used performance and calibration metrics, such as average precision and various forms of expected calibration error. Specifically, they are inadequate for determining which subset of DETR’s predictions should be trusted and utilized. In response, we present Object-level Calibration Error (OCE), which is capable of assessing the calibration quality both across different models and among various configurations within a specific model. As a final contribution, we introduce a post hoc Uncertainty Quantification (UQ) framework that predicts the accuracy of the model on a per-image basis. By contrasting the average confidence scores of positive (i.e., likely to be matched) and negative predictions determined by OCE, the framework assesses the reliability of the DETR model for each test image. 
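
The per-image contrast described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `image_reliability`, the use of a plain difference of means, and the `positive_mask` input are assumptions. In the paper the positive/negative split is determined by OCE, which is not reproduced here.

```python
from statistics import mean

def image_reliability(confidences, positive_mask):
    """Contrast mean confidence of positive vs. negative predictions for one image.

    confidences   -- one confidence score per prediction
    positive_mask -- True where a prediction is treated as positive
                     (assumed to be given; the paper derives this split via OCE)
    """
    pos = [c for c, is_pos in zip(confidences, positive_mask) if is_pos]
    neg = [c for c, is_pos in zip(confidences, positive_mask) if not is_pos]
    # An empty group contributes 0.0 so the score stays defined.
    pos_mean = mean(pos) if pos else 0.0
    neg_mean = mean(neg) if neg else 0.0
    return pos_mean - neg_mean

score = image_reliability([0.9, 0.8, 0.1, 0.2], [True, True, False, False])
print(round(score, 3))
```

Under this sketch, a larger gap between the two groups would suggest that the model separates likely matches from the rest more cleanly on that image.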

[CV-19] XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

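Quick read: This paper addresses the performance of image tokenizers in generative models, in particular how improved quantization techniques and training recipes raise the quality of image reconstruction and downstream generation. The key to the solution is XQ-GAN, a framework that integrates several advanced quantization techniques, namely vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), in a highly flexible and customizable training environment. On the ImageNet 256x256 benchmark it reaches an rFID of 0.64, clearly ahead of MAGVIT-v2 and VAR, and it also improves gFID, which demonstrates its effectiveness for both reconstruction and generation.

Link: https://arxiv.org/abs/2412.01762
Authors: Xiang Li,Kai Qiu,Hao Chen,Jason Kuen,Jiuxiang Gu,Jindong Wang,Zhe Lin,Bhiksha Raj
Keywords: play a critical, critical role, role in shaping, shaping the performance, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL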

Abstract:Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR’s 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers for the community to directly train the subsequent generative models on it or fine-tune for specialized tasks.
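
To make two of the listed quantizers concrete, here is a small sketch of residual quantization (RQ), which applies plain vector quantization (VQ) repeatedly to whatever the previous level failed to capture. It is a generic textbook version, not code from XQ-GAN: the function names and the tiny hand-written codebooks are assumptions for illustration.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance): plain VQ."""
    def sqdist(c):
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def residual_quantize(x, codebooks):
    """Quantize x level by level; each level encodes the remaining residual."""
    residual = list(x)
    recon = [0.0] * len(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        recon = [r + c for r, c in zip(recon, codebook[i])]
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, recon

# Hypothetical two-level codebooks: coarse first, then a finer correction.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
indices, recon = residual_quantize([1.2, 0.8], codebooks)
print(indices, recon)
```

With a single codebook the loop reduces to ordinary VQ; each additional level adds one token per vector in exchange for a smaller reconstruction error.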

[CV-20] Continuous-Time Human Motion Field from Events

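Quick read: This paper addresses the estimation of a continuous-time human motion field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely mainly on frame-based approaches, which are prone to aliasing and motion blur because of their limited temporal resolution. The key to the proposed solution is a recurrent feed-forward neural network that predicts human motion in a latent space directly from events, producing a continuous-time human motion field represented as a time-implicit function. This avoids the limits of conventional discrete-time prediction and supports parallel pose queries at arbitrary temporal resolutions, which improves both computational efficiency and accuracy. Compared with earlier event-based methods that optimize a fixed number of poses at high frame rates, the method reduces joint errors by 23.8% and computation time by 69% on a new high-speed human motion dataset.

Link: https://arxiv.org/abs/2412.01747
Authors: Ziyun Wang,Ruijun Zhang,Zi-Yan Liu,Yufu Wang,Kostas Daniilidis
Keywords: Human Mesh Recovery, human motion field, human motion, continuous-time human motion, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: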

Abstract:This paper addresses the challenges of estimating a continuous-time human motion field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field directly from events by leveraging a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. Prior state-of-the-art event-based methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as we increase the temporal resolution. In comparison, we present the first work that replaces traditional discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. Despite the promises of event cameras, few benchmarks have tested the limit of high-speed human motion estimation. We introduce Beam-splitter Event Agile Human Motion Dataset-a hardware-synchronized high-speed human dataset to fill this gap. On this new data, our method improves joint errors by 23.8% compared to previous event human methods while reducing the computational time by 69%.

[CV-21] Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes

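Quick read: This paper addresses the seamless integration of aerial and street-view images, which remains a major challenge in neural scene reconstruction and rendering. Existing methods focus mainly on a single domain, which limits their use in immersive environments that require extensive free-view exploration with large viewpoint changes. The proposed solution is Horizon-GS, a new approach built on Gaussian Splatting that unifies reconstruction and rendering for aerial and street views. Its key element is a new training strategy that overcomes viewpoint discrepancies and produces high-fidelity scenes. The paper also curates a high-quality aerial-to-ground dataset covering both synthetic and real-world scenes to support further research. Experiments on diverse urban scene datasets confirm the effectiveness of the method.

Link: https://arxiv.org/abs/2412.01745
Authors: Lihan Jiang,Kerui Ren,Mulin Yu,Linning Xu,Junting Dong,Tao Lu,Feng Zhao,Dahua Lin,Bo Dai
Keywords: Seamless integration, view images remains, street view images, images remains, remains a significant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: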

Abstract:Seamless integration of both aerial and street view images remains a significant challenge in neural scene reconstruction and rendering. Existing methods predominantly focus on a single domain, limiting their applications in immersive environments, which demand extensive free view exploration with large view changes both horizontally and vertically. We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques that tackles the unified reconstruction and rendering of aerial and street views. Our method addresses the key challenges of combining these perspectives with a new training strategy, overcoming viewpoint discrepancies to generate high-fidelity scenes. We also curate a high-quality aerial-to-ground views dataset encompassing both synthetic and real-world scenes to advance further research. Experiments across diverse urban scene datasets confirm the effectiveness of our method.

[CV-22] Automated Toll Management System Using RFID and Image Processing

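Quick read: This paper addresses the traffic congestion caused by passing through toll plazas. The key to the solution is an Electronic Toll Collection (ETC) system that combines RFID tags with number plate verification to improve both efficiency and security. Specifically, the paper uses image processing and a convolutional neural network (CNN) classifier to recognize vehicle registration numbers, sends a notification email to the owner so that the toll is paid within a set timeframe, and deducts the fee automatically from the owner's balance in real time. This removes the need to queue for payment, reduces delays, and makes travel more convenient.

Link: https://arxiv.org/abs/2412.01728
Authors: Raihan Ahmed,Shahed Chowdhury Omi,Md. Sadman Rahman,Niaz Rahman Bhuiyan
Keywords: recent studies, identified in recent, Electronic Toll Collection, number plate verification, Traveling
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: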

Abstract:Traveling through toll plazas is one of the primary causes of congestion, as identified in recent studies. Electronic Toll Collection (ETC) systems can mitigate this problem. This experiment focuses on enhancing the security of ETC using RFID tags and number plate verification. For number plate verification, image processing is employed, and a CNN classifier is implemented to detect vehicle registration numbers. Based on the registered number, a notification email is sent to the respective owner for toll fee payment within a specific timeframe to avoid fines. Additionally, toll fees are automatically deducted in real-time from the owner’s balance. This system benefits travelers by eliminating the need to queue for toll payment, thereby reducing delays and improving convenience.

[CV-23] Attacks on multimodal models

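Quick read: This paper studies the attack risks that multimodal chat models may face in practice, in particular whether the open-source components they include pass on their inherent vulnerabilities and how dangerous this is in industrial use. The key to the approach is an analysis of the pretrained components used in modern vision-language models (LLaVA, BLIP, and others), specifically the CLIP architecture and its image encoder (CLIP-ViT), together with an evaluation of how well several patch attack variants against these components generalize. In this way the paper aims to expose how fragile these models are under attack and to inform work on making them more secure.

Link: https://arxiv.org/abs/2412.01725
Authors: Viacheslav Iablochnikov,Alexander Rogachev
Keywords: gaining increasing popularity, increasing popularity, capable of working, modalities simultaneously, chat format
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 13 figures, 3 tables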

Abstract:Today, models capable of working with various modalities simultaneously in a chat format are gaining increasing popularity. Despite this, there is an issue of potential attacks on these models, especially considering that many of them include open-source components. It is important to study whether the vulnerabilities of these components are inherited and how dangerous this can be when using such models in the industry. This work is dedicated to researching various types of attacks on such models and evaluating their generalization capabilities. Modern VLM models (LLaVA, BLIP, etc.) often use pre-trained parts from other models, so the main part of this research focuses on them, specifically on the CLIP architecture and its image encoder (CLIP-ViT) and various patch attack variations for it.

[CV-24] BroadTrack: Broadcast Camera Tracking for Soccer

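Quick read: This paper addresses camera calibration and localization for soccer broadcasts, specifically the tracking of elevated tripod-mounted cameras around the stadium. The key to the solution is combining existing open-source soccer field detectors with carefully designed camera and tripod models to build an efficient, robust, and accurate camera tracking system called BroadTrack. On the SoccerNet dataset the system halves the mean reprojection error and gains more than 15% in Jaccard index, well ahead of existing state-of-the-art methods. The paper also shows qualitative results on a longer broadcast clip (20 minutes) to demonstrate the robustness and soundness of the system.

Link: https://arxiv.org/abs/2412.01721
Authors: Floriane Magera,Thomas Hoyoux,Olivier Barnich,Marc Van Droogenbroeck
Keywords: augmented reality graphics, simply named camera, named camera calibration, Camera calibration, refereeing purposes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, 3 tables, 60 references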

Abstract:Camera calibration and localization, sometimes simply named camera calibration, enables many applications in the context of soccer broadcasting, for instance regarding the interpretation and analysis of the game, or the insertion of augmented reality graphics for storytelling or refereeing purposes. To contribute to such applications, the research community has typically focused on single-view calibration methods, leveraging the near-omnipresence of soccer field markings in wide-angle broadcast views, but leaving all temporal aspects, if considered at all, to general-purpose tracking or filtering techniques. Only a few contributions have been made to leverage any domain-specific knowledge for this tracking task, and, as a result, there lacks a truly performant and off-the-shelf camera tracking system tailored for soccer broadcasting, specifically for elevated tripod-mounted cameras around the stadium. In this work, we present such a system capable of addressing the task of soccer broadcast camera tracking efficiently, robustly, and accurately, outperforming by far the most precise methods of the state-of-the-art. By combining the available open-source soccer field detectors with carefully designed camera and tripod models, our tracking system, BroadTrack, halves the mean reprojection error rate and gains more than 15% in terms of Jaccard index for camera calibration on the SoccerNet dataset. Furthermore, as the SoccerNet dataset videos are relatively short (30 seconds), we also present qualitative results on a 20-minute broadcast clip to showcase the robustness and the soundness of our system.
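
For readers unfamiliar with the reprojection error quoted above, the sketch below shows one common way to compute it: project known field points into the image and average the pixel distance to the detected points. It assumes a planar 3x3 homography `H` as the camera model, which is a simplification and not BroadTrack's camera and tripod model; all names and the example values are hypothetical.

```python
import math

def project(H, point):
    """Map a 2D field point to image coordinates with a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by the homogeneous coordinate

def mean_reprojection_error(H, field_points, image_points):
    """Average pixel distance between projected field points and detections."""
    errors = [math.dist(project(H, p), q) for p, q in zip(field_points, image_points)]
    return sum(errors) / len(errors)

# Identity homography and detections offset by (3, 4) pixels.
H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(mean_reprojection_error(H, [(0, 0), (10, 5)], [(3, 4), (13, 9)]))
```

A tracker that exploits temporal consistency would aim to keep this quantity low across consecutive frames rather than for each frame in isolation.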

[CV-25] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

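Quick read: This paper addresses the growing complexity of multimodal information retrieval tasks. Existing methods rely mainly on task-specific fine-tuning of vision-language models, typically ones trained with image-text contrastive learning. The paper proposes re-purposing generative Large Multimodal Models (LMMs) for retrieval, which unifies all retrieval tasks under one formulation and handles unseen retrieval tasks without additional training. The key points of the solution are: (i) LamRA, a versatile framework that gives LMMs advanced retrieval and reranking capabilities; (ii) a two-stage training strategy for retrieval, consisting of language-only pre-training and multimodal instruction tuning, that progressively improves retrieval performance; (iii) joint training of pointwise and listwise reranking, offering two distinct ways to further improve retrieval; (iv) experiments showing strong results on more than ten retrieval tasks in both supervised and zero-shot settings, including previously unseen retrieval tasks.

Link: https://arxiv.org/abs/2412.01720
Authors: Yikun Liu,Pingan Chen,Jiayin Cai,Xiaolong Jiang,Yao Hu,Jiangchao Yao,Yanfeng Wang,Weidi Xie
Keywords: increasingly complex retrieval, Large Multimodal Models, multimodal information retrieval, retrieval tasks, retrieval
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: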

Abstract:With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMM’s retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
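
The retrieve-then-rerank pattern mentioned in contributions (ii) and (iii) can be illustrated with a generic sketch. Nothing here comes from LamRA itself: the embeddings are hand-written vectors, and `score_fn` is a hypothetical stand-in for whatever relevance score a reranking model would produce for a single candidate (the pointwise case).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, corpus, k):
    """First stage: top-k candidate ids ranked by embedding similarity."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]), reverse=True)
    return ranked[:k]

def rerank_pointwise(candidates, score_fn):
    """Second stage: score each candidate independently and re-sort."""
    return sorted(candidates, key=score_fn, reverse=True)

# Hypothetical embeddings and reranker scores, for illustration only.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
top = retrieve([1.0, 0.0], corpus, k=2)
reranker_scores = {"a": 0.2, "b": 0.7}
print(top, rerank_pointwise(top, reranker_scores.get))
```

A listwise reranker would instead look at all candidates together and emit an ordering in one pass; the pointwise variant shown here is only the simpler of the two to sketch.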

[CV-26] HUGSIM: A Real-Time Photo-Realistic and Closed-Loop Simulator for Autonomous Driving

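Quick read: This paper addresses a limitation of current evaluation methods for autonomous driving algorithms: evaluating individual components does not fully reflect the performance of the whole system. The key to the solution is HUGSIM, a closed-loop, photo-realistic, real-time simulator for holistic evaluation of autonomous driving algorithms. HUGSIM lifts captured 2D RGB images into 3D space with 3D Gaussian Splatting, improves rendering quality for closed-loop scenarios, and builds the closed-loop environment. It also tackles the challenges of novel view synthesis in closed-loop scenarios, including viewpoint extrapolation and 360-degree vehicle rendering, and implements the full closed simulation loop, dynamically updating the states and observations of the ego vehicle and other actors. In addition, HUGSIM provides a comprehensive benchmark covering more than 70 sequences from multiple datasets and over 400 varying scenarios, giving existing autonomous driving algorithms a fair and realistic evaluation platform.

Link: https://arxiv.org/abs/2412.01718
Authors: Hongyu Zhou,Longzhong Lin,Jiabao Wang,Yichong Lu,Dongfeng Bai,Bingbing Liu,Yue Wang,Andreas Geiger,Yiyi Liao
Keywords: made significant progress, autonomous driving algorithms, past few decades, progress in perception, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Our project page is at this https URL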

Abstract:In the past few decades, autonomous driving algorithms have made significant progress in perception, planning, and control. However, evaluating individual components does not fully reflect the performance of entire systems, highlighting the need for more holistic assessment methods. This motivates the development of HUGSIM, a closed-loop, photo-realistic, and real-time simulator for evaluating autonomous driving algorithms. We achieve this by lifting captured 2D RGB images into the 3D space via 3D Gaussian Splatting, improving the rendering quality for closed-loop scenarios, and building the closed-loop environment. In terms of rendering, we tackle challenges of novel view synthesis in closed-loop scenarios, including viewpoint extrapolation and 360-degree vehicle rendering. Beyond novel view synthesis, HUGSIM further enables the full closed simulation loop, dynamically updating the ego and actor states and observations based on control commands. Moreover, HUGSIM offers a comprehensive benchmark across more than 70 sequences from KITTI-360, Waymo, nuScenes, and PandaSet, along with over 400 varying scenarios, providing a fair and realistic evaluation platform for existing autonomous driving algorithms. HUGSIM not only serves as an intuitive evaluation benchmark but also unlocks the potential for fine-tuning autonomous driving algorithms in a photorealistic closed-loop setting.

[CV-27] Driving Scene Synthesis on Free-form Trajectories with Generative Prior

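Quick read: This paper addresses driving scene synthesis along free-form trajectories, which is needed for closed-loop evaluation of end-to-end driving policies in simulation. Existing methods do well at novel view synthesis on recorded trajectories but struggle with novel trajectories, mainly because driving videos offer limited views and driving environments are vast. The key to the solution is DriveX, a new method that uses a video generative prior to optimize a 3D model across many trajectories. Specifically, the paper formulates an inverse problem that lets a video diffusion model act as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). By repeating this process iteratively during optimization, the resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. DriveX can also be used to simulate virtual driving worlds from AI-generated videos.

Link: https://arxiv.org/abs/2412.01717
Authors: Zeyu Yang,Zijie Pan,Yuankun Yang,Xiatian Zhu,Li Zhang
Keywords: enable closed-loop evaluation, Driving, closed-loop evaluation, driving view synthesis, driving policies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: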

Abstract:Driving scene synthesis along free-form trajectories is essential for driving simulations to enable closed-loop evaluation of end-to-end driving policies. While existing methods excel at novel view synthesis on recorded trajectories, they face challenges with novel trajectories due to limited views of driving videos and the vastness of driving environments. To tackle this challenge, we propose a novel free-form driving view synthesis approach, dubbed DriveX, by leveraging video generative prior to optimize a 3D model across a variety of trajectories. Concretely, we crafted an inverse problem that enables a video diffusion model to be utilized as a prior for many-trajectory optimization of a parametric 3D model (e.g., Gaussian splatting). To seamlessly use the generative prior, we iteratively conduct this process during optimization. Our resulting model can produce high-fidelity virtual driving environments outside the recorded trajectory, enabling free-form trajectory driving simulation. Beyond real driving scenes, DriveX can also be utilized to simulate virtual driving worlds from AI-generated videos.

[CV-28] Uncertainty-Aware Regularization for Image-to-Image Translation WACV2025

Quick read: This paper addresses uncertainty quantification for deep networks in medical Image-to-Image (I2I) translation. The key to the solution is a method that combines aleatoric uncertainty with Uncertainty-Aware Regularization (UAR). By constraining parameters with simple priors, the method produces more robust uncertainty maps that indicate precisely where the network has difficulty while being less affected by noise. Experiments show that UAR improves translation performance and also gives markedly better uncertainty estimates in the presence of noise and artifacts.

Link: https://arxiv.org/abs/2412.01705
Authors: Anuja Vats,Ivar Farup,Marius Pedersen,Kiran Raja
Keywords: reliable real-world applications, real-world applications, importance of quantifying, paramount for reliable, reliable real-world
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted WACV 2025

Abstract:The importance of quantifying uncertainty in deep networks has become paramount for reliable real-world applications. In this paper, we propose a method to improve uncertainty estimation in medical Image-to-Image (I2I) translation. Our model integrates aleatoric uncertainty and employs Uncertainty-Aware Regularization (UAR) inspired by simple priors to refine uncertainty estimates and enhance reconstruction quality. We show that by leveraging simple priors on parameters, our approach captures more robust uncertainty maps, effectively refining them to indicate precisely where the network encounters difficulties, while being less affected by noise. Our experiments demonstrate that UAR not only improves translation performance, but also provides better uncertainty estimations, particularly in the presence of noise and artifacts. We validate our approach using two medical imaging datasets, showcasing its effectiveness in maintaining high confidence in familiar regions while accurately identifying areas of uncertainty in novel/ambiguous scenarios.
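
As background on aleatoric uncertainty, a widely used formulation has the network predict a per-pixel mean and log-variance and trains it with a Gaussian negative log-likelihood. The sketch below shows that standard loss only. It is an assumption that the paper uses this exact form, and the UAR regularizer itself is not reproduced.

```python
import math

def heteroscedastic_nll(target, mean, log_var):
    """Per-pixel Gaussian negative log-likelihood, up to an additive constant.

    The squared error is down-weighted where the predicted variance is high,
    while the 0.5 * log_var term penalizes claiming high variance everywhere.
    """
    return 0.5 * math.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var

# Same prediction error, two different predicted variances.
confident = heteroscedastic_nll(target=3.0, mean=1.0, log_var=0.0)
uncertain = heteroscedastic_nll(target=3.0, mean=1.0, log_var=math.log(4.0))
print(confident, uncertain)
```

The second value is lower because admitting uncertainty on a hard pixel is cheaper than being confidently wrong, which is what lets the predicted variance serve as an uncertainty map.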

[CV-29] Deep Guess acceleration for explainable image reconstruction in sparse-view CT

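Quick read: This paper addresses the poor reconstruction quality in sparse-view Computed Tomography (CT) that results from sparse data. The traditional Filtered Back Projection algorithm produces severe artifacts on sparse data, while Model-Based Iterative Reconstruction (MBIR) algorithms reduce noise through regularization but are too computationally costly for clinical use. The key to the proposed solution is a new technique called the Deep Guess acceleration scheme, which uses a trained neural network to speed up regularized MBIR and improve reconstruction accuracy. Specifically, deep learning tools generate a clever starting guess for a proximal algorithm that solves a non-convex model, so that an interpretable solution image is computed in a few iterations. Experiments show that the method performs well in very sparse tomographic protocols, outperforming its purely variational counterpart and many state-of-the-art data-driven approaches. The paper also considers a ground truth-free implementation and tests the robustness of the framework to noise.

Link: https://arxiv.org/abs/2412.01703
Authors: Elena Loli Piccolomini,Davide Evangelista,Elena Morotti
Keywords: Sparse-view Computed Tomography, reduce X-ray dose, X-ray dose radiation, Sparse-view Computed, Computed Tomography
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: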

Abstract:Sparse-view Computed Tomography (CT) is an emerging protocol designed to reduce X-ray dose radiation in medical imaging. Traditional Filtered Back Projection algorithm reconstructions suffer from severe artifacts due to sparse data. In contrast, Model-Based Iterative Reconstruction (MBIR) algorithms, though better at mitigating noise through regularization, are too computationally costly for clinical use. This paper introduces a novel technique, denoted as the Deep Guess acceleration scheme, using a trained neural network both to quicken the regularized MBIR and to enhance the reconstruction accuracy. We integrate state-of-the-art deep learning tools to initialize a clever starting guess for a proximal algorithm solving a non-convex model and thus computing an interpretable solution image in a few iterations. Experimental results on real CT images demonstrate the Deep Guess effectiveness in (very) sparse tomographic protocols, where it overcomes its mere variational counterpart and many data-driven approaches at the state of the art. We also consider a ground truth-free implementation and test the robustness of the proposed framework to noise.
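
The idea of accelerating a proximal algorithm with a good starting guess can be shown on a toy problem. The sketch below runs proximal gradient descent (ISTA) on a small convex L1-regularized least-squares problem and counts iterations from a cold start and from a warm start. This is a deliberately swapped-in stand-in: the paper solves a non-convex model and obtains its guess from a trained network, whereas here the "guess" is simply a vector passed in, and all names and values are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def ista(A, b, lam, x0, step, tol=1e-8, max_iter=10000):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1; return (solution, iterations used)."""
    At = transpose(A)
    x = list(x0)
    for k in range(1, max_iter + 1):
        residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        grad = matvec(At, residual)
        x_new = soft_threshold([xi - step * g for xi, g in zip(x, grad)], step * lam)
        change = math.sqrt(sum((p - q) ** 2 for p, q in zip(x_new, x)))
        x = x_new
        if change < tol:
            return x, k
    return x, max_iter

A = [[1.0, 0.5], [0.5, 1.0]]
b = [1.0, 2.0]
x_cold, iters_cold = ista(A, b, lam=0.1, x0=[0.0, 0.0], step=0.4)
# Pretend a learned model supplied a guess already close to the solution.
x_warm, iters_warm = ista(A, b, lam=0.1, x0=x_cold, step=0.4)
print(iters_cold, iters_warm)
```

The iterative solver and its regularized objective are unchanged by the warm start, which is why the result stays interpretable as the solution of a variational model while needing fewer iterations.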

[CV-30] FathomVerse: A community science dataset for ocean animal discovery

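Quick read: This paper addresses the use of computer vision for recognizing deep-sea animals, in particular those that people rarely encounter in their terrestrial lives. The key to the solution is the FathomVerse v0 detection dataset, which contains 3843 images with 8092 bounding boxes from 12 distinct morphological groups of deep-sea animals that pose a novel challenge for computer vision. The images show visually perplexing scenes, such as an octopus intertwined with a sea star, and confounding categories such as vampire squids and sea spiders. The dataset is intended to advance research on fine-grained transfer learning, novel category discovery, species distribution modeling, and carbon cycle analysis, all of which matter for the care and stewardship of the planet.

Link: https://arxiv.org/abs/2412.01701
Authors: Genevieve Patterson,Joost Daniels,Benjamin Woodward,Kevin Barnard,Giovanna Sainz,Lonny Lundsten,Kakani Katija
Keywords: explore the ocean, computer vision, vision, computer, cs.CV
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 14 figures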

Abstract:Can computer vision help us explore the ocean? The ultimate challenge for computer vision is to recognize any visual phenomena, more than only the objects and animals humans encounter in their terrestrial lives. Previous datasets have explored everyday objects and fine-grained categories humans see frequently. We present the FathomVerse v0 detection dataset to push the limits of our field by exploring animals that rarely come in contact with people in the deep sea. These animals present a novel vision challenge. The FathomVerse v0 dataset consists of 3843 images with 8092 bounding boxes from 12 distinct morphological groups recorded at two locations on the deep seafloor that are new to computer vision. It features visually perplexing scenarios such as an octopus intertwined with a sea star, and confounding categories like vampire squids and sea spiders. This dataset can push forward research on topics like fine-grained transfer learning, novel category discovery, species distribution modeling, and carbon cycle analysis, all of which are important to the care and husbandry of our planet.

[CV-31] Unlocking Video-LLM via Agent -of-Thoughts Distillation

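Quick read: This paper addresses multi-step reasoning and the understanding of spatial-temporal dynamics in video question answering (VideoQA). Existing large video-language models perform well on benchmarks but often lack explainability and spatial-temporal grounding. The key to the solution is Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, AoTD uses an agent-based system to decompose complex questions into sub-tasks, handles them with specialized vision models, and treats the intermediate results as reasoning chains. A verification mechanism based on a large language model (LLM) is added to ensure the reliability of the generated CoTs. Experiments show that AoTD improves performance on multiple-choice and open-ended benchmarks.

Link: https://arxiv.org/abs/2412.01694
Authors: Yudi Shi,Shangzhe Di,Qirui Chen,Weidi Xie
Keywords: video question answering, requires multi-step reasoning, tackles the problem, problem of video, requires multi-step
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: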

Abstract:This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks, and address them with specialized vision models, the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.

[CV-32] Diffusion Models with Anisotropic Gaussian Splatting for Image Inpainting

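Quick read: This paper addresses the challenge of maintaining structural continuity and generating coherent textures in image inpainting when large regions are missing. The key to the solution is combining diffusion models with anisotropic Gaussian splatting: missing regions are modeled with anisotropic Gaussian functions that adapt to local image gradients, which gives structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, strengthening the model's ability to generate high-fidelity and structurally coherent inpainting results.

Link: https://arxiv.org/abs/2412.01682
Authors: Jacob Fein-Ashley,Benjamin Fein-Ashley
Keywords: aiming to restore, fundamental task, images realistically, computer vision, restore missing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: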

Abstract:Image inpainting is a fundamental task in computer vision, aiming to restore missing or corrupted regions in images realistically. While recent deep learning approaches have significantly advanced the state-of-the-art, challenges remain in maintaining structural continuity and generating coherent textures, particularly in large missing areas. Diffusion models have shown promise in generating high-fidelity images but often lack the structural guidance necessary for realistic inpainting. We propose a novel inpainting method that combines diffusion models with anisotropic Gaussian splatting to capture both local structures and global context effectively. By modeling missing regions using anisotropic Gaussian functions that adapt to local image gradients, our approach provides structural guidance to the diffusion-based inpainting network. The Gaussian splat maps are integrated into the diffusion process, enhancing the model’s ability to generate high-fidelity and structurally coherent inpainting results. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques, producing visually plausible results with enhanced structural integrity and texture realism.
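
A rough sketch of what an anisotropic Gaussian that adapts to a local image gradient could look like is given below. It is only an illustration under stated assumptions: the long axis is taken to run along the edge (perpendicular to the gradient), the two standard deviations are hand-picked, and none of the names come from the paper.

```python
import math

def edge_orientation(gx, gy):
    """Angle of the edge direction, assumed perpendicular to the gradient (gx, gy)."""
    return math.atan2(gy, gx) + math.pi / 2

def anisotropic_gaussian(dx, dy, theta, sigma_long, sigma_short):
    """Weight at offset (dx, dy) for a 2D Gaussian whose long axis has angle theta."""
    # Rotate the offset into the Gaussian's own axes.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return math.exp(-0.5 * ((u / sigma_long) ** 2 + (v / sigma_short) ** 2))

# Horizontal gradient means a vertical edge, so the kernel stretches along y.
theta = edge_orientation(1.0, 0.0)
along_edge = anisotropic_gaussian(0.0, 1.0, theta, sigma_long=2.0, sigma_short=0.5)
across_edge = anisotropic_gaussian(1.0, 0.0, theta, sigma_long=2.0, sigma_short=0.5)
print(along_edge, across_edge)
```

Evaluating such weights over a missing region would give a map that spreads information further along edges than across them, which is the kind of structural cue the abstract describes feeding into the diffusion process.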

[CV-33] Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

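Quick read: This paper addresses the limitations of data augmentation in self-supervised learning (SSL), especially in specialized domains such as histopathology where large-scale image-text datasets are not available. The key to the solution is Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, with no reliance on external sources of supervision such as text captions. The steps are: first train an initial SSL encoder using hand-crafted augmentations, then train a diffusion model conditioned on that encoder's embeddings. After training, the diffusion model can synthesize diverse views from the embedding of a source image, which helps train a stronger SSL encoder. The paper also proposes a new pretext task: disentangling the two source images of a synthetic image obtained by interpolating in the encoder latent space. Experiments show that the approach improves performance on downstream tasks for both natural images and digital histopathology images.

Link: https://arxiv.org/abs/2412.01672
Authors: Varun Belagali,Srikar Yellapragada,Alexandros Graikos,Saarthak Kapse,Zilinghan Li,Tarak Nath Nandi,Ravi K Madduri,Prateek Prasanna,Joel Saltz,Dimitris Samaras
Keywords: Self-supervised learning, strong visual representation, visual representation learners, SSL, SSL encoder
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL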

Abstract:Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these ‘self-augmentations’, i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS’s effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.

[CV-34] Robust and Transferable Backdoor Attacks Against Deep Image Compression With Selective Frequency Prior

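[Quick read]: This paper addresses backdoor attacks on deep-learning-based image compression models, specifically embedding triggers in the DCT (discrete cosine transform) domain so that a single model can carry multiple triggers. The key to the solution is a frequency-domain trigger injection model, optimized with a dynamic loss function and a two-stage training schedule. The attack objectives include degrading compression quality, targeting downstream tasks such as face recognition and semantic segmentation, and resisting defensive preprocessing. Adjusting the classification boundary in the attack loss further improves transferability across models and domains.

Link: https://arxiv.org/abs/2412.01646
Authors: Yi Yu,Yufei Wang,Wenhan Yang,Lanqing Guo,Shijian Lu,Ling-Yu Duan,Yap-Peng Tan,Alex C. Kot
Keywords: surpassed traditional methods, Recent advancements, learning-based compression techniques, deep learning-based compression, traditional methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Accepted by IEEE TPAMI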

Abstract:Recent advancements in deep learning-based compression techniques have surpassed traditional methods. However, deep neural networks remain vulnerable to backdoor attacks, where pre-defined triggers induce malicious behaviors. This paper introduces a novel frequency-based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models. Inspired by the widely used DCT in compression codecs, triggers are embedded in the DCT domain. We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit-rate and reconstruction accuracy; 2) targeting task-driven measures like face recognition and semantic segmentation. To improve training efficiency, we propose a dynamic loss function that balances loss terms with fewer hyper-parameters, optimizing attack objectives effectively. For advanced scenarios, we evaluate the attack’s resistance to defensive preprocessing and propose a two-stage training schedule with robust frequency selection to enhance resilience. To improve cross-model and cross-domain transferability for downstream tasks, we adjust the classification boundary in the attack loss during training. Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model, demonstrating strong performance and versatility. (*Due to the notification of arXiv “The Abstract field cannot be longer than 1,920 characters”, the appeared Abstract is shortened. For the full Abstract, please download the Article.)
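
To make the idea of a DCT-domain trigger concrete, here is a minimal sketch in plain Python. It is not the paper's trigger injection model (which is learned); it only assumes a 1D signal, an orthonormal DCT-II, and a hypothetical helper `add_trigger` that bumps one chosen frequency coefficient by a fixed amount.

```python
import math


def dct_ii(x):
    """Orthonormal 1D DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ii(c):
    """Inverse of dct_ii (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def add_trigger(signal, freq_index, strength):
    """Hypothetical helper: perturb one DCT coefficient, return the signal."""
    coeffs = dct_ii(signal)
    coeffs[freq_index] += strength
    return idct_ii(coeffs)


signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
triggered = add_trigger(signal, freq_index=5, strength=0.25)
# Energy of the change in the signal domain; equals strength**2 by Parseval.
perturbation_energy = sum((a - b) ** 2 for a, b in zip(triggered, signal))
```

Because the transform is orthonormal, a change confined to one frequency coefficient spreads over all samples while keeping the total perturbation energy equal to `strength ** 2`, which is one reason frequency-domain triggers can be kept visually small.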

[CV-35] AVS-Net: Audio-Visual Scale Net for Self-supervised Monocular Metric Depth Estimation

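[Quick read]: This paper addresses poor cross-dataset generalization and the lack of correct scale in depth prediction from monocular video. The key is to use acoustic echoes as auxiliary information to improve accuracy and scale consistency. The paper shows how echoes can be combined with visual information both for learning metric depth from supervised data and as a supervisory signal for scale-correct self-supervised training. The approach improves the predictions of several state-of-the-art methods and can scale-correct a self-supervised depth framework.

Link: https://arxiv.org/abs/2412.01637
Authors: Xiaohu Liu,Sascha Hornauer,Fabien Moutarde,Jialiang Lu
Keywords: monocular videos suffers, suffers from bad, bad generalization, generalization between datasets, datasets and requires
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: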

Abstract:Metric depth prediction from monocular videos suffers from bad generalization between datasets and requires supervised depth data for scale-correct training. Self-supervised training using multi-view reconstruction can benefit from large-scale natural videos but does not provide correct scale, limiting its benefits. Recently, reflecting audible Echoes off objects has been investigated for improved depth prediction and was shown to be sufficient to reconstruct objects at scale even without a visual signal. Because Echoes travel at a fixed speed, they have the potential to resolve ambiguities in object scale and appearance. However, predicting depth end-to-end from sound and vision cannot benefit from unsupervised depth prediction approaches, which can process large-scale data without sound annotation. In this work we show how Echoes can benefit depth prediction in two ways: when learning metric depth from supervised data, and as a supervisory signal for scale-correct self-supervised training. We show how we can improve the predictions of several state-of-the-art approaches and how the method can scale-correct a self-supervised depth approach.
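
The remark that echoes travel at a fixed speed can be illustrated with a small time-of-flight sketch. This is not the AVS-Net method: it assumes a co-located emitter and microphone, a speed of sound of roughly 343 m/s (dry air at about 20 °C), and a hypothetical median-based rescaling of an up-to-scale depth map.

```python
from statistics import median

# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def echo_distance(round_trip_seconds, speed=SPEED_OF_SOUND_M_PER_S):
    """Distance to a reflector from the round-trip time of its echo."""
    return speed * round_trip_seconds / 2.0


def scale_correct(relative_depths, metric_reference):
    """Hypothetical helper: rescale up-to-scale depths so that their
    median matches a metric reference distance (e.g. from an echo)."""
    factor = metric_reference / median(relative_depths)
    return [d * factor for d in relative_depths]


reference = echo_distance(0.02)            # echo heard 20 ms after emission
metric = scale_correct([1.0, 2.0, 3.0], 4.0)
```

A single metric distance is enough to fix the global scale of a self-supervised depth map, which is why even a coarse acoustic measurement can be useful as a supervisory signal.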

[CV-36] Image Forgery Localization via Guided Noise and Multi-Scale Feature Aggregation

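[Quick read]: This paper addresses feature degradation in Image Forgery Localization (IFL): existing methods trained with multi-layer convolutions or self-attention perform poorly on small forged regions and are not robust to post-processing. The key is a guided, multi-scale feature aggregation network. A noise extraction module learns, in a guided way, the noise features left by different types of forgery. A Feature Aggregation Module (FAM) then uses dynamic convolution to adaptively aggregate RGB and noise features over multiple scales. An Atrous Residual Pyramid Module (ARPM) strengthens the feature representation and captures both global and local features through different receptive fields, improving localization accuracy and robustness. Experiments on several public datasets show that the model outperforms existing state-of-the-art methods, especially on images with small forged regions.

Link: https://arxiv.org/abs/2412.01622
Authors: Yakun Niu,Pei Chen,Lei Zhang,Lei Tan,Yingjian Chen
Keywords: technology aims, digital forensics, aims to detect, detect and locate, Forgery Localization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 36 pages, 6 figures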

Abstract:Image Forgery Localization (IFL) technology aims to detect and locate the forged areas in an image, which is very important in the field of digital forensics. However, existing IFL methods suffer from feature degradation during training using multi-layer convolutions or the self-attention mechanism, and perform poorly in detecting small forged regions and in robustness against post-processing. To tackle these, we propose a guided and multi-scale feature aggregated network for IFL. Specifically, in order to comprehensively learn the noise feature under different types of forgery, we develop an effective noise extraction module in a guided way. Then, we design a Feature Aggregation Module (FAM) that uses dynamic convolution to adaptively aggregate RGB and noise features over multiple scales. Moreover, we propose an Atrous Residual Pyramid Module (ARPM) to enhance feature representation and capture both global and local features using different receptive fields to improve the accuracy and robustness of forgery localization. Extensive experiments on 5 public datasets have shown that our proposed model outperforms several state-of-the-art methods, especially on images with small forged regions.

[CV-37] CRAYM: Neural Field Optimization via Camera RAY Matching NEURIPS2024

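[Quick read]: This paper addresses the joint optimization of camera poses and neural fields from multi-view images for novel view synthesis (NVS) and 3D geometry reconstruction. The key is camera ray matching (CRAYM): camera rays are parameterized by the optimized feature volume so that they carry both geometric and photometric information. Multi-view consistency and scene rendering can then be integrated naturally into the joint optimization and network training, imposing physically meaningful constraints that improve both geometric reconstruction and photorealistic rendering. The method focuses on camera rays passing through keypoints in the input images to improve the efficiency and accuracy of scene correspondences, and uses ray features accumulated along the feature volume to discount the coherence constraint when rays are matched erroneously.

Link: https://arxiv.org/abs/2412.01618
Authors: Liqiang Lin,Wenpeng Wu,Chi-Wing Fu,Hao Zhang,Hui Huang
Keywords: camera rays, introduce camera ray, camera, matching camera rays, camera ray matching
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Published in NeurIPS 2024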

Abstract:We introduce camera ray matching (CRAYM) into the joint optimization of camera poses and neural fields from multi-view images. The optimized field, referred to as a feature volume, can be “probed” by the camera rays for novel view synthesis (NVS) and 3D geometry reconstruction. One key reason for matching camera rays, instead of pixels as in prior works, is that the camera rays can be parameterized by the feature volume to carry both geometric and photometric information. Multi-view consistencies involving the camera rays and scene rendering can be naturally integrated into the joint optimization and network training, to impose physically meaningful constraints to improve the final quality of both the geometric reconstruction and photorealistic rendering. We formulate our per-ray optimization and matched ray coherence by focusing on camera rays passing through keypoints in the input images to elevate both the efficiency and accuracy of scene correspondences. Accumulated ray features along the feature volume provide a means to discount the coherence constraint amid erroneous ray matching. We demonstrate the effectiveness of CRAYM for both NVS and geometry reconstruction, over dense- or sparse-view settings, with qualitative and quantitative comparisons to state-of-the-art alternatives.

[CV-38] OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

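[Quick read]: This paper addresses the risks that generative AI image editing poses to the authenticity and integrity of digital content. Existing watermarking techniques trade tamper-localization precision against visual quality, and their copyright extraction accuracy under AIGC editing is unsatisfactory. The key is OmniGuard, an augmented versatile watermarking method that combines proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard uses a hybrid forensic framework that allows flexible selection of the localization watermark, and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. A lightweight AIGC-editing simulation layer improves robustness to global and local edits. Experiments show that OmniGuard outperforms the recent state-of-the-art method EditGuard in fidelity, robustness, and flexibility.

Link: https://arxiv.org/abs/2412.01615
Authors: Xuanyu Zhang,Zecheng Tang,Zhipei Xu,Runyi Li,Youmin Xu,Bin Chen,Feng Gao,Jian Zhang
Keywords: digital content, rapid growth, growth of generative, widespread application, risks have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report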

Abstract:With the rapid growth of generative AI and its widespread application in image editing, new risks have emerged regarding the authenticity and integrity of digital content. Existing versatile watermarking approaches suffer from trade-offs between tamper localization precision and visual quality. Constrained by the limited flexibility of previous frameworks, their localized watermark must remain fixed across all images. Under AIGC-editing, their copyright extraction accuracy is also unsatisfactory. To address these challenges, we propose OmniGuard, a novel augmented versatile watermarking approach that integrates proactive embedding with passive, blind extraction for robust copyright protection and tamper localization. OmniGuard employs a hybrid forensic framework that enables flexible localization watermark selection and introduces a degradation-aware tamper extraction network for precise localization under challenging conditions. Additionally, a lightweight AIGC-editing simulation layer is designed to enhance robustness across global and local editing. Extensive experiments show that OmniGuard achieves superior fidelity, robustness, and flexibility. Compared to the recent state-of-the-art approach EditGuard, our method outperforms it by 4.25dB in PSNR of the container image, 20.7% in F1-Score under noisy conditions, and 14.8% in average bit accuracy.

[CV-39] Arabic Handwritten Document OCR Solution with Binarization and Adaptive Scale Fusion Detection

[Quick read]: This paper addresses Arabic Handwritten Text Recognition (AHTR), which is challenging because of diverse handwriting styles and limited labeled data. The key is a complete OCR pipeline: line segmentation with Differentiable Binarization and Adaptive Scale Fusion to detect text lines accurately, followed by a CNN-BiLSTM-CTC architecture for character recognition. Trained on the Arabic Multi-Fonts Dataset (AMFDS), the system reaches a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples of 7 to 10 characters, and a CRR of 83.76% on sentences. The authors present these results as a new benchmark for AHTR systems.

Link: https://arxiv.org/abs/2412.01601
Authors: Alhossien Waly,Bassant Tarek,Ali Feteha,Rewan Yehia,Gasser Amr,Ahmed Fares
Keywords: widely researched topic, academia and industry, handwritten Text Recognition, Adaptive Scale Fusion, problem of converting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting styles and limited labeled data. In this paper we present a complete OCR pipeline that starts with line segmentation using Differentiable Binarization and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Following segmentation, a CNN-BiLSTM-CTC architecture is applied to recognize characters. Our system, trained on the Arabic Multi-Fonts Dataset (AMFDS), achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters, along with a CRR of 83.76% for sentences. These results demonstrate the system’s strong performance in handling Arabic scripts, establishing a new benchmark for AHTR systems.
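
The CTC part of a CNN-BiLSTM-CTC recognizer can be illustrated with the standard greedy decoding rule: take the best class per frame, collapse consecutive repeats, then drop blanks. The sketch below is a generic illustration, not the authors' code; the three-symbol alphabet and the per-frame scores are made up, and index 0 is assumed to be the blank.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    frame_scores: one list of class scores per time step.
    alphabet: symbol for each class index (the blank entry is unused).
    """
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    decoded = []
    previous = None
    for index in best:
        # Emit a symbol only when it starts a new run and is not the blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Hypothetical alphabet: index 0 is the CTC blank.
alphabet = ["", "a", "b"]
scores = [
    [0.1, 0.8, 0.1],  # a
    [0.2, 0.7, 0.1],  # a (repeat, collapsed)
    [0.9, 0.0, 0.1],  # blank
    [0.1, 0.6, 0.3],  # a (new run after the blank)
    [0.1, 0.2, 0.7],  # b
    [0.0, 0.1, 0.9],  # b (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
]
text = ctc_greedy_decode(scores, alphabet)
```

The blank between the second and fourth frames is what lets the decoder output a doubled character; without it the two runs of `a` would collapse into one.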

[CV-40] FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection

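[Quick read]: This paper addresses the overconfidence of modern machine learning models on Out-of-Distribution (OOD) samples, which leads to unpredictable behaviour in open-set environments. The key is to identify and eliminate an inherent vulnerability of the free energy score used for OOD detection. The paper shows that in-distribution and OOD samples can have different feature representations yet identical free energy scores when the feature-space difference vector lies in the null space of the last layer of a neural classifier. To mitigate this, FEVER-OOD explores lower-dimensional feature spaces to shrink the null space, and adds a regularizer that maximizes the least singular value of the final linear layer, which increases the free energy separation between samples. Experiments show that FEVER-OOD achieves state-of-the-art OOD detection on Imagenet-100.

Link: https://arxiv.org/abs/2412.01596
Authors: Brian K.S. Isaac-Medina,Mauricio Che,Yona F.A. Gaus,Samet Akcay,Toby P. Breckon
Keywords: Modern machine learning, computer vision tasks, Modern machine, free energy score, free energy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 15 figures, 4 tables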

Abstract:Modern machine learning models that excel at computer vision tasks such as classification and object detection are often overconfident in their predictions for Out-of-Distribution (OOD) examples, resulting in unpredictable behaviour for open-set environments. Recent works have demonstrated that the free energy score is an effective measure of uncertainty for OOD detection given its close relationship to the data distribution. However, despite free energy-based methods representing a significant empirical advance in OOD detection, our theoretical analysis reveals previously unexplored and inherent vulnerabilities within the free energy score formulation such that in-distribution and OOD instances can have distinct feature representations yet identical free energy scores. This phenomenon occurs when the vector direction representing the feature space difference between the in-distribution and OOD sample lies within the null space of the last layer of a neural-based classifier. To mitigate these issues, we explore lower-dimensional feature spaces to reduce the null space footprint and introduce novel regularisation to maximize the least singular value of the final linear layer, hence enhancing inter-sample free energy separation. We refer to these techniques as Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection (FEVER-OOD). Our experiments show that FEVER-OOD techniques achieve state-of-the-art OOD detection on Imagenet-100, with average OOD false positive rate (at 95% true positive rate) of 35.83% when used with the baseline Dream-OOD model.
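
The null-space vulnerability can be reproduced in a few lines. The sketch below assumes the usual free energy score F = -T · log Σ_k exp(z_k / T) over the logits z = W h + b of a final linear layer; the 2×3 weight matrix, the feature vectors, and the helper names are made up for illustration.

```python
import math


def logits(weights, bias, features):
    """Final linear layer: z = W h + b."""
    return [
        sum(w * h for w, h in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def free_energy(z, temperature=1.0):
    """F = -T * log(sum_k exp(z_k / T)), computed stably."""
    m = max(z)
    log_sum = math.log(sum(math.exp((v - m) / temperature) for v in z))
    return -(m + temperature * log_sum)


# Toy layer with 3-D features and 2 classes, so W has a 1-D null space.
W = [[1.0, 2.0, 1.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0]
null_direction = [-3.0, 1.0, 1.0]          # W @ null_direction == [0, 0]

h_a = [0.5, 1.0, -2.0]                     # one feature vector
h_b = [h + 4.0 * n for h, n in zip(h_a, null_direction)]  # far away in feature space

energy_a = free_energy(logits(W, b, h_a))
energy_b = free_energy(logits(W, b, h_b))  # identical to energy_a
```

The two feature vectors differ by a large step, yet the classifier and the energy score cannot tell them apart. Shrinking the feature dimension reduces the room for such directions, and a larger least singular value of `W` means that differences outside the null space change the logits more.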

[CV-41] Epipolar Attention Field Transformers for Birds Eye View Semantic Segmentation WACV2025

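[Quick read]: This paper addresses the dependence of transformer-based methods on learned positional encodings when extracting a bird's eye view (BEV) from multiple cameras. The key is to model the relationship between the cameras and the BEV with epipolar geometric constraints: Epipolar Attention Fields are incorporated into the attention mechanism as a novel attribution term that replaces learned positional encodings. Experiments show that the method, EAFormer, outperforms previous BEV approaches by 2% mIoU on map semantic segmentation and generalizes better than implicitly learning the camera configuration.

Link: https://arxiv.org/abs/2412.01595
Authors: Christian Witte,Jens Behley,Cyrill Stachniss,Marvin Raaijmakers
Keywords: safe driving decisions, key capability needed, enable safe driving, driving decisions, key capability
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at WACV 2025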

Abstract:Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision-based solutions have gained increasing research interest. In particular, approaches extracting a bird’s eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer-based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.

[CV-42] NCDD: Nearest Centroid Distance Deficit for Out-Of-Distribution Detection in Gastrointestinal Vision

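[Quick read]: This paper addresses the overconfident predictions of deep learning tools for gastrointestinal image diagnosis, which seriously undermine reliability when unseen or newly emerging disease patterns appear. The key is to frame the problem as Out-of-Distribution (OOD) detection and to propose a new Nearest Centroid Distance Deficit (NCDD) score. The score exploits the difference between how in-distribution (ID) and OOD samples sit relative to the class centroids in feature space: ID samples tend to lie close to their nearest centroid, whereas OOD samples are roughly equidistant from all of them. Experiments on several deep learning architectures and two public benchmarks, Kvasir2 and Gastrovision, show that the score outperforms existing state-of-the-art methods.

Link: https://arxiv.org/abs/2412.01590
Authors: Sandesh Pokhrel,Sanjay Bhandari,Sharib Ali,Tryphon Lambrou,Anh Nguyen,Yash Raj Shrestha,Angus Watson,Danail Stoyanov,Prashnna Gyawali,Binod Bhattarai
Keywords: advancements in diagnosis, patient care, holds the potential, potential for significant, significant advancements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: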

Abstract:The integration of deep learning tools in gastrointestinal vision holds the potential for significant advancements in diagnosis, treatment, and overall patient care. A major challenge, however, is these tools’ tendency to make overconfident predictions, even when encountering unseen or newly emerging disease patterns, undermining their reliability. We address this critical issue of reliability by framing it as an out-of-distribution (OOD) detection problem, where previously unseen and emerging diseases are identified as OOD examples. However, gastrointestinal images pose a unique challenge due to the overlapping feature representations between in-distribution (ID) and OOD examples. Existing approaches often overlook this characteristic, as they are primarily developed for natural image datasets, where feature distinctions are more apparent. Despite the overlap, we hypothesize that the features of an in-distribution example will cluster closer to the centroids of their ground truth class, resulting in a shorter distance to the nearest centroid. In contrast, OOD examples maintain an equal distance from all class centroids. Based on this observation, we propose a novel nearest-centroid distance deficit (NCDD) score in the feature space for gastrointestinal OOD detection. Evaluations across multiple deep learning architectures and two publicly available benchmarks, Kvasir2 and Gastrovision, demonstrate the effectiveness of our approach compared to several state-of-the-art methods. The code and implementation details are publicly available at: this https URL
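
The hypothesis in the abstract (ID features sit near one class centroid, OOD features sit about equally far from all of them) suggests a simple centroid-based score. The abstract does not give the exact NCDD formula, so the sketch below uses an assumed stand-in: the mean distance to all centroids minus the distance to the nearest one. The function names and toy features are hypothetical.

```python
import math


def class_centroids(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def distance_deficit(feature, centroids):
    """Assumed score (not the paper's exact formula): how much closer the
    nearest centroid is than the average centroid. Near zero when the
    sample is equidistant from all centroids."""
    dists = [math.dist(feature, c) for c in centroids.values()]
    return sum(dists) / len(dists) - min(dists)


train_features = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
train_labels = [0, 0, 1, 1]
centroids = class_centroids(train_features, train_labels)

id_like = distance_deficit((0.1, 1.0), centroids)    # close to class 0
ood_like = distance_deficit((2.0, 1.0), centroids)   # equidistant from both
```

Under this assumed definition a large deficit points to an in-distribution sample and a deficit near zero points to an OOD sample; a threshold on the score would then be chosen on held-out data.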

[CV-43] 3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

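[Quick read]: This paper addresses the limited precision and poor real-time performance of current methods for editing complex 3D scenes. The key is 3DSceneEditor, a fully 3D editing paradigm based on Gaussian Splatting. By manipulating Gaussians directly in 3D space, it supports efficient, high-quality edits, including adding, repositioning, recoloring, replacing, and deleting objects. Its main components are: (i) a pre-trained instance segmentation model for semantic labeling; (ii) zero-shot grounding with CLIP to align target objects with user prompts; and (iii) scene modifications applied directly to the Gaussians. The paper reports better editing precision and speed than current state-of-the-art 3D scene editing methods.

Link: https://arxiv.org/abs/2412.01583
Authors: Ziyang Yan,Lei Li,Yihua Shao,Siyu Chen,Wuzong Kai,Jenq-Neng Hwang,Hao Zhao,Fabio Remondino
Keywords: assets and environments, labor-intensive and costly, requiring designers, meticulously configure, designers to meticulously
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL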

Abstract:The creation of 3D scenes has traditionally been both labor-intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text-to-3D and image-to-3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi-step, 2D-to-3D projection methods and diffusion-based techniques, which often lack precision in control and hamper real-time performance. In this work, we propose 3DSceneEditor, a fully 3D-based paradigm for real-time, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct manipulation of Gaussians for efficient, high-quality edits based on input prompts. The proposed framework (i) integrates a pre-trained instance segmentation model for semantic labeling; (ii) employs a zero-shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacing, and deletion directly on Gaussians. Extensive experimental results show that 3DSceneEditor achieves superior editing precision and speed with respect to current SOTA 3D scene editing approaches, establishing a new benchmark for efficient and interactive 3D scene customization.

[CV-44] Detection Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle KR

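[Quick read]: This paper addresses instance separation in human pose estimation for multi-person scenes. Existing methods condition on detected bounding boxes or bottom-up pose estimates, but overlook segmentation masks and their connection to the estimated keypoints. The key is to condition the pose estimation model on segmentation masks, which improves instance separation. The paper also proposes the BBox-Mask-Pose (BMP) framework, which integrates detection, segmentation, and pose estimation into a self-improving feedback loop: the detector and pose estimation model are conditioned on instance masks, and Segment Anything serves as the pose-to-mask model that closes the loop. BMP performs well on both OCHuman and COCO, combining the strengths of top-down and detector-free methods and matching state-of-the-art performance in both settings.

Link: https://arxiv.org/abs/2412.01562
Authors: Miroslav Purkrabek,Jiri Matas
Keywords: Human pose estimation, Human pose, pose estimation, separated people, people but struggle
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL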

Abstract:Human pose estimation methods work well on separated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation with detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlooked segmentation masks and their connection to estimated keypoints. We condition the pose estimation model on segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), integrating detection, segmentation and pose estimation into a self-improving feedback loop. We adapt the detector and pose estimation model for conditioning by instance masks and use Segment Anything as the pose-to-mask model to close the circle. With only small models, BMP is superior to top-down methods on the OCHuman dataset and to detector-free methods on the COCO dataset, combining the best from both approaches and matching state-of-the-art performance in both settings. Code is available on this https URL.

[CV-45] Adaptive High-Pass Kernel Prediction for Efficient Video Deblurring WACV2025

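[Quick read]: This paper addresses the insufficient recovery of high-frequency (HF) information in video deblurring. The key is to dynamically predict adaptive high-pass kernels as a linear combination of high-pass basis kernels and use them to extract high-frequency features. This improves the recovery of fine details while keeping the memory footprint low and inference fast, giving state-of-the-art results among low-budget models.

Link: https://arxiv.org/abs/2412.01559
Authors: Bo Ji,Angela Yao
Keywords: sharpened video frames, recover sharpened video, video deblurring methods, deep network architectures, video frames
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV 2025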

Abstract:State-of-the-art video deblurring methods use deep network architectures to recover sharpened video frames. Blurring especially degrades high-frequency (HF) information, yet this aspect is often overlooked by recent models that focus more on enhancing architectural design. Recovering these fine details is challenging, partly due to the spectral bias of neural networks, which are inclined towards learning low-frequency functions. To address this, we enforce explicit network structures to capture the fine details and edges. We dynamically predict adaptive high-pass kernels from a linear combination of high-pass basis kernels to extract high-frequency features. This strategy is highly efficient, resulting in a low memory footprint for training and fast run times for inference, while achieving state-of-the-art results among low-budget models. The code is available at this https URL.
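
A useful property behind building kernels from a high-pass basis is that any linear combination of zero-sum kernels is itself zero-sum, so it ignores flat regions whatever the coefficients are. The sketch below illustrates this with three classic 3×3 kernels; the choice of basis and the fixed coefficients are assumptions (in the paper the coefficients are predicted by the network).

```python
# Assumed basis of classic zero-sum (high-pass) 3x3 kernels.
LAPLACIAN = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
BASIS = [LAPLACIAN, SOBEL_X, SOBEL_Y]


def combine_kernels(coefficients, basis=BASIS):
    """Linear combination of basis kernels, one coefficient per kernel."""
    size = len(basis[0])
    return [
        [sum(c * k[r][col] for c, k in zip(coefficients, basis))
         for col in range(size)]
        for r in range(size)
    ]


def apply_at(patch, kernel):
    """Response of the kernel on one patch (cross-correlation, no flip)."""
    return sum(p * k for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))


# Hypothetical coefficients standing in for a per-pixel network prediction.
kernel = combine_kernels([0.5, 0.3, -0.2])

flat_patch = [[7.0] * 3 for _ in range(3)]
edge_patch = [[0.0, 0.0, 1.0]] * 3

flat_response = apply_at(flat_patch, kernel)   # about 0: no detail to extract
edge_response = apply_at(edge_patch, kernel)   # non-zero: a vertical edge
```

Predicting only a handful of coefficients per location is much cheaper than predicting a full kernel, which is consistent with the efficiency argument in the abstract.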

[CV-46] VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

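[Quick read]: This paper addresses cross-task dynamics and video-text alignment in joint video highlight detection and moment retrieval (HD/MR), along with the shortcomings of existing models in multimodal feature fusion and in capturing dependencies between the two tasks. The key is VideoLights, a framework built from five components: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment; (ii) a Bi-Directional Cross-Modal Fusion network that produces strongly coupled, query-aware clip representations; (iii) a uni-directional joint-task feedback mechanism that improves both tasks through their correlation; (iv) hard positive/negative losses for adaptive error penalization and improved learning; and (v) large vision-language models (LVLMs) such as BLIP-2 for enhanced multimodal feature integration, together with intelligent pretraining on synthetic data generated by LVLMs.

Link: https://arxiv.org/abs/2412.01558
Authors: Dhiman Paul,Md Rizwan Parvez,Nabeel Mohammed,Shafin Rahman
Keywords: Video Highlight Detection, Moment Retrieval, Highlight Detection, Detection and Moment, Video Highlight
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: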

Abstract:Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at this https URL .

[CV-47] Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

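[Quick read]: This paper addresses noise caused by defective modalities in RGB-Thermal salient object detection. The key is ConTriNet, a robust Confluent Triple-Flow Network that follows a divide-and-conquer strategy. ConTriNet has three flows: two modality-specific flows explore cues from the RGB and thermal modalities, and a third, modality-complementary flow integrates cues from both. Its main components are a Modality-induced Feature Modulator in the modality-shared union encoder, which reduces inter-modality discrepancies and mitigates the impact of defective samples; a Residual Atrous Spatial Pyramid Module in the separated flows, which enlarges the receptive field to capture multi-scale context; and a Modality-aware Dynamic Aggregation Module in the modality-complementary flow, which dynamically aggregates saliency-related cues from the modality-specific flows. The parallel triple-flow framework and a flow-cooperative fusion strategy yield a high-quality, full-resolution saliency map.

Link: https://arxiv.org/abs/2412.01556
Authors: Hao Tang,Zechao Li,Dong Zhang,Shengfeng He,Jinhui Tang
Keywords: Salient Object Detection, RGB-Thermal Salient Object, Object Detection aims, pinpoint prominent objects, Salient Object
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by IEEE TPAMI. Project page: this https URL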

Abstract:RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities. Inspired by hierarchical human visual systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing for the capture of multi-scale contextual information. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine saliency maps derived from different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world challenging scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.

[CV-48] Optimizing Domain-Specific Image Retrieval: A Benchmark of FAISS and Annoy with Fine-Tuned Features

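[Quick read]: This paper addresses Approximate Nearest Neighbor (ANN) search for large-scale data retrieval, which is central to retrieval performance in many applications. The key is to connect feature extraction and ANN indexing by pairing a fine-tuned ResNet50 model with ANN methods, namely FAISS and Annoy. The systems are evaluated on indexing time, memory usage, query time, precision, recall, F1-score, and Recall@5. FAISS's Product Quantization reaches a precision of 98.40% with low memory usage (0.24 MB index size), while Annoy is the fastest (average query time of 0.00015 seconds) at a slight cost in accuracy. The results expose the trade-offs among speed, accuracy, and memory efficiency and offer practical guidance for optimizing feature-based image retrieval systems.

Link: https://arxiv.org/abs/2412.01555
Authors: MD Shaikh Rahman,Syed Maudud E Rabbi,Muhammad Mahbubur Rashid
Keywords: Approximate Nearest Neighbor, Nearest Neighbor search, Approximate Nearest, Nearest Neighbor, Neighbor search
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: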

Abstract:Approximate Nearest Neighbor (ANN) search is key to retrieval performance on large-scale data in many applications. This work bridges feature extraction and ANN indexing by pairing a fine-tuned ResNet50 model with two ANN methods: FAISS and Annoy. We evaluate the systems with respect to indexing time, memory usage, query time, precision, recall, F1-score, and Recall@5 on a custom image dataset. FAISS’s Product Quantization can achieve a precision of 98.40% with low memory usage at 0.24 MB index size, and Annoy is the fastest, with average query times of 0.00015 seconds, at a slight cost to accuracy. These results reveal trade-offs among speed, accuracy, and memory efficiency and offer actionable insights into the optimization of feature-based image retrieval systems. This study can serve as a blueprint for constructing practical retrieval pipelines built on fine-tuned deep learning networks and associated ANN methods.
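
To show what a Recall@k figure measures, here is a small library-free sketch. It uses the definition common in ANN benchmarking (the overlap between an approximate top-k list and the exact brute-force top-k); the paper may instead define Recall@5 over class labels, so treat this as an assumption. The vectors and the "approximate" result are made up, and no FAISS or Annoy calls are used.

```python
import math


def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours by Euclidean distance (ground truth)."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
query = (0.1, 0.0)

truth = exact_top_k(query, vectors, k=3)
# Hypothetical output of an approximate index that missed one neighbour.
approx = [0, 2, 4]
score = recall_at_k(approx, truth)
```

In a benchmark like the one described, the `approx` list would come from the index under test, and the same loop would also record query time and index size to expose the speed, accuracy, and memory trade-off.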

[CV-49] SfM-Free 3D Gaussian Splatting via Hierarchical Training

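[Quick read]: This paper addresses how to build a high-quality 3D Gaussian Splatting (3DGS) model from video input without known camera poses or structure-from-motion (SfM) preprocessing. The key is an SfM-Free 3DGS (SFGS) method with a hierarchical training strategy that trains multiple 3D Gaussian representations, each optimized for a specific scene region, and merges them into a single unified 3DGS model. The method also uses video frame interpolation models to compensate for large camera motions and multi-source supervision to reduce overfitting and enhance the representation. Experiments show that it significantly outperforms existing SfM-free novel view synthesis methods on the Tanks and Temples and CO3D-V2 datasets.

Link: https://arxiv.org/abs/2412.01553
Authors: Bo Ji,Angela Yao
Keywords: sparse point cloud, Gaussian Splatting, pre-computed camera poses, point cloud, initialize and grow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: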

Abstract:Standard 3D Gaussian Splatting (3DGS) relies on known or pre-computed camera poses and a sparse point cloud, obtained from structure-from-motion (SfM) preprocessing, to initialize and grow 3D Gaussians. We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. Our approach introduces a hierarchical training strategy that trains and merges multiple 3D Gaussian representations – each optimized for specific scene regions – into a single, unified 3DGS model representing the entire scene. To compensate for large camera motions, we leverage video frame interpolation models. Additionally, we incorporate multi-source supervision to reduce overfitting and enhance representation. Experimental results reveal that our approach significantly surpasses state-of-the-art SfM-free novel view synthesis methods. On the Tanks and Temples dataset, we improve PSNR by an average of 2.25dB, with a maximum gain of 3.72dB in the best scene. On the CO3D-V2 dataset, we achieve an average PSNR boost of 1.74dB, with a top gain of 3.90dB. The code is available at this https URL.

[CV-50] GFreeDet: Exploiting Gaussian Splatting and Foundation Models for Model-free Unseen Object Detection in the BOP Challenge 2024

[Quick read]: This report addresses the model-free unseen object detection track of the BOP Challenge 2024. The key to the solution, GFreeDet, is to combine Gaussian splatting with vision foundation models to detect unseen objects.

Link: https://arxiv.org/abs/2412.01552
Authors: Xingyu Liu,Yingyue Li,Chengxi Li,Gu Wang,Chenyangguang Zhang,Ziqin Huang,Xiangyang Ji
Keywords-EN: exploits Gaussian splatting, vision Foundation models, unseen object Detection, object Detection track, submitted method GFreeDet
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:In this report, we provide the technical details of the submitted method GFreeDet, which exploits Gaussian splatting and vision Foundation models for the model-free unseen object Detection track in the BOP 2024 Challenge.

[CV-51] SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

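[Quick read]: This paper addresses the limitations of existing 3D affordance segmentation methods on complex user intentions and long-horizon tasks, in particular their inability to reason about and decompose the sequential affordances implied by a user's intention. The key to the solution is the new Sequential 3D Affordance Reasoning task, supported by the first instruction-based affordance segmentation benchmark, which contains 180K instruction-point cloud pairs and covers both single and sequential affordance reasoning. The proposed SeqAfford model equips a 3D multimodal large language model with affordance segmentation ability and adds a multi-granular language-point integration module, so that reasoning with world knowledge and fine-grained affordance grounding happen within one framework. In experiments it outperforms existing methods and shows open-world generalization with sequential reasoning.

Link: https://arxiv.org/abs/2412.01550
Authors: Chunlin Yu,Hanqing Wang,Ye Shi,Haoyang Luo,Sibei Yang,Jingyi Yu,Jingya Wang
Keywords-EN: link human instructions, objects for embodied, embodied manipulations, aims to link, link human
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: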

Abstract:3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, where each affordance type or explicit instruction strictly corresponds to a specific affordance region and are unable to handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from cumbersome user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, to unlock the 3D multi-modal large language model with additional affordance segmentation abilities, which ensures reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to endow 3D dense prediction. Extensive experimental evaluations show that our model excels over well-established methods and exhibits open-world generalization with sequential reasoning abilities.

[CV-52] 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

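[Quick read]: This paper addresses efficient and accurate 6D object pose estimation for arbitrary objects in a live RGB-D video stream. The key to the solution is 6DOPE-GS, which exploits the fast differentiable rendering of Gaussian Splatting to optimize 6D object poses and the 3D object reconstruction simultaneously. Specifically, 6DOPE-GS uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial coverage and prevent erroneous pose updates. It also proposes an opacity statistic-based pruning mechanism for adaptive Gaussian density control, which keeps training stable and efficient. On the HO3D and YCBInEOAT datasets, 6DOPE-GS matches state-of-the-art baselines for model-free 6D pose tracking and reconstruction while providing a 5× speedup.

Link: https://arxiv.org/abs/2412.01543
Authors: Yufeng Jin,Vignesh Prasad,Snehal Jauhri,Mathias Franzius,Georgia Chalvatzaki
Keywords-EN: Augmented Reality, modern vision systems, Efficient and accurate, object pose estimation, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: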

Abstract:Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation & tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5× speedup. We also demonstrate the method’s suitability for live, dynamic object tracking and reconstruction in a real-world setting.

[CV-53] The Bare Necessities: Designing Simple Effective Open-Vocabulary Scene Graphs

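[Quick read]: This paper addresses the efficiency and performance of 3D open-vocabulary scene graph methods for embodied agents. The key to the solution is to re-examine the critical design choices of existing methods: image pre-processing, feature fusion, and feature selection. The study finds that commonly used image pre-processing techniques triple computation while yielding only minimal performance improvement, and that averaging feature labels across views significantly degrades performance. The paper proposes a computationally balanced approach to 3D point cloud segmentation with per-object features; with improved feature selection strategies it raises performance without extra computational cost, matching state-of-the-art classification accuracy with a threefold reduction in computation.

Link: https://arxiv.org/abs/2412.01539
Authors: Christina Kassab,Matías Mattamala,Sacha Morin,Martin Büchner,Abhinav Valada,Liam Paull,Maurice Fallon
Keywords-EN: promising map representation, open-vocabulary scene graph, embodied agents, scene graph methods, promising map
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: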

Abstract:3D open-vocabulary scene graph methods are a promising map representation for embodied agents, however many current approaches are computationally expensive. In this paper, we reexamine the critical design choices established in previous works to optimize both efficiency and performance. We propose a general scene graph framework and conduct three studies that focus on image pre-processing, feature fusion, and feature selection. Our findings reveal that commonly used image pre-processing techniques provide minimal performance improvement while tripling computation (on a per object view basis). We also show that averaging feature labels across different views significantly degrades performance. We study alternative feature selection strategies that enhance performance without adding unnecessary computational costs. Based on our findings, we introduce a computationally balanced approach for 3D point cloud segmentation with per-object features. The approach matches state-of-the-art classification accuracy while achieving a threefold reduction in computation.

[CV-54] HandOS: 3D Hand Reconstruction in One Stage

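[Quick read]: This paper addresses the redundant computation and cumulative errors of traditional multi-stage hand reconstruction frameworks. The key to the solution is HandOS, an end-to-end 3D hand reconstruction framework that builds on a frozen detector and adds auxiliary modules for 2D and 3D keypoint estimation, integrating pose estimation into the detection framework and removing the need for left-right classification as a prerequisite. Specifically, HandOS uses an interactive 2D-3D decoder in which 2D joint semantics are derived from detection cues and the 3D representation is lifted from the 2D joints. A hierarchical attention design models 2D joints, 3D vertices, and camera translation concurrently. HandOS thus integrates hand detection, 2D pose estimation, and 3D mesh reconstruction end to end in a one-stage framework, overcoming the drawbacks of multi-stage pipelines and reaching state-of-the-art performance on public benchmarks.

Link: https://arxiv.org/abs/2412.01537
Authors: Xingyu Chen,Zhuheng Song,Xiaoke Jiang,Yaoqing Hu,Junzhi Yu,Lei Zhang
Keywords-EN: Existing approaches, reconstruction predominantly adhere, hand reconstruction predominantly, predominantly adhere, Existing
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: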

Abstract:Existing approaches of hand reconstruction predominantly adhere to a multi-stage framework, encompassing detection, left-right classification, and pose estimation. This paradigm induces redundant computation and cumulative errors. In this work, we propose HandOS, an end-to-end framework for 3D hand reconstruction. Our central motivation lies in leveraging a frozen detector as the foundation while incorporating auxiliary modules for 2D and 3D keypoint estimation. In this manner, we integrate the pose estimation capacity into the detection framework, while at the same time obviating the necessity of using the left-right category as a prerequisite. Specifically, we propose an interactive 2D-3D decoder, where 2D joint semantics is derived from detection cues while 3D representation is lifted from those of 2D joints. Furthermore, hierarchical attention is designed to enable the concurrent modeling of 2D joints, 3D vertices, and camera translation. Consequently, we achieve an end-to-end integration of hand detection, 2D pose estimation, and 3D mesh reconstruction within a one-stage framework, so that the above multi-stage drawbacks are overcome. Meanwhile, the HandOS reaches state-of-the-art performances on public benchmarks, e.g., 5.0 PA-MPJPE on FreiHand and 64.6% PCK@0.05 on HInt-Ego4D. Project page: this http URL.
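The two benchmark numbers above are keypoint metrics. As a rough illustration, the sketch below computes plain MPJPE (mean per-joint position error) and PCK (percentage of correct keypoints). It omits the Procrustes alignment that turns MPJPE into PA-MPJPE, and it assumes PCK is normalized by a bounding-box size, which is a common convention but not confirmed by the abstract. All coordinates are hypothetical.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def pck(pred, gt, box_size, threshold=0.05):
    """Share of joints whose error is within threshold * box_size."""
    hits = sum(math.dist(p, g) <= threshold * box_size for p, g in zip(pred, gt))
    return hits / len(gt)

# Hypothetical 2-D keypoints in pixels; the hand bounding box is 100 px wide.
gt = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
pred = [(1.0, 0.0), (10.0, 0.0), (10.0, 16.0), (0.0, 10.0)]

error = mpjpe(pred, gt)
correct = pck(pred, gt, box_size=100.0, threshold=0.05)
```

Lower is better for MPJPE and higher is better for PCK, which is why the abstract reports a small PA-MPJPE next to a PCK percentage.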

[CV-55] Traversing the Subspace of Adversarial Patches

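[Quick read]: This paper investigates the nature of adversarial examples in deep learning, in particular whether the manifold hypothesis, that high-dimensional data tends to lie on a low-dimensional manifold, holds for adversarial patches. The key to the approach is to analyze a set of adversarial patches and reconstruct them with three different dimensionality reduction methods, then measure how the reconstructed patches perform in an attack setting and study the effect of sampling patches from the latent space during adversarial training. The results indicate that more sophisticated dimensionality reduction methods offer no advantage over simple principal component analysis (PCA).

Link: https://arxiv.org/abs/2412.01527
Authors: Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
Keywords-EN: attacks remain unclear, computer vision, remain unclear, ongoing research, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: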

Abstract:Despite ongoing research on the topic of adversarial examples in deep learning for computer vision, some fundamentals of the nature of these attacks remain unclear. As the manifold hypothesis posits, high-dimensional data tends to be part of a low-dimensional manifold. To verify the thesis with adversarial patches, this paper provides an analysis of a set of adversarial patches and investigates the reconstruction abilities of three different dimensionality reduction methods. Quantitatively, the performance of reconstructed patches in an attack setting is measured and the impact of sampled patches from the latent space during adversarial training is investigated. The evaluation is performed on two publicly available datasets for person detection. The results indicate that more sophisticated dimensionality reduction methods offer no advantages over a simple principal component analysis.
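The idea of reconstructing patches from a low-dimensional latent space can be illustrated with the simplest case: rank-1 PCA. The sketch below is a toy, standard-library version that finds the leading principal direction by power iteration, projects each sample to a 1-D code, and reconstructs it. The `patches` data are hypothetical flattened vectors, not adversarial patches, and the paper's actual pipeline is not described in the abstract.

```python
import math

def mean_vector(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def first_component(centered, iterations=200):
    """Leading principal direction via power iteration on X^T X."""
    dim = len(centered[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the covariance matrix
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def reconstruct_rank1(rows):
    """Encode each row as one scalar along the first component, then decode."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    v = first_component(centered)
    codes = [sum(x * y for x, y in zip(row, v)) for row in centered]  # 1-D latent
    return [[m + c * vj for m, vj in zip(mu, v)] for c in codes]

# Hypothetical "patches": 3-D vectors that lie exactly on a line, so a single
# component is enough to reconstruct them.
patches = [[float(t), 2.0 * t, -1.0 * t] for t in range(5)]
restored = reconstruct_rank1(patches)
max_error = max(abs(a - b) for r, p in zip(restored, patches) for a, b in zip(r, p))
```

Real data never lies exactly on a line, so the reconstruction error as a function of the number of retained components is what indicates how low-dimensional the data is.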

[CV-56] InfinityDrive: Breaking Time Limits in Driving World Models

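[Quick read]: This paper addresses the difficulty autonomous driving systems have in complex scenarios because diverse, extensive, and out-of-distribution driving data, which are critical for safe navigation, are scarce. The key to the solution is InfinityDrive, a driving world model with strong generalization that generates minute-scale video with high fidelity, consistency, and diversity. An efficient spatio-temporal co-modeling module and an extended temporal training strategy enable high-resolution (576 × 1024) video generation that stays spatially and temporally coherent. Memory injection and retention mechanisms, together with an adaptive memory curve loss, minimize cumulative errors and allow consistent videos of more than 1500 frames (about 2 minutes). Experiments on multiple datasets show that InfinityDrive can generate complex and varied scenarios, pointing to its potential as a next-generation driving world model.

Link: https://arxiv.org/abs/2412.01522
Authors: Xi Guo,Chenjing Ding,Haoxuan Dou,Xin Zhang,Weixuan Tang,Wei Wu
Keywords-EN: driving systems struggle, driving world model, access to diverse, safe navigation, Autonomous driving systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project homepage: this https URL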

Abstract:Autonomous driving systems struggle with complex scenarios due to limited access to diverse, extensive, and out-of-distribution driving data which are critical for safe navigation. World models offer a promising solution to this challenge; however, current driving world models are constrained by short time windows and limited scenario diversity. To bridge this gap, we introduce InfinityDrive, the first driving world model with exceptional generalization capabilities, delivering state-of-the-art performance in high fidelity, consistency, and diversity with minute-scale video generation. InfinityDrive introduces an efficient spatio-temporal co-modeling module paired with an extended temporal training strategy, enabling high-resolution (576 × 1024) video generation with consistent spatial and temporal coherence. By incorporating memory injection and retention mechanisms alongside an adaptive memory curve loss to minimize cumulative errors, InfinityDrive achieves consistent video generation lasting over 1500 frames (approximately 2 minutes). Comprehensive experiments in multiple datasets validate InfinityDrive’s ability to generate complex and varied scenarios, highlighting its potential as a next-generation driving world model built for the evolving demands of autonomous driving. Our project homepage: this https URL

[CV-57] ArtBrain: An Explainable end-to-end Toolkit for Classification and Attribution of AI-Generated Art and Style

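[Quick read]: This paper addresses identifying the authenticity of AI-generated artworks and tracing them to their source. The key to the solution is AI-ArtBench, a dataset of 185,015 artistic images across 10 art styles, comprising 125,015 AI-generated images and 60,000 human-created artworks. The paper proposes AttentionConvNeXt, a convolutional neural network based on ConvNeXt, which distinguishes the source of an artwork and its style with an F1-score of 0.869 and attributes images to their generative model with an accuracy of 0.999. It also presents ArtBrain, a web application that lets technical and non-technical users interact with the model. In an Artistic Turing Test, humans identified AI-generated images with an accuracy of about 58%, whereas the model reached about 99%.

Link: https://arxiv.org/abs/2412.01512
Authors: Ravidu Suien Rammuni Silva,Ahmad Lotfi,Isibor Kennedy Ihianle,Golnaz Shahtahmassebi,Jordan J. Bird
Keywords-EN: Artificial Intelligence, generated using Artificial, detecting synthetic artworks, resulting in growing, growing difficulties
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: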

Abstract:Recently, the quality of artworks generated using Artificial Intelligence (AI) has increased significantly, resulting in growing difficulties in detecting synthetic artworks. However, limited studies have been conducted on identifying the authenticity of synthetic artworks and their source. This paper introduces AI-ArtBench, a dataset featuring 185,015 artistic images across 10 art styles. It includes 125,015 AI-generated images and 60,000 pieces of human-created artwork. This paper also outlines a method to accurately detect AI-generated images and trace them to their source model. This work proposes a novel Convolutional Neural Network model based on the ConvNeXt model called AttentionConvNeXt. AttentionConvNeXt was implemented and trained to differentiate between the source of the artwork and its style with an F1-Score of 0.869. The accuracy of attribution to the generative model reaches 0.999. To combine the scientific contributions arising from this study, a web-based application named ArtBrain was developed to enable both technical and non-technical users to interact with the model. Finally, this study presents the results of an Artistic Turing Test conducted with 50 participants. The findings reveal that humans could identify AI-generated images with an accuracy of approximately 58%, while the model itself achieved a significantly higher accuracy of around 99%.
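For readers comparing the F1-score and accuracy figures above, the sketch below computes a macro-averaged F1 from label lists. Whether the paper reports macro, micro, or weighted F1 is not stated in the abstract, so the averaging choice here is an assumption, and the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical source labels for six artworks.
y_true = ["human", "human", "ai", "ai", "ai", "human"]
y_pred = ["human", "ai", "ai", "ai", "human", "human"]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

F1 and accuracy happen to coincide on this balanced toy example; on imbalanced classes they can diverge considerably, which is one reason both are commonly reported.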

[CV-58] HaGRIDv2: 1M Images for Static and Dynamic Hand Gesture Recognition

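[Quick read]: This paper addresses extending and improving a hand gesture recognition dataset, adding new gesture classes and improving a dynamic gesture recognition algorithm to raise accuracy and generalization. The key points of the solution are: 1) 15 new gestures, including two-handed ones, that enrich the dataset; 2) three new groups of manipulation gestures that extend the dynamic gesture recognition algorithm; 3) a more diverse "no gesture" class that reduces false positives by a factor of 6; 4) extra samples combined with the original dataset, which improve pre-trained models on gesture-related tasks; 5) improved quality of gestures generated by a diffusion model. With these changes, HaGRIDv2 achieves the best generalization ability among gesture and hand detection datasets.

Link: https://arxiv.org/abs/2412.01508
Authors: Anton Nuzhdin,Alexander Nagaev,Alexander Sautin,Alexander Kapitanov,Karina Kvanchiani
Keywords-EN: widespread Hand Gesture, dynamic gesture recognition, gesture recognition algorithm, Gesture Recognition, paper proposes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: hand gesture recognition, dataset, hgr system, large-scale database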

Abstract:This paper proposes the second version of the widespread Hand Gesture Recognition dataset HaGRID – HaGRIDv2. We cover 15 new gestures with conversation and control functions, including two-handed ones. Building on the foundational concepts proposed by HaGRID’s authors, we implemented the dynamic gesture recognition algorithm and further enhanced it by adding three new groups of manipulation gestures. The "no gesture" class was diversified by adding samples of natural hand movements, which allowed us to reduce false positives by a factor of 6. Combining extra samples with HaGRID, the resulting version outperforms the original in pre-training models for gesture-related tasks. Besides, we achieved the best generalization ability among gesture and hand detection datasets. In addition, the second version enhances the quality of the gestures generated by the diffusion model. HaGRIDv2, pre-trained models, and a dynamic gesture recognition algorithm are publicly available.

[CV-59] Structured 3D Latents for Scalable and Versatile 3D Generation

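[Quick read]: This paper addresses versatile, high-quality 3D asset generation. The key to the solution is a unified Structured LATent (SLAT) representation. SLAT integrates a sparsely populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, capturing both structural (geometry) and textural (appearance) information while keeping decoding flexible. The paper uses rectified flow transformers tailored for SLAT as 3D generation models and trains models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. The method significantly surpasses existing methods in result quality and offers flexible output format selection and local 3D editing, which previous models did not.

Link: https://arxiv.org/abs/2412.01506
Authors: Jianfeng Xiang,Zelong Lv,Sicheng Xu,Yu Deng,Ruicheng Wang,Bowen Zhang,Dong Chen,Xin Tong,Jiaolong Yang
Keywords-EN: Radiance Fields, unified Structured LATent, asset creation, Structured LATent, unified Structured
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL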

Abstract:We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.

[CV-60] SF-Loc: A Visual Mapping and Geo-Localization System based on Sparse Visual Structure Frames

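[Quick read]: This paper addresses the limitations of global navigation satellite system (GNSS) positioning in challenging environments, where high-level geo-spatial applications and intelligent robotics still need accurate global pose information. The key to the solution is SF-Loc, a lightweight visual mapping and map-aided localization system. Its core idea is a map representation based on sparse frames with dense (though downsampled) depth, termed visual structure frames. In the mapping phase, multi-sensor dense bundle adjustment (MS-DBA) constructs geo-referenced visual structure frames, and local co-visibility checks keep the map sparse and enable incremental mapping. In the localization phase, coarse-to-fine vision-based localization integrates multi-frame information and the map distribution: spatially smoothed similarity (SSS) is proposed to overcome place ambiguity, and pairwise frame matching provides efficient and robust pose estimation. Experiments show that in complex urban road scenarios the map size drops to 3 MB per kilometer with stable decimeter-level re-localization.

Link: https://arxiv.org/abs/2412.01500
Authors: Yuxuan Zhou,Xingxing Li,Shengyu Li,Chunxi Xia,Xuanbin Wang,Shaoquan Feng
Keywords-EN: high-level geo-spatial applications, accurate global pose, intelligent robotics, crucial importance, high-level geo-spatial
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: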

Abstract:For high-level geo-spatial applications and intelligent robotics, accurate global pose information is of crucial importance. Map-aided localization is an important and universal approach to overcome the limitations of global navigation satellite system (GNSS) in challenging environments. However, current solutions face challenges in terms of mapping flexibility, storage burden and re-localization performance. In this work, we present SF-Loc, a lightweight visual mapping and map-aided localization system, whose core idea is the map representation based on sparse frames with dense (though downsampled) depth, termed visual structure frames. In the mapping phase, multi-sensor dense bundle adjustment (MS-DBA) is applied to construct geo-referenced visual structure frames. The local co-visibility is checked to keep the map sparsity and achieve incremental mapping. In the localization phase, coarse-to-fine vision-based localization is performed, in which multi-frame information and the map distribution are fully integrated. To be specific, the concept of spatially smoothed similarity (SSS) is proposed to overcome the place ambiguity, and pairwise frame matching is applied for efficient and robust pose estimation. Experimental results on both public and self-made datasets verify the effectiveness of the system. In complex urban road scenarios, the map size is down to 3 MB per kilometer and stable decimeter-level re-localization can be achieved. The code will be made open-source soon (this https URL).

[CV-61] RaD: A Metric for Medical Image Distribution Comparison in Out-of-Domain Detection and Other Applications

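[Quick read]: This paper addresses how to determine accurately whether two sets of images belong to the same or different domains in medical image analysis and deep learning. The question matters when handling domain shift, which affects both model performance and the output quality of generative models. The proposed solution is a new perceptual metric, Radiomic Feature Distance (RaD), which uses standardized, clinically meaningful, and interpretable image features. According to the paper, RaD outperforms existing metrics that either depend on a downstream task (such as segmentation) or borrow perceptual metrics from natural imaging (such as FID). RaD performs well in out-of-domain (OOD) detection, and for image-to-image translation it correlates more strongly with downstream task performance, anatomical consistency, and realism. It also offers interpretability, stability, and computational efficiency, especially at low sample sizes. The experiments span four multi-domain medical image datasets, nine downstream tasks, and six image translation models.

Link: https://arxiv.org/abs/2412.01496
Authors: Nicholas Konz,Yuwen Chen,Hanxue Gu,Haoyu Dong,Yaqian Chen,Maciej A. Mazurowski
Keywords-EN: deep learning, domain shift, modern medical image, common problem, problem that commonly
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments: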

Abstract:Determining whether two sets of images belong to the same or different domain is a crucial task in modern medical image analysis and deep learning, where domain shift is a common problem that often results in decreased model performance. This determination is also important to evaluate the output quality of generative models, e.g., image-to-image translation models used to mitigate domain shift. Current metrics for this either rely on the (potentially biased) choice of some downstream task such as segmentation, or adopt task-independent perceptual metrics (e.g., FID) from natural imaging which insufficiently capture anatomical consistency and realism in medical images. We introduce a new perceptual metric tailored for medical images: Radiomic Feature Distance (RaD), which utilizes standardized, clinically meaningful and interpretable image features. We show that RaD is superior to other metrics for out-of-domain (OOD) detection in a variety of experiments. Furthermore, RaD outperforms previous perceptual metrics (FID, KID, etc.) for image-to-image translation by correlating more strongly with downstream task performance as well as anatomical consistency and realism, and shows similar utility for evaluating unconditional image generation. RaD also offers additional benefits such as interpretability, as well as stability and computational efficiency at low sample sizes. Our results are supported by broad experiments spanning four multi-domain medical image datasets, nine downstream tasks, six image translation models, and other factors, highlighting the broad potential of RaD for medical image analysis.

[CV-62] Learning Adaptive Lighting via Channel-Aware Guidance

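[Quick read]: This paper addresses learning lighting adaptation, specifically how a single unified framework can handle multiple light-related tasks such as image retouching and exposure correction. The key to the solution is the Learning Adaptive Lighting Network (LALNet). The framework introduces color-separated features that emphasize the light differences between color channels and combines them with traditional color-mixed features through Light Guided Attention (LGA). LGA uses the color-separated features to guide the color-mixed features, focusing on channel differences while ensuring visual consistency across channels. The paper also introduces dual domain channel modulation to generate the color-separated features, and a wavelet transform followed by a vision state space module to generate the color-mixed features. Experiments show that LALNet significantly outperforms state-of-the-art methods on several light-related tasks while using fewer computational resources.

Link: https://arxiv.org/abs/2412.01493
Authors: Qirui Yang,Peng-Tao Jiang,Hao Zhang,Jinwei Chen,Bo Li,Huanjing Yue,Jingyu Yang
Keywords-EN: Learning lighting adaption, good visual perception, Learning Adaptive Lighting, supporting downstream vision, Adaptive Lighting Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: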

Abstract:Learning lighting adaption is a key step in obtaining a good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction) and previous studies have mainly investigated these tasks individually. However, we observe that the light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the channel differences reflected in the time and frequency domains are different. Based on the common light property guidance, we propose a Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce the color-separated features that emphasize the light difference of different color channels and combine them with the traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across channels. We introduce dual domain channel modulation to generate color-separated features and a wavelet followed by a vision state space module to generate color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at this https URL.

[CV-63] SerialGen: Personalized Image Generation by First Standardization Then Personalization

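[Quick read]: This paper addresses how to achieve both high text controllability and overall appearance consistency when generating personalized human characters. The key to the solution is SerialGen, a serial generation framework with two stages: a standardization stage that standardizes the reference images, followed by a personalized generation stage based on the standardized reference. The paper also introduces two modules that enhance the standardization process. Experiments show that the framework produces personalized images that stay faithful to the reference image's overall appearance while responding accurately to a wide range of text prompts. The analysis highlights the role of the serial generation method and the standardization model in improving appearance consistency between reference and output images and across images generated from different text prompts.

Link: https://arxiv.org/abs/2412.01485
Authors: Cong Xie,Han Zou,Ruiqi Yu,Yan Zhang,Zhenpeng Zhan
Keywords-EN: personalized human characters, high text controllability, human characters, interested in achieving, achieving both high
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: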

Abstract:In this work, we are interested in achieving both high text controllability and overall appearance consistency in the generation of personalized human characters. We propose a novel framework, named SerialGen, which is a serial generation method consisting of two stages: first, a standardization stage that standardizes reference images, and then a personalized generation stage based on the standardized reference. Furthermore, we introduce two modules aimed at enhancing the standardization process. Our experimental results validate the proposed framework’s ability to produce personalized images that faithfully recover the reference image’s overall appearance while accurately responding to a wide range of text prompts. Through thorough analysis, we highlight the critical contribution of the proposed serial generation method and standardization model, evidencing enhancements in appearance consistency between reference and output images and across serial outputs generated from diverse text prompts. The term “Serial” in this work carries a double meaning: it refers to the two-stage method and also underlines our ability to generate serial images with consistent appearance throughout.

[CV-64] Improving Object Detection by Modifying Synthetic Data with Explainable AI

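[Quick read]: This paper addresses limited model performance in computer vision domains where real-world data are hard to collect, especially for inference on samples that are unseen or underrepresented in training. The key to the solution is to use Explainable AI (XAI) techniques to guide the modification of the 3D models that generate synthetic images, thereby improving the design of the synthetic dataset. Specifically, XAI saliency maps guide modifications of 3D models in the Unity game engine, which may either increase or decrease the realism of the synthetic data, to improve infrared vehicle detection. Synthetic data improve detection of vehicles in orientations unseen in training by 4.6% (to an mAP50 of 94.6%), and the XAI-guided approach adds a further 1.5% (to 96.1%), showing the potential of fine-grained, XAI-controlled curation of synthetic datasets for object detection.

Link: https://arxiv.org/abs/2412.01477
Authors: Nitish Mital,Simon Malzard,Richard Walters,Celso M. De Melo,Raghuveer Rao,Victoria Nockles
Keywords-EN: severely impact model, computer vision domains, domains the collection, collection of sufficient, severely impact
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: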

Abstract:In many computer vision domains the collection of sufficient real-world data is challenging and can severely impact model performance, particularly when running inference on samples that are unseen or underrepresented in training. Synthetically generated images provide a promising solution, but it remains unclear how to design synthetic data to optimally improve model performance, for example whether to introduce more realism or more abstraction in such datasets. Here we propose a novel conceptual approach to improve the performance of computer vision models trained on synthetic images, by using robust Explainable AI (XAI) techniques to guide the modification of 3D models used to generate these images. Importantly, this framework allows both modifications that increase and decrease realism in synthetic data, which can both improve model performance. We illustrate this concept using a real-world example where data are sparse; the detection of vehicles in infrared imagery. We fine-tune an initial YOLOv8 model on the ATR DSIAC infrared dataset and synthetic images generated from 3D mesh models in the Unity gaming engine, and then use XAI saliency maps to guide modification of our Unity models. We show that synthetic data can improve detection of vehicles in orientations unseen in training by 4.6% (to mAP50 scores of 94.6%). We further improve performance by an additional 1.5% (to 96.1%) through our new XAI-guided approach, which reduces misclassifications through both increasing and decreasing the realism of different parts of the synthetic data. These proof-of-concept results pave the way for fine, XAI-controlled curation of synthetic datasets through detailed feature modifications, tailored to improve object detection performance.
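The mAP50 scores above count a detection as correct when its box overlaps a ground-truth box with an intersection over union (IoU) of at least 0.5. The sketch below shows that matching test only; it does not compute the full precision-recall average behind mAP. The boxes are hypothetical and use an `(x1, y1, x2, y2)` corner format chosen for this example.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches the ground truth when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold

gt_box = (0.0, 0.0, 10.0, 10.0)
close_box = (2.0, 0.0, 12.0, 10.0)   # shifted by 2 px: IoU = 80 / 120
far_box = (6.0, 0.0, 16.0, 10.0)     # shifted by 6 px: IoU = 40 / 160

close_match = is_true_positive(close_box, gt_box)
far_match = is_true_positive(far_box, gt_box)
```

A full mAP50 evaluation would additionally rank detections by confidence, match each ground-truth box at most once, and average precision over recall levels and classes.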

[CV-65] Multi-Granularity Video Object Segmentation

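[Quick read]: This paper addresses the fact that existing video segmentation benchmarks annotate only salient objects (i.e., foreground instances), so models trained on them struggle to adapt to real-world scenarios. The key to the solution is a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that annotates masks for both salient and non-salient objects at multiple granularities. It consists of an automatically collected training set and a human-annotated test set. The paper also presents a memory-based mask propagation model (MMPM), trained and evaluated on MUG-VOS, which achieves the best performance among existing video object segmentation methods and SAM-based video segmentation methods.

Link: https://arxiv.org/abs/2412.01471
Authors: Sangbeom Lim,Seongchan Kim,Seungjun An,Seokju Cho,Paul Hongsuck Seo,Seungryong Kim
Keywords-EN: Current benchmarks, foreground instances, limited to annotating, segmentation, video segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL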

Abstract:Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real-world scenarios. Thus, developing a new video segmentation dataset aimed at tracking multi-granularity segmentation targets in the video scene is necessary. In this work, we aim to generate a multi-granularity video segmentation dataset that is annotated for both salient and non-salient masks. To achieve this, we propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset, which leads to the best performance among the existing video object segmentation methods and SAM-based video segmentation methods. Project page is available at this https URL.

[CV-66] Learning Differential Pyramid Representation for Tone Mapping

【速读】：该论文试图解决传统色调映射方法在处理高动态范围（HDR）图像时，由于手工特征提取的高频分量不足，导致输出图像过度平滑和细节丢失的问题。解决方案的关键在于引入了一种可学习的微分金字塔表示网络（Differential Pyramid Representation Network, DPRNet），该网络能够有效捕捉图像的细节纹理和结构，从而实现高质量的色调映射恢复。此外，论文还设计了全局色调感知模块和局部色调调整模块，分别确保全局调整的一致性和局部调整的准确性，从而在保持图像细节的同时实现色调映射的全局和局部协调。实验结果表明，该方法在HDR+和HDRI Haven数据集上分别比次优方法提高了2.58 dB和3.31 dB的PSNR，显示出优越的性能和泛化能力。

Link: https://arxiv.org/abs/2412.01463
Authors: Qirui Yang, Yinbo Li, Peng-Tao Jiang, Qihua Cheng, Biting Yu, Yihao Liu, Huanjing Yue, Jingyu Yang
Keywords: high-frequent components extracted, Previous tone mapping, Previous tone, learnable Differential Pyramid, Pyramid Representation Network
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Previous tone mapping methods mainly focus on how to enhance tones in low-resolution images and recover details using the high-frequency components extracted from the input image. These methods typically rely on traditional feature pyramids to artificially extract high-frequency components, such as Laplacian and Gaussian pyramids with handcrafted kernels. However, traditional handcrafted features struggle to effectively capture the high-frequency components in HDR images, resulting in excessive smoothing and loss of detail in the output image. To mitigate the above issue, we introduce a learnable Differential Pyramid Representation Network (DPRNet). Based on the learnable differential pyramid, our DPRNet can capture detailed textures and structures, which is crucial for high-quality tone mapping recovery. In addition, to achieve global consistency and local contrast harmonization, we design a global tone perception module and a local tone tuning module that ensure the consistency of global tuning and the accuracy of local tuning, respectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods, improving PSNR by 2.58 dB on the HDR+ dataset and 3.31 dB on the HDRI Haven dataset, respectively, compared with the second-best method. Notably, our method exhibits the best generalization ability in the non-homologous image and video tone mapping operation. We provide an anonymous online demo at this https URL.
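
For context, the handcrafted pyramids that the abstract contrasts against can be sketched in a few lines. The following is a minimal 1D Laplacian pyramid in plain Python, assuming a fixed [1, 2, 1] / 4 blur kernel and nearest-neighbour upsampling. The function names are illustrative, and the sketch shows only the classical baseline, not DPRNet's learnable differential pyramid.

```python
def blur_downsample(signal):
    """Blur with a fixed [1, 2, 1] / 4 kernel (edges replicated), then keep every second sample."""
    n = len(signal)
    blurred = [
        (signal[max(i - 1, 0)] + 2 * signal[i] + signal[min(i + 1, n - 1)]) / 4.0
        for i in range(n)
    ]
    return blurred[::2]


def upsample(signal, length):
    """Nearest-neighbour upsampling back to `length` samples."""
    return [signal[i // 2] for i in range(length)]


def build_laplacian_pyramid(signal, levels):
    """Return (details, residual): one high-frequency band per level plus the coarse remainder."""
    details = []
    current = list(signal)
    for _ in range(levels):
        if len(current) < 2:
            break
        low = blur_downsample(current)
        up = upsample(low, len(current))
        # The detail band is whatever the blurred, downsampled copy failed to keep.
        details.append([c - u for c, u in zip(current, up)])
        current = low
    return details, current


def reconstruct(details, residual):
    """Invert the pyramid by adding each detail band back, coarse to fine."""
    current = list(residual)
    for band in reversed(details):
        up = upsample(current, len(band))
        current = [u + d for u, d in zip(up, band)]
    return current
```

Because the kernel is fixed, the detail bands depend only on the input signal. The abstract's argument is that learning this decomposition captures the high-frequency content of HDR images better than a fixed kernel does.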

[CV-67] A comprehensive review of datasets and deep learning techniques for vision in Unmanned Surface Vehicles

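Quick read: This paper addresses the lack of a systematic review of vision for Unmanned Surface Vehicles (USVs), in particular of the current state, limitations, and trends of USV vision datasets and deep learning techniques. The key to the solution is a comprehensive review that covers the datasets and deep learning techniques used for USV vision tasks and, based on a detailed analysis of both, identifies challenges and potential opportunities for research and development.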

Link: https://arxiv.org/abs/2412.01461
Authors: Linh Trinh, Siegfried Mercelis, Ali Anwar
Keywords: Unmanned Surface Vehicles, Surface Vehicles, Unmanned Surface, USVs, capable of supporting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unmanned Surface Vehicles (USVs) have emerged as a major platform in maritime operations, capable of supporting a wide range of applications. USVs can help reduce labor costs, increase safety, save energy, and allow for difficult unmanned tasks in harsh maritime environments. With the rapid development of USVs, many vision tasks such as detection and segmentation become increasingly important. Datasets play an important role in encouraging and improving the research and development of reliable vision algorithms for USVs. In this regard, a large number of recent studies have focused on the release of vision datasets for USVs. Along with the development of datasets, a variety of deep learning techniques have also been studied, with a focus on USVs. However, there is a lack of a systematic review of recent studies in both datasets and vision techniques to provide a comprehensive picture of the current development of vision on USVs, including limitations and trends. In this study, we provide a comprehensive review of both USV datasets and deep learning techniques for vision tasks. Our review was conducted using a large number of vision datasets from USVs. We elaborate several challenges and potential opportunities for research and development in USV vision based on a thorough analysis of current datasets and deep learning techniques.

[CV-68] Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond

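Quick read: This paper addresses the quality degradation of underwater images caused by light refraction and absorption, including color cast, haziness, and limited visibility, which hurts the performance of autonomous underwater vehicles (AUVs) in marine applications. The key to the solution is a lightweight phase-based transformer network with only 1.77M parameters for underwater image restoration (UIR). Its core is a phase-based self-attention mechanism that extracts non-contaminated features, together with an optimized phase attention block that propagates prominent attentive features to restore structural information. The method is evaluated on synthetic and real-world underwater image datasets and is also shown to work for low-light image enhancement; through extensive ablation studies and comparative analysis, the authors report that it outperforms existing state-of-the-art (SOTA) methods.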

Link: https://arxiv.org/abs/2412.01456
Authors: MD Raqib Khan, Anshul Negi, Ashutosh Kulkarni, Shruti S. Phutke, Santosh Kumar Vipparthi, Subrahmanyam Murala
Keywords: Quality degradation, absorption by water, leading to issues, color cast, limited visibility
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 8 pages, 8 figures, conference

Abstract:Quality degradation is observed in underwater images due to the effects of light refraction and absorption by water, leading to issues like color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Through extensive ablation studies and comparative analysis, it is clear that the proposed approach outperforms existing state-of-the-art (SOTA) methods.

[CV-69] Artificial Intelligence for Geometry-Based Feature Extraction Analysis and Synthesis in Artistic Images: A Survey

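Quick read: This survey addresses the challenges AI models face in the visual arts, such as high inter-class variation, domain gaps, and the separation of style from content. The key to the solution is integrating geometric data into AI models. By extracting and using geometric data, models improve the quality of generated graphics and also distinguish style from content more effectively, drawing on inherent model biases and shared data traits. Geometric guidance further improves performance in classification and synthesis tasks and offers insights for future AI applications in the visual arts.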

Link: https://arxiv.org/abs/2412.01450
Authors: Mridula Vijendran, Jingjing Deng, Shuang Chen, Edmond S. L. Ho, Hubert P. H. Shum
Keywords: Artificial Intelligence significantly, Artificial Intelligence, Intelligence significantly enhances, Intelligence significantly, generating digitized artistic
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 56 pages, 8 tables, 1 figure (35 embedded images), Artificial Intelligence Review (AIR) 2024

Abstract:Artificial Intelligence significantly enhances the visual art industry by analyzing, identifying and generating digitized artistic images. This review highlights the substantial benefits of integrating geometric data into AI models, addressing challenges such as high inter-class variations, domain gaps, and the separation of style from content by incorporating geometric information. Models not only improve AI-generated graphics synthesis quality, but also effectively distinguish between style and content, utilizing inherent model biases and shared data traits. We explore methods like geometric data extraction from artistic images, the impact on human perception, and its use in discriminative tasks. The review also discusses the potential for improving data quality through innovative annotation techniques and the use of geometric data to enhance model adaptability and output refinement. Overall, incorporating geometric guidance boosts model performance in classification and synthesis tasks, providing crucial insights for future AI applications in the visual arts domain.

[CV-70] DiffPatch: Generating Customizable Adversarial Patches using Diffusion Model

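Quick read: This paper addresses the difficulty existing adversarial patch generation methods have in balancing attack effectiveness against stealthiness; patches printed on clothing in particular tend to look unnatural and are easy to notice. The key to the solution is DiffPatch, a diffusion-based framework that lets users start from a reference image rather than random noise and uses masks to produce naturalistic patches of various shapes, not only squares. Null-text inversion and Incomplete Diffusion Optimization (IDO) keep the semantics of the original image from being lost during the diffusion process, so the patches stay natural-looking while achieving attack performance comparable to existing non-naturalistic patches.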

Link: https://arxiv.org/abs/2412.01440
Authors: Zhixiang Wang, Guangnan Ye, Xiaosen Wang, Siheng Chen, Zhibo Wang, Xingjun Ma, Yu-Gang Jiang
Keywords: evade person detectors, person detectors, printed on clothing, clothing can easily, easily allow individuals
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Physical adversarial patches printed on clothing can easily allow individuals to evade person detectors. However, most existing adversarial patch generation methods prioritize attack effectiveness over stealthiness, resulting in patches that are aesthetically unpleasing. Although existing methods using generative adversarial networks or diffusion models can produce more natural-looking patches, they often struggle to balance stealthiness with attack effectiveness and lack flexibility for user customization. To address these challenges, we propose a novel diffusion-based customizable patch generation framework termed DiffPatch, specifically tailored for creating naturalistic and customizable adversarial patches. Our approach enables users to utilize a reference image as the source, rather than starting from random noise, and incorporates masks to craft naturalistic patches of various shapes, not limited to squares. To prevent the original semantics from being lost during the diffusion process, we employ Null-text inversion to map random noise samples to a single input image and generate patches through Incomplete Diffusion Optimization (IDO). Notably, while maintaining a natural appearance, our method achieves a comparable attack performance to state-of-the-art non-naturalistic patches when using similarly sized attacks. Using DiffPatch, we have created a physical adversarial T-shirt dataset, AdvPatch-1K, specifically targeting YOLOv5s. This dataset includes over a thousand images across diverse scenarios, validating the effectiveness of our attack in real-world environments. Moreover, it provides a valuable resource for future research.

[CV-71] Semantic Scene Completion with Multi-Feature Data Balancing Network

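Quick read: This paper addresses key challenges in Semantic Scene Completion (SSC): producing a detailed 3D model of an indoor scene from a single 2D image despite data imbalance, inter-class ambiguity, and intra-class diversity. The key to the solution is the Multi-Feature Data Balancing Network (MDBNet), a dual-head model that takes RGB and depth (F-TSDF) inputs. Its hybrid encoder-decoder architecture, with identity transformation in a pre-activation residual module (ITRM), handles the diverse signals within F-TSDF. The paper also evaluates RGB feature fusion strategies and uses a combined loss: cross-entropy for the 2D RGB features and weighted cross-entropy for the 3D SSC predictions. Experiments on the NYU datasets show MDBNet outperforming comparable state-of-the-art (SOTA) methods.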

Link: https://arxiv.org/abs/2412.01431
Authors: Mona Alawadh, Mahesan Niranjan, Hansung Kim
Keywords: Semantic Scene Completion, Scene Completion, computer vision, virtual reality, critical task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic Scene Completion (SSC) is a critical task in computer vision that is used in applications such as virtual reality (VR). SSC aims to construct detailed 3D models from partial views by transforming a single 2D image into a 3D representation, assigning each voxel a semantic label. The main challenge lies in completing 3D volumes with limited information, compounded by data imbalance, inter-class ambiguity, and intra-class diversity in indoor scenes. To address this, we propose the Multi-Feature Data Balancing Network (MDBNet), a dual-head model for RGB and depth data (F-TSDF) inputs. Our hybrid encoder-decoder architecture with identity transformation in a pre-activation residual module (ITRM) effectively manages diverse signals within F-TSDF. We evaluate RGB feature fusion strategies and use a combined loss function: cross-entropy for 2D RGB features and weighted cross-entropy for 3D SSC predictions. MDBNet results surpass comparable state-of-the-art (SOTA) methods on NYU datasets, demonstrating the effectiveness of our approach.
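
Weighted cross-entropy is a common way to counter the class imbalance the abstract mentions. The sketch below is a generic plain-Python version, assuming per-voxel class probabilities are already available. The class weights and the normalisation by the summed target weights are illustrative choices, not values or conventions taken from the paper.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), normalised by the summed weights of the targets."""
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total_loss += -w * math.log(p[y])
        total_weight += w
    return total_loss / total_weight


# Hypothetical example: class 0 is frequent (e.g. empty space), class 1 is rare.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
labels = [0, 0, 1]
plain = weighted_cross_entropy(probs, labels, [1.0, 1.0])
reweighted = weighted_cross_entropy(probs, labels, [1.0, 5.0])
```

With equal weights this reduces to ordinary cross-entropy. Raising the weight of the rare class makes the poorly predicted rare voxel dominate the loss, which is the intended effect when one class vastly outnumbers the others.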

[CV-72] MVImgNet2.0: A Larger-scale Dataset of Multi-view Images SIGGRAPH

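Quick read: This paper addresses how to build an effective bridge between 2D and 3D vision in order to improve large-scale 3D reconstruction models. The key to the solution is MVImgNet2.0, a large-scale dataset that expands the original MVImgNet to about 520k objects in 515 categories, with higher quality and broader application potential. Its main features are: (i) most shoots provide 360-degree views, supporting the learning of complete object reconstruction; (ii) an improved segmentation procedure yields more accurate foreground object masks; (iii) a more powerful structure-from-motion method lowers camera pose estimation error; (iv) high-quality dense point clouds are reconstructed with advanced methods for downstream applications. These improvements make MVImgNet2.0 valuable for boosting the performance of large 3D reconstruction models.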

Link: https://arxiv.org/abs/2412.01430
Authors: Xiaoguang Han, Yushuang Wu, Luyue Shi, Haolin Liu, Hongjie Liao, Lingteng Qiu, Weihao Yuan, Xiaodong Gu, Zilong Dong, Shuguang Cui
Keywords: multi-view images, large-scale dataset, real-world objects, objects, including multi-view images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2024

Abstract:MVImgNet is a large-scale dataset that contains multi-view images of ~220k real-world objects in 238 classes. As a counterpart of ImageNet, it introduces 3D visual signals via multi-view shooting, making a soft bridge between 2D and 3D vision. This paper constructs the MVImgNet2.0 dataset that expands MVImgNet into a total of ~520k objects and 515 categories, which derives a 3D dataset with a larger scale that is more comparable to ones in the 2D domain. In addition to the expanded dataset scale and category range, MVImgNet2.0 is of a higher quality than MVImgNet owing to four new features: (i) most shoots capture 360-degree views of the objects, which can support the learning of object reconstruction with completeness; (ii) the segmentation manner is advanced to produce foreground object masks of higher accuracy; (iii) a more powerful structure-from-motion method is adopted to derive the camera pose for each frame of a lower estimation error; (iv) higher-quality dense point clouds are reconstructed via advanced methods for objects captured in 360-degree views, which can serve for downstream applications. Extensive experiments confirm the value of the proposed MVImgNet2.0 in boosting the performance of large 3D reconstruction models. MVImgNet2.0 will be public at this http URL, including multi-view images of all 520k objects, the reconstructed high-quality point clouds, and data annotation codes, hoping to inspire the broader vision community.

[CV-73] CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

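Quick read: This paper addresses a notable gap in Diffusion Transformer (DiT)-based video generation: controllable camera pose. Existing methods such as OpenSora do not precisely follow intended trajectories and physical interactions, which limits flexibility in downstream applications. The key to the solution is CPA (Camera-Pose-Awareness), a unified camera-pose-aware text-to-video generation approach. It uses a Sparse Motion Encoding (SME) module to turn camera pose information into a spatial-temporal embedding and a Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block, which lets it describe camera movement in detail and combine textual, visual, and spatial conditions. The plug-in architecture is compatible with the original DiT parameters, supports diverse camera poses and flexible object movement, and improves trajectory consistency and object consistency in long video generation.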

Link: https://arxiv.org/abs/2412.01429
Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Keywords: Diffusion Transformer, significant advancements made, made by Diffusion, camera pose perspectives, significant advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do NOT adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.

[CV-74] FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration

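Quick read: This paper addresses the poor real-world generalization of existing image restoration models, which stems mainly from training on small-scale synthetic datasets with limited degradation types. The key to the solution is a million-scale, high-quality real-world dataset with a larger sample count and more diverse degradation types. By adjusting internal camera settings and external imaging conditions, the authors' data acquisition system captures aligned image pairs over multiple rounds. The paper also proposes FoundIR, a robust model that combines a diffusion-based generalist model with degradation-aware specialist models; the generalist learns degradation-agnostic common representations from diverse inputs, guided by an incremental learning strategy, and the specialists refine restoration in complex scenarios.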

Link: https://arxiv.org/abs/2412.01427
Authors: Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
Keywords: significant progress made, small-scale synthetic datasets, universal image restoration, significant progress, progress made
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: real-world samples with larger-scale, and degradation types with higher diversity. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning the degradation-agnostic common representations from diverse inputs, where incremental learning strategy is adopted to better guide model training. To refine the model’s restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

[CV-75] MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection

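Quick read: This paper addresses the difficulty CNN-based and Transformer-based methods have in delivering both high accuracy and real-time speed for 2D keypoint detection. The key to the solution is MamKPD, which the authors describe as the first efficient Mamba-based pose estimation framework. MamKPD introduces a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill pose cues within each patch, compensating for the limited information interaction between patches in the conventional Mamba module. Combined with Mamba for global modeling, MamKPD effectively extracts instance pose information and performs strongly on the COCO, MPII, and AP-10K datasets while using far fewer parameters.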

Link: https://arxiv.org/abs/2412.01422
Authors: Yonghao Dang, Liyuan Liu, Hui Kang, Ping Ye, Jianqin Yin
Keywords: keypoint detection plays, computer vision, plays an essential, essential role, role in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver superior performance and real-time speed at the same time. This paper introduces MamKPD, the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Subsequently, by combining Mamba for global modeling across all patches, MamKPD effectively extracts instances’ pose information. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset with 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at this https URL.

[CV-76] CellSeg1: Robust Cell Segmentation with One Training Image

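Quick read: This paper addresses the fact that, for continually emerging cell types and imaging techniques, existing cell segmentation models still need large amounts of annotated data for fine-tuning. The key to the solution is CellSeg1, which applies Low-Rank Adaptation (LoRA) to the Segment Anything Model (SAM) so that cells of arbitrary morphology and imaging modality can be segmented from only a few dozen annotated cells. In experiments, CellSeg1 trained on a single image reaches an average mAP of 0.81 at 0.5 IoU, comparable to existing models trained on more than 500 images, and generalizes better in cross-dataset tests. High-quality annotation of densely packed cells of varied sizes is key to effective segmentation.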

Link: https://arxiv.org/abs/2412.01410
Authors: Peilin Zhou, Bo Du, Yongchao Xu
Keywords: Recent trends, shifted towards universal, imaging modalities, morphologies and imaging, handle diverse cell
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Recent trends in cell segmentation have shifted towards universal models to handle diverse cell morphologies and imaging modalities. However, for continuously emerging cell types and imaging techniques, these models still require hundreds or thousands of annotated cells for fine-tuning. We introduce CellSeg1, a practical solution for segmenting cells of arbitrary morphology and modality with a few dozen cell annotations in 1 image. By adopting Low-Rank Adaptation of the Segment Anything Model (SAM), we achieve robust cell segmentation. Tested on 19 diverse cell datasets, CellSeg1 trained on 1 image achieved 0.81 average mAP at 0.5 IoU, performing comparably to existing models trained on over 500 images. It also demonstrated superior generalization in cross-dataset tests on TissueNet. We found that high-quality annotation of a few dozen densely packed cells of varied sizes is key to effective segmentation. CellSeg1 provides an efficient solution for cell segmentation with minimal annotation effort.
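
The reason Low-Rank Adaptation can work with so few annotations is that it trains only a small low-rank update on top of frozen weights. The following plain-Python sketch shows the generic LoRA forward pass for a single linear layer. The layer sizes, rank, and scaling value are hypothetical, and nothing here reflects how CellSeg1 wires LoRA into SAM.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_frozen, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x); only A (r x d_in) and B (d_out x r) would be trained."""
    rank = len(a)
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in the low-rank factors A and B."""
    return rank * (d_in + d_out)


# Hypothetical sizes: a 768 x 768 layer adapted with rank 4.
full_params = 768 * 768
adapter_params = lora_param_count(768, 768, 4)
```

For these assumed sizes the adapter has 6,144 trainable parameters against 589,824 in the full layer, roughly 1%. When B starts at zero, the adapted layer initially behaves exactly like the frozen one.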

[CV-77] HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

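Quick read: This paper addresses joint 2D-3D multi-modal generation for autonomous driving, that is, generating camera images and LiDAR point clouds together so as to exploit the complementary information in the two sensors. The key to the solution is the HoloDrive framework, which adds BEV-to-Camera and Camera-to-BEV transform modules between the generative models and a depth prediction branch in the 2D generative model to disambiguate the un-projection from image space to BEV space. With added temporal structure and progressive training, the method can also predict future states. Experiments show significant gains over state-of-the-art methods on generation metrics.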

Link: https://arxiv.org/abs/2412.01407
Authors: Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
Keywords: LiDAR point clouds, autonomous driving, significantly improved, real-world autonomous driving, autonomous driving system
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple kinds of input modality, usually cameras and LiDARs, where they contain complementary information for generation, while existing generation methods ignore this crucial feature, resulting in the generated results only covering separate 2D or 3D information. In order to fill the gap in 2D-3D multi-modal joint generation for autonomous driving, in this paper, we propose our framework, HoloDrive, to jointly generate the camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projecting from image space to BEV space, then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single frame generation and world model benchmarks, and demonstrate our method leads to significant performance gains over SOTA methods in terms of generation metrics.
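
To see why a depth estimate disambiguates the un-projection, note that a pixel alone fixes only a ray from the camera, while a pixel plus depth fixes a single 3D point that can be binned into a bird's-eye-view (BEV) grid. The sketch below uses a standard pinhole camera model in plain Python. The intrinsics, cell size, and grid layout are hypothetical, and the paper's learned transform modules are not represented here.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates (x right, y down, z forward)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_bev_cell(point, cell_size, grid_width):
    """Drop the height axis and bin (x, z) into a BEV grid whose columns are centred on the camera."""
    x, _, z = point
    col = int(x // cell_size) + grid_width // 2
    row = int(z // cell_size)
    return (row, col)


# Hypothetical intrinsics: 500 px focal length, principal point at (320, 240).
near = unproject_pixel(420, 240, 10.0, 500.0, 500.0, 320.0, 240.0)
far = unproject_pixel(420, 240, 20.0, 500.0, 500.0, 320.0, 240.0)
near_cell = to_bev_cell(near, 0.5, 200)
far_cell = to_bev_cell(far, 0.5, 200)
```

The same pixel lands in different BEV cells depending on the assumed depth, which is the ambiguity a depth prediction branch is meant to resolve.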

[CV-78] MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation

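Quick read: This paper addresses skin lesion segmentation, where performance is limited by the need for high-resolution images and by unclear lesion boundaries, while medical devices also require lightweight, efficient models. The key to the solution is MambaU-Lite, a lightweight model that combines the strengths of Mamba and CNN architectures, with just over 400K parameters and a computational cost of more than 1G flops. To strengthen both global context and local feature extraction, the paper proposes the P-Mamba block, which combines VSS blocks with multiple pooling layers so that the model learns multiscale features effectively and segments better.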

Link: https://arxiv.org/abs/2412.01405
Authors: Thi-Nhu-Quynh Nguyen, Quang-Huy Ho, Duy-Thai Nguyen, Hoang-Minh-Quang Le, Van-Truong Pham, Thi-Thao Tran
Keywords: treating skin cancer, skin abnormalities plays, Early detection, abnormalities plays, plays a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 2 tables

Abstract:Early detection of skin abnormalities plays a crucial role in diagnosing and treating skin cancer. Segmentation of affected skin regions using AI-powered devices is relatively common and supports the diagnostic process. However, achieving high performance remains a significant challenge due to the need for high-resolution images and the often unclear boundaries of individual lesions. At the same time, medical devices require segmentation models to have a small memory footprint and low computational cost. Based on these requirements, we introduce a novel lightweight model called MambaU-Lite, which combines the strengths of Mamba and CNN architectures, featuring just over 400K parameters and a computational cost of more than 1G flops. To enhance both global context and local feature extraction, we propose the P-Mamba block, a novel component that incorporates VSS blocks alongside multiple pooling layers, enabling the model to effectively learn multiscale features and enhance segmentation performance. We evaluate the model’s performance on two skin datasets, ISIC2018 and PH2, yielding promising results. Our source code will be made publicly available at: this https URL.

[CV-79] ULSR-GS: Ultra Large-scale Surface Reconstruction Gaussian Splatting with Multi-View Geometric Consistency

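Quick read: This paper addresses the limitations of Gaussian Splatting (GS) in surface extraction from large-scale aerial imagery. The key to the solution is the ULSR-GS framework, which combines a point-to-photo partitioning approach with a multi-view optimal view matching principle to select the best training images for each sub-region and, during training, applies a densification strategy based on multi-view geometric consistency to enhance surface detail. Experiments show that ULSR-GS outperforms other state-of-the-art GS-based methods on large-scale aerial photogrammetry benchmarks, significantly improving surface extraction accuracy in complex urban environments.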

Link: https://arxiv.org/abs/2412.01402
Authors: Zhuoxiao Li, Shanliang Yao, Qizhong Gao, Angel F. Garcia-Fernandez, Yong Yue, Xiaohui Zhu
Keywords: Gaussian Splatting, high-quality scene rendering, small area surface, surface extraction ability, surface extraction tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:While Gaussian Splatting (GS) demonstrates efficient and high-quality scene rendering and small area surface extraction ability, it falls short in handling large-scale aerial image surface extraction tasks. To overcome this, we present ULSR-GS, a framework dedicated to high-fidelity surface extraction in ultra-large-scale scenes, addressing the limitations of existing GS-based mesh extraction methods. Specifically, we propose a point-to-photo partitioning approach combined with a multi-view optimal view matching principle to select the best training images for each sub-region. Additionally, during training, ULSR-GS employs a densification strategy based on multi-view geometric consistency to enhance surface extraction details. Experimental results demonstrate that ULSR-GS outperforms other state-of-the-art GS-based works on large-scale aerial photogrammetry benchmark datasets, significantly improving surface extraction accuracy in complex urban environments. Project page: this https URL.

[CV-80] Fire-Image-DenseNet (FIDN) for predicting wildfire burnt area using remote sensing data

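Quick read: This paper addresses the prediction of large wildfire events, in particular the limitations of existing physics-based models for large or long-duration fires. The key to the solution is Fire-Image-DenseNet (FIDN), a deep-learning predictive model that uses spatial features derived from near real-time and reanalysis data on the environmental and meteorological drivers of wildfire to predict wildfire extent. Trained and tested on more than 300 wildfires that occurred in the western US between 2012 and 2019, FIDN is more accurate and more stable across fire sizes and durations: its mean squared error (MSE) is about 82% and 67% lower than that of models based on cellular automata (CA) and minimum travel time (MTT), respectively, its structural similarity index measure (SSIM) averages 97%, and it is roughly three orders of magnitude faster. These improvements offer useful input for strategic planning and resource allocation in firefighting.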

Link: https://arxiv.org/abs/2412.01400
Authors: Bo Pang, Sibo Cheng, Yuhan Huang, Yufang Jin, Yike Guo, I. Colin Prentice, Sandy P. Harrison, Rossella Arcucci
Keywords: subsequent socioeconomic losses, extent of massive, ignited is essential, essential to reduce, reduce the subsequent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 11 figures

Abstract:Predicting the extent of massive wildfires once ignited is essential to reduce the subsequent socioeconomic losses and environmental damage, but challenging because of the complexity of fire behaviour. Existing physics-based models are limited in predicting large or long-duration wildfire events. Here, we develop a deep-learning-based predictive model, Fire-Image-DenseNet (FIDN), that uses spatial features derived from both near real-time and reanalysis data on the environmental and meteorological drivers of wildfire. We trained and tested this model using more than 300 individual wildfires that occurred between 2012 and 2019 in the western US. In contrast to existing models, the performance of FIDN does not degrade with fire size or duration. Furthermore, it predicts final burnt area accurately even in very heterogeneous landscapes in terms of fuel density and flammability. The FIDN model showed higher accuracy, with a mean squared error (MSE) about 82% and 67% lower than those of the predictive models based on cellular automata (CA) and the minimum travel time (MTT) approaches, respectively. Its structural similarity index measure (SSIM) averages 97%, outperforming the CA and FlamMap MTT models by 6% and 2%, respectively. Additionally, FIDN is approximately three orders of magnitude faster than both CA and MTT models. The enhanced computational efficiency and accuracy advancements offer vital insights for strategic planning and resource allocation for firefighting operations.
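
The two metrics quoted above can be made concrete with a small plain-Python sketch that compares a predicted burnt-area map with the observed one, both flattened to lists of values in [0, 1]. This is a simplified illustration: the SSIM here treats the whole map as a single window, whereas the usual metric averages over local windows, and the example maps are invented rather than taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flattened maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_ssim(x, y, data_range=1.0):
    """SSIM computed over the whole map as one window (a simplification of the windowed metric)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


# Invented 2 x 2 burnt-area maps, flattened row by row (1 = burnt, 0 = unburnt).
observed = [1.0, 0.0, 1.0, 0.0]
predicted = [1.0, 0.0, 0.0, 0.0]
error = mse(predicted, observed)
similarity = global_ssim(predicted, observed)
```

Lower MSE and higher SSIM both indicate a better match; SSIM equals 1 only when the two maps are identical.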

[CV-81] Holistic Understanding of 3D Scenes as Universal Scene Description

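Quick read: This paper addresses the under-representation of interactable and articulated objects in 3D scene understanding. The key contributions are: (1) an expertly curated dataset in the Universal Scene Description (USD) format with high-quality manual annotations for instance segmentation and articulation on 280 indoor scenes; (2) a learning-based model and a novel baseline that predict part segmentation together with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark for comparing methods on this task. The dataset provides 8 types of annotations: object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass. All data is provided in USD format, which eases interoperability and integration with downstream tasks.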

Link: https://arxiv.org/abs/2412.01398
Authors: Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, Danda Pani Paudel
Keywords: enabling mixed reality, wearable computing, mixed reality, long-standing challenge, challenge in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current works. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations for instance segmentation and articulation on 280 indoor scenes; (2) a learning-based model together with a novel baseline capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark serving to compare upcoming methods for the task at hand. Overall, our dataset provides 8 types of annotations: object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, allowing interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method’s source code.

[CV-82] Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data

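[Quick Read]: This paper addresses the challenges of using synthetic data in face recognition, in particular how to exploit synthetic data to improve model performance and to tackle demographic bias, domain adaptation, and performance constraints in demanding scenarios. The key is to promote new Generative AI methods and synthetic data through the 2nd FRCSyn-onGoing challenge, which gives researchers a platform to benchmark novel Generative AI methods and face recognition systems designed specifically to take advantage of synthetic data. The challenge explores synthetic data used both on its own and in combination with real data to address current key problems in face recognition.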

Link: https://arxiv.org/abs/2412.01383
Authors: Ivan DeAndres-Tame,Ruben Tolosana,Pietro Melzi,Ruben Vera-Rodriguez,Minchul Kim,Christian Rathgeb,Xiaoming Liu,Luis F. Gomez,Aythami Morales,Julian Fierrez,Javier Ortega-Garcia,Zhizhou Zhong,Yuge Huang,Yuxi Mi,Shouhong Ding,Shuigeng Zhou,Shuai He,Lingzhi Fu,Heng Cong,Rongyu Zhang,Zhihong Xiao,Evgeny Smirnov,Anton Pimenov,Aleksei Grigorev,Denis Timoshenko,Kaleb Mesfin Asfaw,Cheng Yaw Low,Hao Liu,Chuyi Wang,Qing Zuo,Zhixiang He,Hatef Otroshi Shahreza,Anjith George,Alexander Unnervik,Parsa Rahimi,Sébastien Marcel,Pedro C. Neto,Marco Huber,Jan Niklas Kolf,Naser Damer,Fadi Boutros,Jaime S. Cardoso,Ana F. Sequeira,Andrea Atzori,Gianni Fenu,Mirko Marras,Vitomir Štruc,Jiang Yu,Zhangjie Li,Jichun Li,Weisong Zhao,Zhen Lei,Xiangyu Zhu,Xiao-Yu Zhang,Bernardo Biesseck,Pedro Vidal,Luiz Coelho,Roger Granada,David Menotti
Keywords: Synthetic data, gaining increasing popularity, face recognition, face recognition technologies, obtaining real data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.

[CV-83] Exploring the Robustness of AI-Driven Tools in Digital Forensics: A Preliminary Study

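[Quick Read]: This paper addresses the insufficient robustness of current AI-based digital forensics tools against adversarial attacks. Specifically, it examines how tools that automatically classify data (e.g., drugs, weapons, pornographic content) may be maliciously manipulated to evade detection. Through preliminary tests and analysis, the paper reveals the shortcomings of existing tools (Magnet AI and Excire Photo AI) when handling real data and deepfake images, and offers suggestions for improving their robustness across different scenarios and against adversarial attacks.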

Link: https://arxiv.org/abs/2412.01363
Authors: Silvia Lucia Sanna,Leonardo Regano,Davide Maiorca,Giorgio Giacinto
Keywords: facilitate forensic tasks, leverage Artificial Intelligence, Artificial Intelligence, tools leverage Artificial, Nowadays
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Nowadays, many tools are used to facilitate forensic tasks such as data extraction and data analysis. In particular, some tools leverage Artificial Intelligence (AI) to automatically label examined data into specific categories (i.e., drugs, weapons, nudity). However, this raises a serious concern about the robustness of the employed AI algorithms against adversarial attacks. Indeed, some people may want to hide specific data from AI-based digital forensics tools, manipulating the content so that the AI system does not recognize the offensive/prohibited content and therefore does not mark it as suspicious for the analyst. This could be seen as an anti-forensics attack scenario. For this reason, we analyzed two of the most important forensics tools employing AI for data classification: Magnet AI, used by Magnet Axiom, and Excire Photo AI, used by X-Ways Forensics. We ran preliminary tests using about 200 images, plus another 100 sent in 3 chats about pornography and teenage nudity, drugs, and weapons, to understand how the tools label them. Moreover, we loaded some deepfake images (images generated by AI forging real ones) of some actors to understand if they would be classified in the same category as the original images. From our preliminary study, we saw that the AI algorithms are not robust enough, as we expected since these topics are still open research problems. For example, some sexual images were not categorized as nudity, and some deepfakes were categorized as the same real person, while the human eye can clearly see the nudity in the image or catch the difference between the deepfakes. Building on these results and other state-of-the-art works, we provide some suggestions for improving how digital forensics analysis tools leverage AI and their robustness against adversarial attacks or scenarios different from those seen in training.

[CV-84] Integrative CAM: Adaptive Layer Fusion for Comprehensive Interpretation of CNNs

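[Quick Read]: This paper addresses the problem that traditional gradient-based Class Activation Mapping (CAM) methods, when interpreting Convolutional Neural Networks (CNNs), rely mainly on final-layer activations and neglect critical features from intermediate layers. The key to the solution is Integrative CAM, an advanced CAM technique that fuses gradient and activation scores from all network layers and adaptively weights each layer's contribution, giving a comprehensive interpretation of the model's internal representation. The method also adds a novel bias term to the saliency map calculation, a factor often omitted in existing CAM techniques but essential for a more complete picture of feature importance, since modern CNNs rely on both weighted activations and biases to make predictions. In addition, the paper generalizes the alpha term from Grad-CAM++ to any smooth function, extending CAM to a wider range of models. In extensive experiments on diverse and complex datasets, Integrative CAM shows superior fidelity in feature importance mapping and improves interpretability for intricate fusion scenarios and complex decision-making tasks.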

Link: https://arxiv.org/abs/2412.01354
Authors: Aniket K. Singh,Debasis Chaudhuri,Manish P. Singh,Samiran Chattopadhyay
Keywords: Convolutional Neural Networks, Convolutional Neural, advanced Class Activation, Class Activation Mapping, paper introduces Integrative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the growing demand for interpretable deep learning models, this paper introduces Integrative CAM, an advanced Class Activation Mapping (CAM) technique aimed at providing a holistic view of feature importance across Convolutional Neural Networks (CNNs). Traditional gradient-based CAM methods, such as Grad-CAM and Grad-CAM++, primarily use final layer activations to highlight regions of interest, often neglecting critical features derived from intermediate layers. Integrative CAM addresses this limitation by fusing insights across all network layers, leveraging both gradient and activation scores to adaptively weight layer contributions, thus yielding a comprehensive interpretation of the model’s internal representation. Our approach includes a novel bias term in the saliency map calculation, a factor frequently omitted in existing CAM techniques, but essential for capturing a more complete feature importance landscape, as modern CNNs rely on both weighted activations and biases to make predictions. Additionally, we generalize the alpha term from Grad-CAM++ to apply to any smooth function, expanding CAM applicability across a wider range of models. Through extensive experiments on diverse and complex datasets, Integrative CAM demonstrates superior fidelity in feature importance mapping, effectively enhancing interpretability for intricate fusion scenarios and complex decision-making tasks. By advancing interpretability methods to capture multi-layered model insights, Integrative CAM provides a valuable tool for fusion-driven applications, promoting the trustworthy and insightful deployment of deep learning models.

[CV-85] See What You Seek: Semantic Contextual Integration for Cloth-Changing Person Re-Identification

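[Quick Read]: This paper addresses cloth-changing person re-identification (CC-ReID), i.e., matching the same person across different surveillance cameras despite changes in clothing. The key to the solution is a novel prompt learning framework called Semantic Contextual Integration (SCI), which uses the visual-text representation capabilities of CLIP to minimize the impact of clothing changes and enhance identity-relevant features. SCI has two core modules: the Semantic Separation Enhancement (SSE) module and the Semantic-Guided Interaction Module (SIM). SSE uses dual learnable text tokens to separately capture confounding and clothing-related semantic information, isolating identity-relevant features from distracting clothing semantics. SIM uses orthogonalized text features to guide visual representations, sharpening the model's focus on distinctive identity characteristics. Together these improve the model's discriminative power and enrich the visual context with high-dimensional semantic information.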

Link: https://arxiv.org/abs/2412.01345
Authors: Xiyu Han,Xian Zhong,Wenxin Huang,Xuemei Jia,Wenxuan Liu,Xiaohan Yu,Alex Chichung Kot
Keywords: Cloth-changing person re-identification, multiple surveillance cameras, Cloth-changing person, person re-identification, aims to match
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures, submitted to IEEE TNNLS

Abstract:Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing. Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features but often struggle to capture complex semantic information. In this paper, we propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID, which leverages the visual-text representation capabilities of CLIP to minimize the impact of clothing changes and enhance ID-relevant features. Specifically, we introduce a Semantic Separation Enhancement (SSE) module, which uses dual learnable text tokens to separately capture confounding and clothing-related semantic information, effectively isolating ID-relevant features from distracting clothing semantics. Additionally, we develop a Semantic-Guided Interaction Module (SIM) that uses orthogonalized text features to guide visual representations, sharpening the model’s focus on distinctive ID characteristics. This integration enhances the model’s discriminative power and enriches the visual context with high-dimensional semantic insights. Extensive experiments on three CC-ReID datasets demonstrate that our method outperforms state-of-the-art techniques. The code will be released on GitHub.

[CV-86] MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models ACM-MM2024

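[Quick Read]: This paper addresses the poor performance of existing text-to-video (T2V) models when generating intricate, human-centric motions. The key to the solution is MoTrans, a customized motion transfer method. It introduces a recaptioner based on a multimodal large language model (MLLM) that expands the initial prompt to focus more on appearance, together with an appearance injection module that adapts appearance priors from video frames to the motion modeling process. These complementary multimodal representations facilitate the decoupling of appearance and motion, and a motion-specific embedding further strengthens the modeling of the specific motion. Experiments show that the method effectively learns specific motion patterns from single or multiple reference videos and performs favorably against existing methods in customized video generation.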

Link: https://arxiv.org/abs/2412.01343
Authors: Xiaomin Li,Xu Jia,Qinghe Wang,Haiwen Diao,Mengmeng Ge,Pengxiang Li,You He,Huchuan Lu
Keywords: demonstrated impressive abilities, generating realistic videos, camera movement, motion, demonstrated impressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024, code will be released in this https URL

Abstract:Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new contexts. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt the appearance prior from video frames to the motion modeling process. These complementary multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion patterns from single or multiple reference videos, performing favorably against existing methods in customized video generation.

[CV-87] Negative Token Merging: Image-based Adversarial Feature Guidance

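[Quick Read]: This paper addresses the limitations of text-based adversarial guidance in capturing complex visual concepts and avoiding undesired visual elements such as copyrighted characters. The key to the solution is a new adversarial guidance method, Negative Token Merging (NegToMe), which directly uses visual features from a reference image or from other images in the batch and selectively pushes apart matching semantic features during the reverse diffusion process. Applied with respect to other images in the same batch, NegToMe significantly increases output diversity (racial, gender, visual) while preserving image quality; applied with respect to a copyrighted reference asset, it reduces visual similarity to the copyrighted content by 34.57%. The method is training-free, takes only a few lines of code, adds only about 4% to inference time, and generalizes to different diffusion architectures such as Flux.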

Link: https://arxiv.org/abs/2412.01339
Authors: Jaskirat Singh,Lindsey Li,Weijia Shi,Ranjay Krishna,Yejin Choi,Pang Wei Koh,Michael F. Cohen,Stephen Gould,Liang Zheng,Luke Zettlemoyer
Keywords: Text-based adversarial guidance, Text-based adversarial, adversarial guidance, widely adopted approach, performing adversarial guidance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to push the output features away from undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts and avoid undesired visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. In particular, we introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance by selectively pushing apart matching semantic features (between reference and output generation) during the reverse diffusion process. When used w.r.t. other images in the same batch, we observe that NegToMe significantly increases output diversity (racial, gender, visual) without sacrificing output image quality. Similarly, when used w.r.t. a reference copyrighted asset, NegToMe helps reduce visual similarity with copyrighted content by 34.57%. NegToMe is simple to implement using just a few lines of code, uses only marginally higher (4%) inference times and generalizes to different diffusion architectures like Flux, which do not natively support the use of a separate negative prompt. Code is available at this https URL
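
The idea of "pushing apart matching semantic features" can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `neg_token_merge`, the cosine-similarity matching, the `threshold`, and the extrapolation weight `alpha` are all assumptions made here for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def neg_token_merge(out_tokens, ref_tokens, alpha=0.5, threshold=0.6):
    """Push each output token away from its best-matching reference token.

    Hypothetical sketch: alpha and threshold are illustrative parameters,
    not values taken from the paper.
    """
    pushed = []
    for tok in out_tokens:
        # Find the reference token most similar to this output token.
        best = max(ref_tokens, key=lambda r: cosine(tok, r))
        if cosine(tok, best) >= threshold:
            # Linear extrapolation away from the matched reference feature.
            tok = [t + alpha * (t - b) for t, b in zip(tok, best)]
        pushed.append(tok)
    return pushed
```

Tokens whose best match falls below the threshold are left untouched, which is one simple way to make the push "selective"; in a real diffusion pipeline such a step would sit inside the attention blocks during sampling.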

[CV-88] Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

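[Quick Read]: This paper addresses the difficulty video generation models have in maintaining scenario diversity and rich content over long videos. The key to the solution is Segmented Cross-Attention (SCA), which splits the hidden states into segments along the temporal dimension and lets each segment cross-attend to its corresponding sub-caption, maintaining scenario diversity and content richness in long videos without adding parameters. The paper also builds the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, each annotated with an overall caption and five progressive sub-captions, to support high-quality long video generation. Experiments show that the resulting model, Presto, outperforms existing state-of-the-art methods on the VBench Semantic Score (78.5%) and Dynamic Degree (100%), with clear gains in content richness, long-range coherence, and the capture of textual details.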

Link: https://arxiv.org/abs/2412.01316
Authors: Xin Yan,Yuxuan Cai,Qiuyue Wang,Yuan Zhou,Wenhao Huang,Huan Yang
Keywords: diffusion model designed, video diffusion model, designed to generate, diffusion model, model designed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: this https URL.
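
The segment-to-sub-caption routing described above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions: a single attention head, no learned projections (keys double as values), and hypothetical names `cross_attend` and `segmented_cross_attention`; a real DiT block would reuse its existing cross-attention weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys):
    """Single-head attention of one query vector over key vectors (keys double as values)."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]

def segmented_cross_attention(hidden, sub_captions):
    """hidden: list of T frame vectors; sub_captions: list of K lists of text-token vectors.

    Frames are split into K contiguous temporal segments, and each frame
    attends only to the text tokens of its own segment's sub-caption.
    """
    num_seg = len(sub_captions)
    seg_len = math.ceil(len(hidden) / num_seg)
    out = []
    for t, frame in enumerate(hidden):
        seg = min(t // seg_len, num_seg - 1)  # temporal segment index of this frame
        out.append(cross_attend(frame, sub_captions[seg]))
    return out
```

With four frames and two sub-captions, frames 0 and 1 attend to the first sub-caption and frames 2 and 3 to the second; no new parameters are introduced by the routing itself.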

[CV-89] Multimodal Medical Disease Classification with LLaMA II

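[Quick Read]: This paper addresses the integration and processing of multimodal data in medicine (images, text, age, gender, and so on) to improve disease classification accuracy. The key to the solution is a transformer-based multimodal model, with a focus on how different fusion methods (early fusion and late fusion) affect performance. Experiments show that early fusion of modality-specific features works better: the best model reaches 97.10% mean AUC, compared with 96.67% mean AUC for late fusion, and both outperform earlier classification models on the same multimodal dataset. The architecture is general, can be applied to other multimodal datasets with little effort, and can be extended for further research in medical AI.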

Link: https://arxiv.org/abs/2412.01306
Authors: Christian Gapp,Elias Tappeiner,Martin Welk,Rainer Schubert
Keywords: multimodal, Medical, Abstract, data, model
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, conference: AIRoV – The First Austrian Symposium on AI, Robotics, and Vision 25.-27.3.2024, Innsbruck

Abstract:Medical patient data is always multimodal. Images, text, age, gender, and histopathological data are only a few examples of the different modalities in this context. Processing and integrating this multimodal data with deep-learning-based methods is of utmost interest due to its huge potential for medical procedures such as diagnosis and patient treatment planning. In this work we retrain a multimodal transformer-based model for disease classification. To this end we use the text-image pair dataset from OpenI consisting of 2D chest X-rays associated with clinical reports. Our focus is on fusion methods for merging text and vision information extracted from medical datasets. Different architecture structures with a LLaMA II backbone model are tested. Early fusion of modality-specific features gives better results, with the best model reaching 97.10% mean AUC, than late fusion from a deeper level of the architecture (best model: 96.67% mean AUC). Both outperform former classification models tested on the same multimodal dataset. The newly introduced multimodal architecture can be applied to other multimodal datasets with little effort and can be easily adapted for further research, especially, but not limited to, the field of medical AI.

[CV-90] Event-Based Tracking Any Point with Motion-Augmented Temporal Consistency

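[Quick Read]: This paper addresses the tendency of video-based tracking methods to lose target points under large displacements or nonlinear motion. The key to the solution is to exploit the high temporal resolution and freedom from motion blur of event cameras, using two tailored modules to handle the spatial sparsity and motion sensitivity of event data. A motion-guidance module incorporates kinematic features into local matching to resolve ambiguities caused by event sparsity, while a variable motion aware module ensures temporally consistent responses that are insensitive to varying velocities, improving matching precision. Experiments show that the method outperforms existing state-of-the-art methods while achieving 150% faster processing with a competitive number of model parameters.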

Link: https://arxiv.org/abs/2412.01300
Authors: Han Han,Wei Zhai,Yang Cao,Bin Li,Zheng-jun Zha
Keywords: plays a crucial, crucial role, Tracking Any Point, TAP, Tracking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to target point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic features into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, an event dataset for tracking any point is constructed by simulation, and is applied in experiments together with two real-world datasets. The experimental results show that the proposed method outperforms existing SOTA methods. Moreover, it achieves 150% faster processing with competitive model parameters.

[CV-91] Cross-Modal Visual Relocalization in Prior LiDAR Maps Utilizing Intensity Textures

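[Quick Read]: This paper addresses the inconsistency between 2D texture and 3D geometry in visual relocalization within prior LiDAR maps, in particular the neglect of the intensity features in the LiDAR point cloud. The key to the solution is a cross-modal visual relocalization system that uses intensity textures and consists of three main modules: map projection, coarse retrieval, and fine relocalization. The map projection module builds a database of intensity-channel map images by exploiting the dense nature of panoramic projection. The coarse retrieval module retrieves the top-K map images most similar to the query image and keeps the top-K' results through covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. Experiments show that the method is effective in both place recognition and pose estimation.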

Link: https://arxiv.org/abs/2412.01299
Authors: Qiyuan Shen,Hengwang Zhao,Weihao Yan,Chunxiang Wang,Tong Qin,Ming Yang
Keywords: drawn increasing attention, prior LiDAR maps, cross-modal visual relocalization, recent years, localization has drawn
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Cross-modal localization has drawn increasing attention in recent years, while the visual relocalization in prior LiDAR maps is less studied. Related methods usually suffer from inconsistency between the 2D texture and 3D geometry, neglecting the intensity features in the LiDAR point cloud. In this paper, we propose a cross-modal visual relocalization system in prior LiDAR maps utilizing intensity textures, which consists of three main modules: map projection, coarse retrieval, and fine relocalization. In the map projection module, we construct the database of intensity channel map images leveraging the dense characteristic of panoramic projection. The coarse retrieval module retrieves the top-K most similar map images to the query image from the database, and retains the top-K’ results by covisibility clustering. The fine relocalization module applies a two-stage 2D-3D association and a covisibility inlier selection method to obtain robust correspondences for 6DoF pose estimation. The experimental results on our self-collected datasets demonstrate the effectiveness in both place recognition and pose estimation tasks.
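
To make the map projection step more concrete, here is a minimal sketch of rendering LiDAR intensities into a panoramic image. It assumes an equirectangular (spherical) projection and nearest-point selection per pixel; the function name `panoramic_intensity_image`, the tiny default image size, and the point layout `(x, y, z, intensity)` are illustrative assumptions, and the paper's actual projection model may differ.

```python
import math

def panoramic_intensity_image(points, width=8, height=4):
    """Render points (x, y, z, intensity) around a virtual viewpoint at the origin.

    Returns a height x width grid of intensities (None where no point projects).
    Assumed model: equirectangular projection, nearest point wins per pixel.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z, intensity in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the viewpoint has no direction
        azimuth = math.atan2(y, x)      # in [-pi, pi]
        elevation = math.asin(z / r)    # in [-pi/2, pi/2]
        u = int((azimuth + math.pi) / (2 * math.pi) * width) % width
        v = min(int((math.pi / 2 - elevation) / math.pi * height), height - 1)
        if r < depth[v][u]:             # keep the nearest point per pixel
            depth[v][u] = r
            image[v][u] = intensity
    return image
```

Because a panorama covers the full sphere of viewing directions, every map point lands in some pixel, which is one way to read the "dense characteristic" the abstract mentions.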

[CV-92] I Spy With My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames WACV25

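[Quick Read]: This paper addresses the time-consuming manual annotation involved in visual framing analysis, which social scientists use to identify common themes and concepts. The key to the solution is to formulate image clustering as a Minimum Cost Multicut Problem (MP) and to study how different embedding spaces (such as DINOv2 and ConvNeXt V2) affect the clustering. The study finds that DINOv2 suits the detection of broad visual frames, while ConvNeXt V2 yields a larger number of clusters that capture fine-grained differences, such as speech and protest. Combining these insights into embedding-space differences with the optimal clustering advances automated visual frame detection.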

Link: https://arxiv.org/abs/2412.01296
Authors: Katharina Prasse,Isaac Bravo,Stefanie Walter,Margret Keuper
Keywords: determining common themes, Cost Multicut Problem, Minimum Cost Multicut, social sciences, sciences for determining
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV25 applications track

Abstract:Visual framing analysis is a key method in social sciences for determining common themes and concepts in a given discourse. To reduce manual effort, image clustering can significantly speed up the annotation process. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. We discuss the efficacy of numerous embedding spaces to detect visual frames and show its superiority over other clustering methods. To this end, we employ the climate change dataset ClimateTV, which contains images commonly used for visual frame analysis. For broad visual frames, DINOv2 is a suitable embedding space, while ConvNeXt V2 returns a larger number of clusters which contain fine-grained differences, i.e., speech and protest. Our insights into embedding space differences in combination with the optimal clustering, by definition, advance automated visual frame detection. Our code can be found at this https URL.
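
As a toy illustration of clustering from pairwise probabilities alone, the sketch below solves a tiny multicut instance by exhaustive search. The log-odds edge cost is the commonly used formulation and is assumed here rather than taken from the paper; the function names are hypothetical, and real instances need a dedicated multicut solver because the number of partitions grows far too quickly for brute force.

```python
import math
from itertools import combinations

def partitions(n):
    """Yield all cluster labelings of n items as restricted growth strings."""
    def grow(labels):
        if len(labels) == n:
            yield tuple(labels)
            return
        for lab in range(max(labels, default=-1) + 2):
            yield from grow(labels + [lab])
    yield from grow([])

def multicut_cost(labels, same_prob):
    """Sum of log-odds costs over cut pairs (pairs placed in different clusters).

    same_prob[(i, j)] is the probability, strictly between 0 and 1,
    that items i and j belong to the same cluster.
    """
    cost = 0.0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != labels[j]:
            p = same_prob[(i, j)]
            cost += math.log(p / (1.0 - p))
    return cost

def brute_force_multicut(n, same_prob):
    """Exhaustive search; only feasible for a handful of items."""
    return min(partitions(n), key=lambda labels: multicut_cost(labels, same_prob))
```

Cutting a pair that is probably together (p > 0.5) adds a positive cost, while cutting a pair that is probably apart earns a negative one, so the number of clusters falls out of the optimization instead of being fixed in advance.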

[CV-93] LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

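[Quick Read]: This paper addresses the difficulty 3D Vision-Language Models (3D-VLMs) have in accurately locating task-relevant visual information in large 3D scenes, given the high density of visual features. Existing methods segment all objects and use their features as the scene representation, but these task-agnostic object features contain much redundant information and lack detail in the task-relevant area. The proposed solution, LSceneLLM, is an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preferences for different tasks, followed by a plug-and-play scene magnifier module that captures fine-grained details in the focused areas. Specifically, a dense token selector examines the LLM's attention map to identify visual preferences for the instruction input and then magnifies fine-grained details of the focused area, and an adaptive self-attention module fuses the coarse-grained and selected fine-grained visual information. To evaluate large-scene understanding comprehensively, the paper also introduces a cross-room understanding benchmark, XR-Scene, containing a series of large-scene understanding tasks. Experiments show that the method surpasses existing methods on both large-scene and existing scene understanding benchmarks.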

Link: https://arxiv.org/abs/2412.01292
Authors: Hongyan Zhi,Peihao Chen,Junyan Li,Shuailei Ma,Xinyu Sun,Tianhang Xiang,Yinjie Lei,Mingkui Tan,Chuang Gan
Keywords: embodied question answering, Vision-Language Models, gaining increasing attention, developing embodied, embodied question
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, which is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information and miss details for the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging LLM’s visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focusing area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large scene understanding and existing scene understanding benchmarks. Plugging our scene magnifier module into the existing 3D-VLMs also brings significant improvement.

[CV-94] Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

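[Quick Read]: This paper addresses the high training cost of improving the visual perception of Multimodal Large Language Models (MLLMs). The key to the solution is VisionFuse, a novel integration framework that uses multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without additional training. It rests on two observations: different MLLMs tend to focus on different regions given the same query and image, and the feature distributions of vision encoders within one MLLM family are highly aligned. Building on these, VisionFuse enriches the visual context by concatenating the tokens produced by the vision encoders of selected MLLMs from the same family, and merges the language model parameters of these MLLMs so that a single language model aligns with multiple vision encoders, significantly reducing deployment overhead.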

Link: https://arxiv.org/abs/2412.01289
Authors: Zhuokun Chen,Jinwu Hu,Zeshuai Deng,Yufeng Wang,Bohan Zhuang,Mingkui Tan
Keywords: equip language models, vision encoders, language models, aligning vision encoders, capabilities by aligning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
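
The two training-free operations named in the abstract, concatenating visual tokens and merging language model parameters, can be sketched as follows. The helper names are hypothetical, parameters are represented as flat lists for simplicity, and plain weighted averaging is an assumption: the abstract does not say which merging rule VisionFuse actually uses.

```python
def concat_visual_tokens(token_lists):
    """Concatenate token sequences from several vision encoders along the sequence axis."""
    fused = []
    for tokens in token_lists:
        fused.extend(tokens)
    return fused

def merge_parameters(state_dicts, weights=None):
    """Weighted average of parameter dicts sharing the same keys and shapes.

    Each parameter is a flat list of floats; averaging is an illustrative
    stand-in for whatever merging rule the paper applies.
    """
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged
```

Merging is only meaningful when the language models share an architecture and a common pretrained starting point, which matches the abstract's restriction to MLLMs within one family.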

[CV-95] MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

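[Quick Read]: This paper addresses the inability of text-to-image generation models to precisely control object shape, appearance, and position through text guidance alone. Existing diffusion models usually rely on additional masks or image guidance for global layout control, and local object-editing models cannot control object position. The proposed MFTF model controls the denoising process of the diffusion model through parallel denoising: attention masks are generated dynamically from the cross-attention layers of the source diffusion model and applied to the self-attention queries, which are then modified according to the layout control parameters and injected into the target diffusion model. This gives precise control over object position without extra masks or images, supports single-object and multi-object positional control (translation, rotation, etc.), and allows layout control and object semantic editing at the same time.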

Link: https://arxiv.org/abs/2412.01284
Authors: Shan Yang
Keywords: transformative tools, control, model, MFTF model, layout control
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 12 figures

Abstract:Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.

[CV-96] Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

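[Quick Read]: This paper addresses how to preserve or improve the cross-modal alignment ability of Vision-Language Models (VLMs) while shrinking them for deployment on edge devices. The key to the solution is a Knowledge Distillation (KD) method called Align-KD, which guides the student model to learn the cross-modal matching that occurs in the shallow layers and has the teacher help the student learn the projection of vision tokens into the text embedding space, improving the student's overall capability without increasing model size or data volume. Experiments show that, with a lightweight training loss design, the 1.7B MobileVLM V2 model learns rich knowledge from the 7B teacher model and improves its average score by 2.0 across 6 benchmarks.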

Link: https://arxiv.org/abs/2412.01282
Authors: Qianhan Feng,Wenshuo Li,Tong Lin,Xinghao Chen
Keywords: bring powerful understanding, bring powerful, multimodal tasks, powerful understanding, understanding and reasoning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, there is a growing need for capable artificial intelligence on mobile devices, such as AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing large-model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: this https URL.
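
For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the generic temperature-softened distillation loss (KL divergence between teacher and student distributions). It is not the Align-KD loss, which the abstract does not spell out; the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic distillation loss: KL(teacher || student) on softened outputs.

    Assumption: this is the textbook formulation, used here only to
    illustrate the idea of a student imitating a teacher.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.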

[CV-97] AR-Facilitated Safety Inspection and Fall Hazard Detection on Construction Sites

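[Quick read]: This paper addresses the inspection of perimeter safety screens on high-rise construction sites, in particular how to track which parts have been inspected and how to detect defects in the screens automatically. The key to the solution is to combine head-mounted Augmented Reality (AR) devices with machine learning to automatically identify gaps in the safety screens and automate reporting, improving inspection efficiency and coverage. The paper also discusses concerns around worker privacy and possible ways to mitigate them.

Link: https://arxiv.org/abs/2412.01273
Authors: Jiazhou Liu,Aravinda S. Rao,Fucai Ke,Tim Dwyer,Benjamin Tag,Pari Delir Haghighi
Keywords: head-mounted augmented reality, high-rise construction sites, construction sites, exploring the potential, potential of head-mounted
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 pages, 1 figure, ISMAR24 Workshop Paper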

Abstract:Together with industry experts, we are exploring the potential of head-mounted augmented reality to facilitate safety inspections on high-rise construction sites. A particular concern in the industry is inspecting perimeter safety screens on higher levels of construction sites, intended to prevent falls of people and objects. We aim to support workers performing this inspection task by tracking which parts of the safety screens have been inspected. We use machine learning to automatically detect gaps in the perimeter screens that require closer inspection and remediation and to automate reporting. This work-in-progress paper describes the problem, our early progress, concerns around worker privacy, and the possibilities to mitigate these.

[CV-98] Ponder Press: Advancing Visual GUI Agent towards General Computer Control

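[Quick read]: This paper addresses the limited flexibility of existing graphical user interface (GUI) agents across diverse software environments and platforms, which stems from their reliance on non-visual inputs such as HTML source code or accessibility trees. The key to the solution is the Ponder Press framework, which follows a divide-and-conquer strategy: a general-purpose multimodal large language model (MLLM) acts as an "interpreter" that translates high-level user instructions into detailed action descriptions, and a GUI-specific MLLM acts as a "locator" that precisely locates the GUI elements on which to act. By relying purely on visual input, Ponder Press offers a flexible, human-like interaction paradigm applicable to a wide range of applications. Its locator outperforms existing models by 22.5% on the ScreenSpot GUI grounding benchmark, and the framework achieves state-of-the-art performance on offline and interactive agent benchmarks across various GUI environments.

Link: https://arxiv.org/abs/2412.01268
Authors: Yiqin Wang,Haoji Zhang,Jingqi Tian,Yansong Tang
Keywords: HTML source code, HTML source, agents typically depend, GUI, accessibility trees
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: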

Abstract:Most existing GUI agents typically depend on non-vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements, a critical requirement for effective GUI automation, due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action placement. By leveraging purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications. The Ponder Press locator outperforms existing models by 22.5% on the ScreenSpot GUI grounding benchmark. Both offline and interactive agent benchmarks across various GUI environments, including web pages, desktop software, and mobile UIs, demonstrate that the Ponder Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents. Refer to the project homepage this https URL

[CV-99] EdgeOAR: Real-time Online Action Recognition On Edge Devices

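[Quick read]: This paper addresses the real-time performance and efficiency of Online Action Recognition (OAR), especially in edge computing environments. The key to the solution is the EdgeOAR framework, which includes an Early Exit-oriented Task-specific Feature Enhancement Module (TFEM) whose lightweight submodules optimize features in both the temporal and spatial dimensions. EdgeOAR also uses a fusion module driven by Inverse Information Entropy (IIE) and Modality Consistency (MC) to fuse features and make better exit decisions. These designs address two challenges: robustly modeling spatio-temporal action representations from the limited initial frames of an online video stream, and balancing accuracy and efficiency on resource-constrained edge devices. Experiments show that on the UCF-101 dataset EdgeOAR substantially reduces latency and energy consumption compared with the state-of-the-art (SOTA) method while maintaining adequate accuracy.

Link: https://arxiv.org/abs/2412.01267
Authors: Wei Luo,Deyu Zhang,Ying Tang,Fan Wu,Yaoxue Zhang
Keywords: involves instantaneous analysis, paper addresses, involves instantaneous, instantaneous analysis, analysis and classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 10 figures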

Abstract:This paper addresses the challenges of Online Action Recognition (OAR), a task that involves instantaneous analysis and classification of behaviors in video streams. OAR must operate under stringent latency constraints, making it an indispensable component for real-time feedback in edge computing. Existing methods, which typically rely on the processing of entire video clips, fall short in scenarios requiring immediate recognition. To address this, we propose EdgeOAR, a novel framework designed for OAR on edge devices. EdgeOAR includes the Early Exit-oriented Task-specific Feature Enhancement Module (TFEM), which comprises lightweight submodules to optimize features in both temporal and spatial dimensions. We design an iterative training method to enable TFEM to learn features from the beginning of the video. Additionally, EdgeOAR includes an Inverse Information Entropy (IIE) and Modality Consistency (MC)-driven fusion module to fuse features and make better exit decisions. This design overcomes the two main challenges: robust modeling of spatio-temporal action representations with limited initial frames in online video streams, and balancing accuracy and efficiency on resource-constrained edge devices. Experiments show that on the UCF-101 dataset, EdgeOAR reduces latency by 99.23% and energy consumption by 99.28% compared to the state-of-the-art (SOTA) method, while achieving adequate accuracy on edge devices.
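
The abstract does not define how Inverse Information Entropy drives fusion or the exit decision. As a rough illustration only, the sketch below assumes one plausible reading: each modality's class distribution is weighted by the inverse of its entropy, and inference exits early once the fused prediction is confident enough. The function names, the `eps` constant, and the threshold are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def inverse_entropy_fusion(modality_probs, eps=1e-8):
    """Fuse per-modality distributions, weighting confident ones more.

    Assumption: weights proportional to 1 / entropy; the paper's actual
    IIE formulation may differ.
    """
    weights = [1.0 / (entropy(p) + eps) for p in modality_probs]
    total = sum(weights)
    weights = [w / total for w in weights]
    num_classes = len(modality_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, modality_probs))
        for c in range(num_classes)
    ]


def should_exit(fused_probs, threshold=0.9):
    """Early-exit rule: stop processing frames once the top class is confident."""
    return max(fused_probs) >= threshold
```

Under this reading, a low-entropy (confident) modality dominates the fused prediction, so a clear signal in one stream can trigger an exit after only a few frames.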

[CV-100] NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

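[Quick read]: This paper addresses the degradation of prompt learning in vision-language foundation models such as CLIP caused by the noisy labels commonly found in real-world datasets. The key to the solution is PromptMAE, which uses mean absolute error (MAE) loss in prompt learning to improve robustness to noisy labels while maintaining high accuracy. Using feature learning theory, the paper shows that MAE suppresses the influence of noisy samples, improving the signal-to-noise ratio and overall robustness. The paper also proposes PromptOT, a prompt-based optimal transport data purification method that builds an optimal transport matrix to partition the dataset into clean and noisy subsets, applying cross-entropy loss to the former and MAE loss to the latter. The resulting Noise-Label Prompt Learning method (NLPrompt) leverages the expressive representations and precise alignment of vision-language models to provide a simple and efficient approach to robust prompt learning.

Link: https://arxiv.org/abs/2412.01256
Authors: Bikang Pan,Qun Li,Xiaoying Tang,Wei Huang,Zhen Fang,Feng Liu,Jingya Wang,Jingyi Yu,Ye Shi
Keywords: prompt learning, revolutionized image-text representation, enabling a broad, learning, prompt
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: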

Abstract:The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text encoder representations in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representation and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
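
To make the robustness argument concrete, here is a small sketch comparing cross-entropy and MAE on predicted class probabilities. For a softmax output, MAE against a one-hot label equals 2(1 - p_y), so it is bounded by 2, whereas cross-entropy grows without bound on a confidently wrong (possibly mislabeled) sample. The `is_clean` flag stands in for the output of a purification step such as PromptOT; how that flag is computed is not shown, and the function names are illustrative assumptions.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss; unbounded as probs[label] approaches 0."""
    return -math.log(probs[label])


def mae_loss(probs, label):
    """L1 distance to the one-hot label; equals 2 * (1 - probs[label])."""
    return sum(
        abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs)
    )


def noise_aware_loss(probs, label, is_clean):
    """Cross-entropy on samples judged clean, MAE on samples judged noisy.

    Assumption: `is_clean` comes from an external partitioning step.
    """
    return cross_entropy(probs, label) if is_clean else mae_loss(probs, label)
```

A mislabeled sample that the model confidently contradicts therefore contributes at most 2 to the MAE objective, which limits how strongly it can pull the prompts toward the wrong class.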

[CV-101] EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation

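[Quick read]: This paper addresses fine-grained expression control in identity-preserving portrait generation. Existing methods usually produce only neutral or stereotypical expressions and, even with control signals such as facial landmarks, struggle to generate accurate and vivid expressions that follow user instructions. The key to the solution is EmojiDiff, an end-to-end solution for simultaneous dual control of fine expression and identity. Unlike conventional methods that use coarse control signals, EmojiDiff directly accepts RGB expression images as input templates, giving accurate, fine-grained expression control in the diffusion process. Its core innovation is a decoupled scheme that disentangles the expression features in the template from extraneous information such as identity, skin, and style. Specifically, ID-irrelevant Data Iteration (IDI) synthesizes high-quality cross-identity expression pairs for decoupled training, and reference expression features are injected only into carefully selected expression-sensitive layers to prevent style leakage. In addition, the ID-enhanced Contrast Alignment (ICA) strategy further improves identity fidelity by eliminating the negative impact of expression control on identity preservation. Experiments show that the method clearly outperforms existing methods in precise expression control and identity preservation and generalizes well to various diffusion models.

Link: https://arxiv.org/abs/2412.01254
Authors: Liangwei Jiang,Ruida Li,Zhifeng Zhang,Shuo Fang,Chenguang Ma
Keywords: identity-preserving portrait generation, expression, textbf, expression control, paper aims
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: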

Abstract:This paper aims to bring fine-grained expression control to identity-preserving portrait generation. Existing methods tend to synthesize portraits with either neutral or stereotypical expressions. Even when supplemented with control signals like facial landmarks, these models struggle to generate accurate and vivid expressions following user instructions. To solve this, we introduce EmojiDiff, an end-to-end solution to facilitate simultaneous dual control of fine expression and identity. Unlike the conventional methods using coarse control signals, our method directly accepts RGB expression images as input templates to provide extremely accurate and fine-grained expression control in the diffusion process. At its core, an innovative decoupled scheme is proposed to disentangle expression features in the expression template from other extraneous information, such as identity, skin, and style. On one hand, we introduce ID-irrelevant Data Iteration (IDI) to synthesize extremely high-quality cross-identity expression pairs for decoupled training, which is the crucial foundation to filter out identity information hidden in the expressions. On the other hand, we meticulously investigate network layer function and select expression-sensitive layers to inject reference expression features, effectively preventing style leakage from expression signals. To further improve identity fidelity, we propose a novel fine-tuning strategy named ID-enhanced Contrast Alignment (ICA), which eliminates the negative impact of expression control on original identity preservation. Experimental results demonstrate that our method remarkably outperforms counterparts, achieves precise expression control with highly maintained identity, and generalizes well to various diffusion models.

[CV-102] Multimodal Fusion Learning with Dual Attention for Medical Imaging

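[Quick read]: This paper addresses three key limitations of multimodal fusion learning for disease classification: poor generalizability to other diagnosis tasks, failure to fully exploit complementary information across multiple health records, and reliance on a single attention mechanism. The key to the solution is a Dual Robust Information Fusion Attention mechanism (DRIFA) consisting of two attention modules: a multi-branch fusion attention module and a multimodal information fusion attention module. DRIFA can be combined with any deep neural network to form a multimodal fusion learning framework called DRIFA-Net. With this design, DRIFA enhances the representation of each modality (such as dermoscopy, pap smear, MRI, and CT scans) and learns more refined shared multimodal representations, improving the network's generalization across tasks and its overall performance. The paper also uses an ensemble Monte Carlo dropout strategy to estimate the uncertainty of DRIFA-Net predictions.

Link: https://arxiv.org/abs/2412.01248
Authors: Joy Dhar,Nayyar Zaidi,Maryam Haghighat,Puneet Goyal,Sudipta Roy,Azadeh Alavi,Vikas Kumar
Keywords: shown significant promise, fusion attention, brain tumors, information fusion attention, fusion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages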

Abstract:Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations. First, they often lack generalizability to other diagnosis tasks due to their focus on a particular disease. Second, they do not fully leverage multiple health records from diverse modalities to learn robust complementary information. Finally, they typically rely on a single attention mechanism, missing the benefits of multiple attention strategies within and across various modalities. To address these issues, this paper proposes a dual robust information fusion attention mechanism (DRIFA) that leverages two attention modules, i.e., the multi-branch fusion attention module and the multimodal information fusion attention module. DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework denoted as DRIFA-Net. We show that the multi-branch fusion attention of DRIFA learns enhanced representations for each modality, such as dermoscopy, pap smear, MRI, and CT-scan, whereas the multimodal information fusion attention module learns more refined multimodal shared representations, improving the network's generalization across multiple tasks and enhancing overall performance. Additionally, to estimate the uncertainty of DRIFA-Net predictions, we employ an ensemble Monte Carlo dropout strategy. Extensive experiments on five publicly available datasets with diverse modalities demonstrate that our approach consistently outperforms state-of-the-art methods. The code is available at this https URL.
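
As background on the uncertainty estimate mentioned above, the following is a minimal sketch of plain Monte Carlo dropout on a toy linear model: dropout stays active at inference, the same input is passed through several times, and the spread of the outputs serves as an uncertainty proxy. It does not reproduce the paper's ensemble variant or DRIFA-Net itself; the toy model, names, and defaults are assumptions.

```python
import random
import statistics


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy linear model with dropout left on."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += (x * w) / keep  # inverted-dropout rescaling
    return total


def mc_dropout_predict(features, weights, drop_prob=0.5, passes=100, seed=0):
    """Return (mean prediction, standard deviation) over stochastic passes."""
    rng = random.Random(seed)
    outputs = [
        stochastic_forward(features, weights, drop_prob, rng)
        for _ in range(passes)
    ]
    return statistics.mean(outputs), statistics.pstdev(outputs)
```

A larger standard deviation across passes indicates that the prediction depends heavily on which units are dropped, which is commonly read as higher model uncertainty.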

[CV-103] Class Distance Weighted Cross Entropy Loss for Classification of Disease Severity

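[Quick read]: This paper addresses the poor performance of traditional categorical loss functions such as Cross-Entropy (CE) in disease severity assessment with ordinal classes. The key to the solution is a new loss function, Class Distance Weighted Cross-Entropy (CDW-CE), which penalizes misclassifications more harshly the farther apart the classes are, improving performance on ordinal image classification tasks. In experiments on the Labeled Images for Ulcerative Colitis (LIMUC) dataset, CDW-CE not only improves classification performance but also yields better class separation, as shown by t-SNE visualizations and quantified with the Silhouette Score, and its Class Activation Maps (CAM) align better with the judgment of clinical experts, showing a stronger focus on clinically significant regions.

Link: https://arxiv.org/abs/2412.01246
Authors: Gorkem Polat,Ümit Mert Çağlar,Alptekin Temizel
Keywords: Assessing disease severity, disease severity involving, represents increasing levels, Assessing disease, class represents increasing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: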

Abstract:Assessing disease severity involving ordinal classes, where each class represents increasing levels of severity, benefits from loss functions that account for this ordinal structure. Traditional categorical loss functions, like Cross-Entropy (CE), often perform suboptimally in these scenarios. To address this, we propose a novel loss function, Class Distance Weighted Cross-Entropy (CDW-CE), which penalizes misclassifications more harshly when classes are farther apart. We evaluated CDW-CE on the Labeled Images for Ulcerative Colitis (LIMUC) dataset using various deep architectures. Its performance was compared against several categorical and ordinal loss functions. To analyze the quality of latent representations, we used t-distributed stochastic neighbor embedding (t-SNE) visualizations and quantified their clustering with the Silhouette Score. We also compared Class Activation Maps (CAM) generated by models trained with CDW-CE and CE loss, incorporating domain expert feedback to evaluate alignment with expert knowledge. Our results show that CDW-CE consistently improves performance in ordinal image classification tasks. It achieves higher Silhouette Scores, indicating better differentiation of class representations, and its CAM visualizations demonstrate a stronger focus on clinically significant regions, as confirmed by domain experts.
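
The abstract does not state the CDW-CE formula. The sketch below shows one formulation consistent with its description: probability mass placed on a wrong class i is penalized by -log(1 - p_i), weighted by the class distance |i - c| raised to a power alpha. Treat the exact form, the `alpha` default, and the function name as assumptions rather than the authors' definition.

```python
import math


def cdw_ce(probs, true_class, alpha=2.0, eps=1e-12):
    """Class-distance-weighted loss for ordinal labels (illustrative form).

    loss = -sum_i log(1 - p_i) * |i - true_class| ** alpha
    The true class has distance 0 and contributes nothing directly.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        distance = abs(i - true_class)
        if distance == 0:
            continue
        # eps guards against log(0) when a wrong class gets probability 1.
        loss -= math.log(max(1.0 - p, eps)) * distance ** alpha
    return loss
```

With this form, predicting severity grade 3 for a grade 0 image costs far more than predicting grade 1 with the same confidence, which is the ordinal behaviour the abstract describes.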

[CV-104] Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

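[Quick read]: This paper addresses the tendency of large-scale diffusion models to generate unwanted content, such as sexually explicit or violent content, alongside high-quality images. The key to the solution is a new targeted concept replacement method that introduces a dedicated concept localizer and a training-free Dual Prompts Cross-Attention (DPCA) module, which together identify and replace the target concept during the denoising process while minimizing the impact on non-target regions. The localizer is trained with few-shot learning and needs only a small amount of labeled data, giving precise localization and efficient replacement of specific concepts without harming the overall consistency of the image.

Link: https://arxiv.org/abs/2412.01244
Authors: Lingyun Zhang,Yu Xie,Yanwei Fu,Ping Chen
Keywords: producing high-quality images, generate unwanted content, diffusion models continue, large-scale diffusion models, continue to advance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: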

Abstract:As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacing in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.

[CV-105] Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

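[Quick read]: This paper addresses the reliance of diffusion and flow models on a fixed denoising schedule during inference, which can limit inference efficiency and flexibility across different prompts. The key to the solution is the Time Prediction Diffusion Model (TPDM), built around a plug-and-play Time Prediction Module (TPM) that predicts the next noise level from the current latent features at each denoising step. The TPM is trained with reinforcement learning to maximize a reward that discounts the final image quality by the number of denoising steps. This adaptive scheduler not only produces high-quality images but also adjusts the number and timing of denoising steps on the fly, improving both performance and efficiency.

Link: https://arxiv.org/abs/2412.01243
Authors: Zilyu Ye,Zhiyang Chen,Tiancheng Li,Zemin Huang,Weijian Luo,Guo-Jun Qi
Keywords: achieved remarkable successes, Time Prediction Diffusion, Time Prediction Module, Prediction Diffusion Model, Time Prediction
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: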

Abstract:Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
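
The two ideas in this abstract, an adaptive schedule and a step-discounted reward, can be illustrated with a short sketch. It assumes an exponential discount `gamma ** num_steps` and a stand-in predictor that sees only the current noise level, whereas the paper's TPM conditions on latent features. Both choices and all names here are hypothetical.

```python
def discounted_reward(image_quality, num_steps, gamma=0.97):
    """Reward that shrinks as more denoising steps are used (assumed form)."""
    return (gamma ** num_steps) * image_quality


def run_adaptive_schedule(predict_next_level, start_level=1.0,
                          stop_level=1e-3, max_steps=100):
    """Collect the noise levels chosen step by step by a predictor.

    `predict_next_level` is a hypothetical stand-in for a learned module;
    it must return a level strictly below the current one.
    """
    levels = [start_level]
    while levels[-1] > stop_level and len(levels) - 1 < max_steps:
        next_level = predict_next_level(levels[-1])
        if not 0.0 <= next_level < levels[-1]:
            raise ValueError("predicted noise level must decrease")
        levels.append(next_level)
    return levels
```

Under this reward, a schedule that reaches the same image quality in fewer steps scores higher, which is the pressure that pushes the predictor toward shorter schedules when the prompt allows it.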

[CV-106] Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes

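[Quick read]: This paper addresses the weak performance of the segmentation foundation models SAM and SAM 2 on context-dependent (CD) concepts. The key to the solution is a unified evaluation framework that supports manual, automatic, and intermediate self-prompting, aided by dedicated prompt generation and interaction strategies. The paper also explores the in-context learning potential of SAM 2 and introduces prompt robustness testing to simulate imperfect real-world prompts, giving a comprehensive evaluation of how well SAM and SAM 2 understand and segment CD concepts. Through these analyses, the paper aims to reveal the strengths and limitations of the SAM family on CD concepts and to guide the design of future models such as SAM 3.

Link: https://arxiv.org/abs/2412.01240
Authors: Xiaoqi Zhao,Youwei Pang,Shijie Chang,Yuan Zhao,Lihe Zhang,Huchuan Lu,Jinsong Ouyang,Georges El Fakhri,Xiaofeng Liu
Keywords: significantly influenced multiple, influenced multiple fields, SAM, computer vision, poised to make
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: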

Abstract:As a foundational model, SAM has significantly influenced multiple fields within computer vision, and its upgraded version, SAM 2, enhances capabilities in video segmentation, poised to make a substantial impact once again. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context-independent concepts like people, cars, and roads, they overlook more challenging context-dependent (CD) concepts, such as visual saliency, camouflage, product defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough quantitative evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self-prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in-context learning and introduce prompt robustness testing to simulate real-world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks. This work aims to provide valuable insights to guide future research in both context-independent and context-dependent concepts segmentation, potentially informing the development of the next version - SAM 3.

[CV-107] PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

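[Quick read]: This paper addresses the problems that diffusion-based image inpainting methods have with semantic consistency between image and text and with matching users' editing habits. The key to the solution is PainterNet, a plugin that can be flexibly embedded into various diffusion models and that improves performance through the following contributions: 1) local prompt input, 2) Attention Control Points (ACP), and 3) Actual-Token Attention Loss (ATAL), which together sharpen the model's focus on local areas; 4) a redesigned mask generation algorithm that simulates how users apply masks; and 5) a customized training dataset, PainterData, and a benchmark dataset, PainterBench. With these improvements, PainterNet surpasses existing state-of-the-art models on key metrics including image quality and global/local text consistency.

Link: https://arxiv.org/abs/2412.01223
Authors: Ruichen Wang,Junliang Zhang,Qingsong Xie,Chen Chen,Haonan Lu
Keywords: exhibited superior performance, diffusion models, exhibited superior, superior performance, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: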

Abstract:Recently, diffusion models have exhibited superior performance in the area of image inpainting. Inpainting methods based on diffusion models can usually generate realistic, high-quality image content for masked areas. However, due to the limitations of diffusion models, existing methods typically encounter problems with semantic consistency between images and text and with the editing habits of users. To address these issues, we present PainterNet, a plugin that can be flexibly embedded into various diffusion models. To generate image content in the masked areas that closely aligns with the user input prompt, we propose local prompt input, Attention Control Points (ACP), and Actual-Token Attention Loss (ATAL) to enhance the model's focus on local areas. Additionally, we redesign the MASK generation algorithm for the training and testing datasets to simulate the user's habit of applying MASK, and introduce a customized new training dataset, PainterData, and a benchmark dataset, PainterBench. Our extensive experimental analysis shows that PainterNet surpasses existing state-of-the-art models in key metrics including image quality and global/local text consistency.

[CV-108] RGBDS-SLAM: A RGB-D Semantic Dense SLAM Based on 3D Multi Level Pyramid Gaussian Splatting

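[Quick read]: This paper addresses the detail and consistency problems of existing dense SLAM (Simultaneous Localization and Mapping) systems in RGB, depth, and semantic reconstruction of scenes. The key to the solution is the RGBDS-SLAM system, based on 3D multi-level pyramid Gaussian splatting, which extracts multi-level image pyramids for Gaussian splatting training to restore scene details and ensure consistent RGB, depth, and semantic reconstructions. The paper also designs a tightly coupled multi-feature reconstruction optimization mechanism so that the reconstruction accuracies of the RGB, depth, and semantic maps reinforce each other during rendering optimization.

Link: https://arxiv.org/abs/2412.01217
Authors: Zhenzhong Cao
Keywords: Gaussian Splatting, pyramid gaussian splatting, depth, RGB, dense SLAM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: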

Abstract:High-quality reconstruction is crucial for dense SLAM. Recent popular approaches utilize 3D Gaussian Splatting (3D GS) techniques for RGB, depth, and semantic reconstruction of scenes. However, these methods often overlook issues of detail and consistency in different parts of the scene. To address this, we propose RGBDS-SLAM, an RGB-D semantic dense SLAM system based on 3D multi-level pyramid Gaussian splatting, which enables high-quality dense reconstruction of scene RGB, depth, and semantics. In this system, we introduce a 3D multi-level pyramid Gaussian splatting method that restores scene details by extracting multi-level image pyramids for Gaussian splatting training, ensuring consistency in RGB, depth, and semantic reconstructions. Additionally, we design a tightly-coupled multi-feature reconstruction optimization mechanism, allowing the reconstruction accuracy of RGB, depth, and semantic maps to mutually enhance each other during the rendering optimization process. Extensive quantitative, qualitative, and ablation experiments on the Replica and ScanNet public datasets demonstrate that our proposed method outperforms current state-of-the-art methods. The open-source code will be available at: this https URL.
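
For context on the "multi-level image pyramid" used as training input, here is a minimal sketch that builds a pyramid by repeated 2x2 average pooling on a single-channel image stored as nested lists. How the paper actually constructs its pyramids (filter, number of levels, handling of depth and semantic maps) is not stated in the abstract, so this construction is an assumption.

```python
def downsample_2x(image):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


def build_pyramid(image, levels):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

Supervising a renderer at several such resolutions is a common way to balance coarse structure against fine detail.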

[CV-109] Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

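[Quick read]: This paper addresses model adaptation under domain shift in clinical applications, specifically Diabetic Retinopathy (DR) grading. The key to the solution is a new method, Generative Unadversarial ExampleS (GUES), which performs domain adaptation from a data-centric perspective. Specifically, GUES learns a perturbation generation function expressed with a Variational AutoEncoder, in which the encoder predicts the latent input using the reparameterization trick and the decoder generates the perturbation. The saliency map is chosen as the pseudo-perturbation label because it captures potential lesions and provides an upper bound on the function input, enabling identification of the latent variable. Experiments show that GUES performs well on DR benchmarks and remains robust even with small batch sizes.

Link: https://arxiv.org/abs/2412.01203
Authors: Wenxin Su,Song Tang,Xiaofeng Liu,Xiaojing Yi,Mao Ye,Chunxiao Zu,Jiahao Li,Xiatian Zhu
Keywords: Diabetic Retinopathy, poses a significant, Domain shift, Online Model-aGnostic Domain, Model-aGnostic Domain Adaptation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: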

Abstract:Domain shift (the difference between source and target domains) poses a significant challenge in clinical applications, e.g., Diabetic Retinopathy (DR) grading. Despite considering certain clinical requirements, like source data privacy, conventional transfer methods are predominantly model-centered and often struggle to prevent model-targeted attacks. In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. This setting is characterized by the absence of the model and the flow of target data. To tackle the new challenge, we propose a novel approach, Generative Unadversarial ExampleS (GUES), which enables adaptation from a data-centric perspective. Specifically, we first theoretically reformulate conventional perturbation optimization in a generative way: learning a perturbation generation function with a latent input variable. During model instantiation, we leverage a Variational AutoEncoder to express this function. The encoder with the reparameterization trick predicts the latent input, whilst the decoder is responsible for the generation. Furthermore, the saliency map is selected as the pseudo-perturbation label, because it not only captures potential lesions but also theoretically provides an upper bound on the function input, enabling the identification of the latent variable. Extensive comparative experiments on DR benchmarks with both frozen pre-trained models and trainable models demonstrate the superiority of GUES, showing robustness even with small batch size.
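
The reparameterization trick mentioned in the abstract is a standard VAE component, and can be sketched independently of GUES: instead of sampling z directly from N(mu, sigma^2), sample eps from N(0, 1) and compute z = mu + sigma * eps, so that z stays a deterministic function of the encoder outputs. The sketch below uses plain Python lists and the common log-variance parameterization; it shows nothing of the paper's encoder or decoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise.

    `mu` and `log_var` play the role of encoder outputs; `rng` is a
    random.Random instance so that sampling is reproducible.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]
```

In an autodiff framework the same expression lets gradients flow to `mu` and `log_var`, because the randomness is isolated in `eps`.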

[CV-110] Neuron Abandoning Attention Flow: Visual Explanation of Dynamics inside CNN Models

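[Quick read]: This paper addresses the visual explanation of how attention evolves inside convolutional neural networks (CNNs) as they make classification decisions. The key to the solution is a method called Neuron Abandoning Attention Flow (NAFlow), which uses a novel cascading neuron abandoning back-propagation algorithm to trace the neurons in all layers that take part in the decision, removing the significant interference caused by abandoned neurons. Specifically, a Neuron Abandoning Back-Propagation (NA-BP) module generates Back-Propagated Feature Maps (BPFM) using the inverse function of the intermediate layers and abandons the neurons not used for decision-making. The cascading NA-BP modules also compute importance coefficient tensors, which are linearly combined with the BPFM tensors to form the NAFlow. In addition, to visualize attention flow for similarity metric-based CNN models, the paper proposes a new channel contribution weights module that computes the importance coefficients via the Jacobian matrix. The method is validated on nine widely used CNN models across general image classification, contrastive learning classification, few-shot image classification, and image retrieval.

Link: https://arxiv.org/abs/2412.01202
Authors: Yi Liao,Yongsheng Gao,Weichuan Zhang
Keywords: neuron abandoning back-propagation, evolution dynamics inside, dynamics inside CNNs, Neuron Abandoning Attention, Neuron Abandoning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: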

Abstract:In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method to address the open problem of visually explaining the attention evolution dynamics inside CNNs when making their classification decisions. A novel cascading neuron abandoning back-propagation algorithm is designed to trace the neurons in all layers of a CNN that are involved in making its prediction, addressing the problem of significant interference from abandoned neurons. Firstly, a Neuron Abandoning Back-Propagation (NA-BP) module is proposed to generate Back-Propagated Feature Maps (BPFM) by using the inverse function of the intermediate layers of CNN models, on which the neurons not used for decision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate the tensors of importance coefficients, which are linearly combined with the tensors of BPFMs to form the NAFlow. Secondly, to be able to visualize attention flow for similarity metric-based CNN models, a new channel contribution weights module is proposed to calculate the importance coefficients via the Jacobian matrix. The effectiveness of the proposed NAFlow is validated on nine widely-used CNN models for various tasks of general image classification, contrastive learning classification, few-shot image classification, and image retrieval.

[CV-111] TinyFusion: Diffusion Transformers Learned Shallow

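【Quick Read】: This paper addresses the inference overhead that excessive parameterization imposes on diffusion transformers in image generation. The key to the solution is TinyFusion, a depth pruning method that removes redundant layers through end-to-end learning. The core principle is to create a pruned model with high recoverability, so that it regains strong performance after fine-tuning. To this end, the paper introduces a differentiable sampling technique that makes pruning learnable, paired with a co-optimized parameter that simulates future fine-tuning. Unlike prior work that minimizes loss or error right after pruning, TinyFusion explicitly models and optimizes the post-fine-tuning performance of the pruned model. Experiments show that this learnable approach clearly outperforms existing importance-based and error-based methods for layer pruning of diffusion transformers and generalizes well across architectures such as DiTs, MARs, and SiTs.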

Link: https://arxiv.org/abs/2412.01199
Authors: Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang
Keywords: demonstrated remarkable capabilities, considerable inference overhead, excessive parameterization, resulting in considerable, real-world applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2× speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at this https URL.
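
The abstract mentions a differentiable sampling technique that makes the pruning decision learnable, without naming the estimator. One common way to relax a discrete choice is the Gumbel-softmax trick, sketched below in plain Python. This is a swapped-in illustration under that assumption, not the authors' implementation; `keep_logits` and the three candidates are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed one-hot sample over the candidates scored by `logits`."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                     # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)                    # fixed seed for reproducibility
keep_logits = [0.2, 1.5, -0.3]            # hypothetical learnable scores, one per candidate layer mask
probs = gumbel_softmax(keep_logits, temperature=0.5, rng=rng)
```

In an autodiff framework the same computation lets gradients flow back to the logits, so the choice of which layers to keep can be trained jointly with the weights. Lower temperatures push the sample closer to a hard one-hot selection.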

[CV-112] InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences

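【Quick Read】: This paper addresses the inconsistency and inefficiency of existing Customized Concept Swapping (CCS) methods when swapping concepts in images. The key to the solution is InstantSwap, a new method that automatically extracts the bounding box (bbox) of the object in the source image through attention map analysis and uses it to keep both foreground and background consistent. Background consistency is achieved by removing the gradient outside the bbox during swapping; foreground consistency is achieved with a cross-attention mechanism that injects semantic information into both the source and target concepts. To speed up swapping, the method computes gradients periodically instead of at every timestep, which reduces the number of forward passes and substantially improves efficiency.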

Link: https://arxiv.org/abs/2412.01197
Authors: Chenyang Zhu, Kai Li, Yue Ma, Longxiang Tang, Chengyu Fang, Chubin Chen, Qifeng Chen, Xiu Li
Keywords: Customized Concept Swapping, customized target concept, Recent advances, Customized Concept, model to swap
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL . Github Page: this https URL

Abstract:Recent advances in Customized Concept Swapping (CCS) enable a text-to-image model to swap a concept in the source image with a customized target concept. However, the existing methods still face the challenges of inconsistency and inefficiency. They struggle to maintain consistency in both the foreground and background during concept swapping, especially when the shape difference is large between objects. Additionally, they either require time-consuming training processes or involve redundant calculations during inference. To tackle these issues, we introduce InstantSwap, a new CCS method that aims to handle sharp shape disparity at speed. Specifically, we first extract the bbox of the object in the source image automatically based on attention map analysis and leverage the bbox to achieve both foreground and background consistency. For background consistency, we remove the gradient outside the bbox during the swapping process so that the background is free from being modified. For foreground consistency, we employ a cross-attention mechanism to inject semantic information into both source and target concepts inside the box. This helps learn semantic-enhanced representations that encourage the swapping process to focus on the foreground objects. To improve swapping speed, we avoid computing gradients at each timestep but instead calculate them periodically to reduce the number of forward passes, which improves efficiency a lot with a little sacrifice on performance. Finally, we establish a benchmark dataset to facilitate comprehensive evaluation. Extensive evaluations demonstrate the superiority and versatility of InstantSwap. Project Page: this https URL
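
Two mechanisms from the abstract are simple enough to sketch: zeroing the gradient outside the bounding box so the background stays untouched, and recomputing the gradient only every few timesteps. The sketch below uses nested lists for a 2D gradient; the function names, the `(x0, y0, x1, y1)` box convention, and the period value are illustrative assumptions rather than details from the paper.

```python
def mask_gradient(grad, bbox):
    """Zero every gradient entry outside bbox = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = bbox
    return [
        [g if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x, g in enumerate(row)]
        for y, row in enumerate(grad)
    ]


def gradient_steps(num_steps, period):
    """Timesteps at which the gradient is recomputed; other steps reuse the cached one."""
    return [t for t in range(num_steps) if t % period == 0]


grad = [[1.0, 1.0, 1.0] for _ in range(3)]     # toy 3x3 gradient
masked = mask_gradient(grad, bbox=(1, 1, 3, 3))
recompute_at = gradient_steps(num_steps=10, period=4)
```

With a period of 4, only 3 of the 10 steps need a fresh gradient, which is where the reduction in forward passes comes from.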

[CV-113] MeasureNet: Measurement Based Celiac Disease Identification

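【Quick Read】: This paper addresses the time cost and high inter-observer variability of manually measuring the villi-to-crypt length ratio in the small intestine when diagnosing celiac disease. The key to the solution is MeasureNet, a pathologically driven polyline detection framework that combines polyline localization with object-driven losses designed specifically for measurement tasks. The paper also uses a segmentation model to provide auxiliary guidance on crypt location, and a mask feature mixup technique to make the model more robust and less dependent on the segmentation mask. Using a new dataset of 750 annotated duodenum biopsy images, MeasureNet reaches 82.66% classification accuracy on binary classification and 81% on multi-class grading.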

Link: https://arxiv.org/abs/2412.01182
Authors: Aayush Kumar Tyagi, Vaibhav Mishra, Ashok Tiwari, Lalita Mehra, Prasenjit Das, Govind Makharia, Prathosh AP, Mausam
Keywords: autoimmune disorder triggered, Celiac disease, consumption of gluten, autoimmune disorder, disorder triggered
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Celiac disease is an autoimmune disorder triggered by the consumption of gluten. It causes damage to the villi, the finger-like projections in the small intestine that are responsible for nutrient absorption. Additionally, the crypts, which form the base of the villi, are also affected, impairing the regenerative process. The deterioration in villi length, computed as the villi-to-crypt length ratio, indicates the severity of celiac disease. However, manual measurement of villi-crypt length can be both time-consuming and susceptible to inter-observer variability, leading to inconsistencies in diagnosis. While some methods can perform measurement as a post-hoc process, they are prone to errors in the initial stages. This gap underscores the need for pathologically driven solutions that enhance measurement accuracy and reduce human error in celiac disease assessments. Our proposed method, MeasureNet, is a pathologically driven polyline detection framework incorporating polyline localization and object-driven losses specifically designed for measurement tasks. Furthermore, we leverage a segmentation model to provide auxiliary guidance about crypt location when crypts are partially visible. To ensure that the model is not overdependent on the segmentation mask, we enhance model robustness through a mask feature mixup technique. Additionally, we introduce a novel dataset for grading celiac disease, consisting of 750 annotated duodenum biopsy images. MeasureNet achieves an 82.66% classification accuracy for binary classification and 81% accuracy for multi-class grading of celiac disease. Code: this https URL
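
Once villi and crypts have been detected as polylines, the severity indicator named in the abstract, the villi-to-crypt length ratio, reduces to a short computation. The sketch below assumes each structure is given as a list of 2D points in pixel coordinates; the function names and the sample coordinates are hypothetical.

```python
import math


def polyline_length(points):
    """Sum of Euclidean distances between consecutive vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def villus_crypt_ratio(villus, crypt):
    """Length ratio of a villus polyline to its crypt polyline."""
    crypt_len = polyline_length(crypt)
    if crypt_len == 0:
        raise ValueError("crypt polyline has zero length")
    return polyline_length(villus) / crypt_len


villus = [(0, 0), (3, 4), (3, 10)]  # hypothetical detected villus, length 5 + 6
crypt = [(0, 0), (0, 5.5)]          # hypothetical detected crypt, length 5.5
ratio = villus_crypt_ratio(villus, crypt)
```

A lower ratio corresponds to shortened villi relative to the crypts, which is the deterioration the abstract describes.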

[CV-114] Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video IROS2024

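【Quick Read】: This paper addresses the trade-off between accurate reconstruction and smooth motion in human mesh reconstruction from monocular video. Existing methods extract either local temporal correlations or global temporal dependencies, and the lack of complementary long-term information and local details limits their performance. The key to the solution is a Dual-branch Graph Transformer network (DGTR) with a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch. GMA uses a global transformer to model long-term human motion, while LDR combines modulated graph convolutional networks with the transformer framework to aggregate local information from adjacent frames and extract crucial information about human details. Experiments show that DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy and maintains competitive motion smoothness while using fewer parameters and FLOPs.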

Link: https://arxiv.org/abs/2412.01179
Authors: Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li
Keywords: Human Mesh Reconstruction, Human Mesh, Mesh Reconstruction, monocular video plays, mesh reconstruction methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IROS 2024. Project page: this https URL

Abstract:Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods design networks based on either RNNs or attention mechanisms to extract local temporal correlations or global temporal dependencies, but the lack of complementary long-term information and local details limits their performance. To address this problem, we propose a Dual-branch Graph Transformer network for 3D human mesh Reconstruction from video, named DGTR. DGTR employs a dual-branch network including a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch to parallelly extract long-term dependencies and local crucial information, helping model global human motion and local human details (e.g., local motion, tiny movement). Specifically, GMA utilizes a global transformer to model long-term human motion. LDR combines modulated graph convolutional networks and the transformer framework to aggregate local information in adjacent frames and extract crucial information of human details. Experiments demonstrate that our DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy and maintains competitive motion smoothness. Moreover, DGTR utilizes fewer parameters and FLOPs, which validates the effectiveness and efficiency of the proposed DGTR. Code is publicly available at this https URL.

[CV-115] OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?

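【Quick Read】: This paper addresses the comprehensive evaluation of large multi-modal models (LMMs) on oracle bone inscription (OBI) processing, in particular whole-process tasks that demand expert-level domain knowledge and deliberate cognition. The key to the solution is OBI-Bench, a holistic benchmark of 5,523 carefully collected images from diverse sources covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. The images span centuries of archaeological findings and years of research by front-line scholars, and cover multi-stage font appearances from excavation to synthesis, such as original oracle bones, inked rubbings, oracle bone fragments, cropped single characters, and handprinted characters. OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs with tasks like those experts face. An evaluation of 6 proprietary and 17 open-source LMMs shows that these models remain far from public-level human performance on some fine-grained perception tasks, yet match untrained humans on the deciphering task, indicating a notable ability to offer new interpretative perspectives and generate creative guesses.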

Link: https://arxiv.org/abs/2412.01175
Authors: Zijian Chen, Tingzhu Chen, Wenjun Zhang, Guangtao Zhai
Keywords: systematically evaluate large, demanding expert-level domain, oracle bone inscriptions, holistic benchmark crafted, whole-process oracle bone
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 31 pages, 18 figures

Abstract:We introduce OBI-Bench, a holistic benchmark crafted to systematically evaluate large multi-modal models (LMMs) on whole-process oracle bone inscriptions (OBI) processing tasks demanding expert-level domain knowledge and deliberate cognition. OBI-Bench includes 5,523 meticulously collected diverse-sourced images, covering five key domain problems: recognition, rejoining, classification, retrieval, and deciphering. These images span centuries of archaeological findings and years of research by front-line scholars, comprising multi-stage font appearances from excavation to synthesis, such as original oracle bone, inked rubbings, oracle bone fragments, cropped single character, and handprinted character. Unlike existing benchmarks, OBI-Bench focuses on advanced visual perception and reasoning with OBI-specific knowledge, challenging LMMs to perform tasks akin to those faced by experts. The evaluation of 6 proprietary LMMs as well as 17 open-source LMMs highlights the substantial challenges and demands posed by OBI-Bench. Even the latest versions of GPT-4o, Gemini 1.5 Pro, and Qwen-VL-Max are still far from public-level humans in some fine-grained perception tasks. However, they perform at a level comparable to untrained humans in deciphering task, indicating remarkable capabilities in offering new interpretative perspectives and generating creative guesses. We hope OBI-Bench can facilitate the community to develop domain-specific multi-modal foundation models towards ancient language research and delve deeper to discover and enhance these untapped potentials of LMMs.

[CV-116] OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

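【Quick Read】: This paper addresses any-to-any generation across modalities, such as text-to-image, text-to-audio, and audio-to-image synthesis. The key to the solution is extending the rectified flow (RF) framework to handle the joint distribution of multiple modalities and introducing a new guidance mechanism that lets users flexibly control the alignment between modalities in the generated output. The paper also proposes an architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 to support audio and text generation; the extended modules can be pretrained individually and then merged with the vanilla text-to-image MMDiT for fine-tuning, which improves efficiency. Finally, the paper presents a comprehensive study of design choices for rectified flow transformers in large-scale audio and text generation, offering insights into optimizing performance across modalities.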

Link: https://arxiv.org/abs/2412.01169
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Keywords: generative model designed, generative model, model designed, synthesis, introduce OmniFlow
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 14 figures

Abstract:We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The Code will be available at this https URL.
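
For readers unfamiliar with rectified flow, the single-modality building block that OmniFlow extends can be sketched in a few lines: a sample is moved along a straight line between two endpoints, and the network is trained to predict the constant velocity along that line. The sketch assumes the convention `x_t = (1 - t) * x0 + t * x1`; which endpoint is data and which is noise varies between papers, and the multi-modal coupling and guidance described in the abstract are not modeled here.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between x0 and x1 at time t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path, the regression target in rectified flow."""
    return [b - a for a, b in zip(x0, x1)]


x0 = [0.0, 2.0]   # hypothetical start sample
x1 = [4.0, 6.0]   # hypothetical end sample
midpoint = interpolate(x0, x1, 0.5)
velocity = target_velocity(x0, x1)
```

Because the path is a straight line, the velocity does not depend on `t`, which is what allows sampling with relatively few integration steps.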

[CV-117] Object Agnostic 3D Lifting in Space and Time

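【Quick Read】: This paper addresses category-agnostic 3D lifting of 2D keypoints over a temporal sequence. The solution rests on two core principles: first, when data about a specific object is lacking, general information from similar objects can be leveraged to improve performance; second, although temporal information matters, the most critical information lies in immediate temporal proximity. These principles allow the method to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of objects.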

Link: https://arxiv.org/abs/2412.01166
Authors: Christopher Fusco, Mosam Dabhi, Shin-Fang Ch’ng, Simon Lucey
Keywords: perspective on category-agnostic, present a spatio-temporal, spatio-temporal perspective, Abstract, model space-time dependencies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 3DV 2025

Abstract:We present a spatio-temporal perspective on category-agnostic 3D lifting of 2D keypoints over a temporal sequence. Our approach differs from existing state-of-the-art methods that are either: (i) object agnostic, but can only operate on individual frames, or (ii) can model space-time dependencies, but are only designed to work with a single object category. Our approach is grounded in two core principles. First, when there is a lack of data about an object, general information from similar objects can be leveraged for better performance. Second, while temporal information is important, the most critical information is in immediate temporal proximity. These two principles allow us to outperform current state-of-the-art methods on per-frame and per-sequence metrics for a variety of objects. Lastly, we release a new synthetic dataset containing 3D skeletons and motion sequences of a diverse set of animals. Dataset and code will be made publicly available.

[CV-118] ControlFace: Harnessing Facial Parametric Control for Face Rigging

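【Quick Read】: This paper addresses the limitations of existing face rigging methods, which rely on image datasets, require individual-specific fine-tuning, and struggle to retain fine-grained identity and semantic details. The key to the solution is ControlFace, a new face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. Its core design is a dual-branch U-Net architecture in which the FaceNet branch captures identity and fine details and the other branch focuses on generation, together with a control mixer module and a reference control guidance method that improve control precision and steer generation toward better control adherence. Training on a facial video dataset lets ControlFace fully exploit FaceNet's rich representations while ensuring control adherence, yielding superior identity preservation and control precision.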

Link: https://arxiv.org/abs/2412.01160
Authors: Wooseok Jang, Youngjun Hong, Gunho Cha, Seungryong Kim
Keywords: meet specific controls, computer vision, meet specific, complex task, task in computer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project website: this https URL

Abstract:Manipulation of facial images to meet specific controls such as pose, expression, and lighting, also known as face rigging, is a complex task in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. We employ a dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, the control mixer module encodes the correlated features between the target-aligned control and reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet’s rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace’s superior performance in identity preservation and control precision, highlighting its practicality. Please see the project website: this https URL.

[CV-119] A2VIS: Amodal-Aware Approach to Video Instance Segmentation

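【Quick Read】: This paper addresses occlusion in video instance-level tasks such as Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). The key to the solution is a new framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to understand both the visible and the occluded parts of objects in a video. The core intuition is that awareness of amodal segmentation along the spatiotemporal dimension provides a stable stream of object information: when objects are partially or completely hidden, amodal segmentation stays more consistent and changes less dramatically over time than visible segmentation. To this end, the paper introduces the spatiotemporal-prior Amodal Mask Head, which leverages visible information within clips while extracting amodal characteristics across clips, to address the challenge of video amodal segmentation.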

Link: https://arxiv.org/abs/2412.01147
Authors: Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
Keywords: Handling occlusion remains, Multiple Object Tracking, Handling occlusion, Video Instance Segmentation, Multiple Object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project page: this https URL

Abstract:Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information intra clips while extracting amodal characteristics inter clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.

[CV-120] Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes

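【Quick Read】: This paper addresses the limited applicability of existing hyperspectral 3D imaging techniques to dynamic scenes, which stems from their long acquisition times or reliance on large, expensive systems. The proposed solution is Dense Dispersed Structured Light (DDSL), which uses stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film, together with spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns and thus speed up acquisition. The paper also formulates an image formation model and a reconstruction method to estimate a hyperspectral image and a depth map from the captured stereo images. Experiments show that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps for dynamic scenes.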

Link: https://arxiv.org/abs/2412.01140
Authors: Suhyun Shin, Seungwoo Yoon, Ryota Maeda, Seung-Hwan Baek
Keywords: enabling comprehensive geometric, Dispersed Structured Light, Dense Dispersed Structured, enabling comprehensive, material analysis
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral 3D imaging captures both depth maps and hyperspectral images, enabling comprehensive geometric and material analysis. Recent methods achieve high spectral and depth accuracy; however, they require long acquisition times often over several minutes or rely on large, expensive systems, restricting their use to static scenes. We present Dense Dispersed Structured Light (DDSL), an accurate hyperspectral 3D imaging method for dynamic scenes that utilizes stereo RGB cameras and an RGB projector equipped with an affordable diffraction grating film. We design spectrally multiplexed DDSL patterns that significantly reduce the number of required projector patterns, thereby accelerating acquisition speed. Additionally, we formulate an image formation model and a reconstruction method to estimate a hyperspectral image and depth map from captured stereo images. As the first practical and accurate hyperspectral 3D imaging method for dynamic scenes, we experimentally demonstrate that DDSL achieves a spectral resolution of 15.5 nm full width at half maximum (FWHM), a depth error of 4 mm, and a frame rate of 6.6 fps.

[CV-121] TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

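【Quick Read】: This paper addresses the limited effectiveness of scene text recognition (STR) models caused by insufficient or low-quality training data. The key to the solution is TextSSR, a new framework that synthesizes STR data with a diffusion-based universal text region synthesis model. TextSSR ensures accuracy by generating text within a specified image region and using rich glyph and position information, which makes the text region simpler to generate than the entire image; it also uses neighboring text as a prompt to capture real-world font styles and layout patterns, so the generated text resembles actual scenes. Because it is prompt-free and supports character-level synthesis, TextSSR scales well, and the authors build an anagram-based TextSSR-F dataset with 0.4 million text instances. Experiments show that models trained with added TextSSR-F data are more accurate than models trained on 4 million existing synthetic samples, and the accuracy gap to models trained entirely on real-world data is less than 3.7%.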

Link: https://arxiv.org/abs/2412.01137
Authors: Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
Keywords: collecting sufficient high-quality, Scene text recognition, trained STR models, STR models, text
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create the less complex text region compared to the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR enjoys a wonderful scalability and we construct an anagram-based TextSSR-F dataset with 0.4 million text instances with complexity and realism. Experiments show that models trained on added TextSSR-F data exhibit better accuracy compared to models trained on 4 million existing synthetic data. Moreover, its accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR’s effectiveness and its great potential in scene text image synthesis. Our code is available at this https URL.
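
The abstract describes the TextSSR-F dataset as anagram-based but does not spell out the procedure. One plausible reading, sketched below as an assumption, is that new target strings are produced by permuting the characters of existing words, so the character distribution stays realistic while the strings themselves are new. The corpus and function name are hypothetical.

```python
import random


def make_anagram(word, rng):
    """Return a random permutation of the characters in `word`."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
corpus = ["street", "coffee", "exit"]  # hypothetical source words
labels = [make_anagram(word, rng) for word in corpus]
```

Each generated label would then be rendered into an image region by the synthesis model, with the label itself serving as the recognition ground truth.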

[CV-122] Referring Video Object Segmentation via Language-aligned Track Selection

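【Quick Read】: This paper addresses the way inconsistent object tracking disrupts vision-language alignment in Referring Video Object Segmentation (RVOS). The key to the solution is a new framework, Selection by Object Language Alignment (SOLA), which reformulates RVOS as two sub-problems: track generation and track selection. For track generation, it uses the vision foundation model Segment Anything Model 2 (SAM2) to produce mask tracks that are consistent across frames, giving reliable candidates for both foreground and background objects. For track selection, it proposes a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. With this design, SOLA achieves state-of-the-art performance on the MeViS dataset, superior zero-shot results on Ref-Youtube-VOS and Ref-DAVIS, and strong generalization and robustness under corruptions such as added Gaussian noise or motion blur.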

Link: https://arxiv.org/abs/2412.01136
Authors: Seongchan Kim, Woojeong Jin, Sangbeom Lim, Heeji Yoon, Hyunwook Choi, Seungryong Kim
Keywords: Referring Video Object, Video Object Segmentation, natural language expressions, Referring Video, Object Segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is available at this https URL

Abstract:Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems, track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a light yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and alignment of the vision language. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at this https URL
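
The track selection step can be pictured as scoring every candidate mask track against the language expression and keeping the best match. The sketch below uses cosine similarity between fixed feature vectors and picks a single track; this is a simplified stand-in under stated assumptions, since the abstract describes a learned selection module and does not say how scores are computed or how many tracks are kept. All names and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_track(track_features, text_feature):
    """Return (index of the best-aligned track, all alignment scores)."""
    scores = [cosine(feat, text_feature) for feat in track_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


track_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical per-track embeddings
text_feature = [0.0, 2.0]                              # hypothetical expression embedding
best_index, scores = select_track(track_features, text_feature)
```

Separating generation from selection means the mask quality is inherited from the track generator, and only the comparatively small scoring component needs to learn the language alignment.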

[CV-123] Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks

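【Quick Read】: This paper examines the use of video question answering (VideoQA) in traffic monitoring, particularly for complex, real-time queries such as “How many red cars passed in the last 10 minutes?” or “Was there an incident between 3:00 PM and 3:05 PM?”. The core of the work is an evaluation of state-of-the-art VideoQA models on non-benchmark synthetic and real-world traffic sequences, measuring accuracy, relevance, and consistency across basic detection, temporal reasoning, and decomposition queries. Using a GPT-4o-based assessment framework, the study finds that VideoLLaMA-2 performs best, reaching 57% accuracy with particular strength in compositional reasoning and consistent answers. However, all models show limitations in multi-object tracking, temporal coherence, and complex scene interpretation, indicating that current architectures need improvement in these areas before VideoQA can become indispensable for traffic monitoring.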

Link: https://arxiv.org/abs/2412.01132
Authors: Joseph Raj Vishal, Divesh Basina, Aarya Choudhary, Bharatesh Chakravarthi
Keywords: offer promising applications, video question answering, Recent advances, efficient video interpretation, video question
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in video question answering (VideoQA) offer promising applications, especially in traffic monitoring, where efficient video interpretation is critical. Within ITS, answering complex, real-time queries like “How many red cars passed in the last 10 minutes?” or “Was there an incident between 3:00 PM and 3:05 PM?” enhances situational awareness and decision-making. Despite progress in vision-language models, VideoQA remains challenging, especially in dynamic environments involving multiple objects and intricate spatiotemporal relationships. This study evaluates state-of-the-art VideoQA models using non-benchmark synthetic and real-world traffic sequences. The framework leverages GPT-4o to assess accuracy, relevance, and consistency across basic detection, temporal reasoning, and decomposition queries. VideoLLaMA-2 excelled with 57% accuracy, particularly in compositional reasoning and consistent answers. However, all models, including VideoLLaMA-2, faced limitations in multi-object tracking, temporal coherence, and complex scene interpretation, highlighting gaps in current architectures. These findings underscore VideoQA’s potential in traffic monitoring but also emphasize the need for improvements in multi-object tracking, temporal reasoning, and compositional capabilities. Enhancing these areas could make VideoQA indispensable for incident detection, traffic flow management, and responsive urban planning. The study’s code and framework are open-sourced for further exploration: this https URL

[CV-124] Object Tracking in a 360° View: A Novel Perspective on Bridging the Gap to Biomedical Advancements

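【Quick Read】: This paper reviews object tracking and its role in modern innovation, with an emphasis on biomedical research. It notes that in cell biology, object tracking is vital for uncovering mechanisms such as cell migration, interactions, and responses to drugs or pathogens, which advances the understanding of disease progression and therapeutic interventions. The central theme is the shift from traditional feature-based approaches to advanced machine learning and deep learning frameworks: deep learning models offer greater accuracy, adaptability, and robustness, addressing the weaknesses of classical methods in complex environments with occlusions, variable lighting, and high object density. The paper categorizes tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, and discusses the limitations of current methods and emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.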

Link: https://arxiv.org/abs/2412.01119
Authors: Mojtaba S. Fazli, Shannon Quinn
Keywords: autonomous vehicles, Object tracking, modern innovation, fundamental tool, tool in modern
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 56 Pages

Abstract:Object tracking is a fundamental tool in modern innovation, with applications in defense systems, autonomous vehicles, and biomedical research. It enables precise identification, monitoring, and spatiotemporal analysis of objects across sequential frames, providing insights into dynamic behaviors. In cell biology, object tracking is vital for uncovering cellular mechanisms, such as migration, interactions, and responses to drugs or pathogens. These insights drive breakthroughs in understanding disease progression and therapeutic interventions. Over time, object tracking methods have evolved from traditional feature-based approaches to advanced machine learning and deep learning frameworks. While classical methods are reliable in controlled settings, they struggle in complex environments with occlusions, variable lighting, and high object density. Deep learning models address these challenges by delivering greater accuracy, adaptability, and robustness. This review categorizes object tracking techniques into traditional, statistical, feature-based, and machine learning paradigms, with a focus on biomedical applications. These methods are essential for tracking cells and subcellular structures, advancing our understanding of health and disease. Key performance metrics, including accuracy, efficiency, and adaptability, are discussed. The paper explores limitations of current methods and highlights emerging trends to guide the development of next-generation tracking systems for biomedical research and broader scientific domains.

[CV-125] LoyalDiffusion: A Diffusion Model Guarding Against Data Replication CVPR2025

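【Quick Read】: This paper addresses the privacy risk that diffusion models may replicate their training data during image generation, particularly when that data includes confidential information. The key to the solution is modifying the skip connections in the diffusion model's U-Net: the proposed replication-aware U-Net (RAU-Net) incorporates information transfer blocks into the skip connections that are less essential for image quality, reducing memorization of training data. The study further identifies the timesteps at which the impact on memorization is most pronounced and, by applying RAU-Net selectively at these timesteps with a targeted training and inference strategy, forms the LoyalDiffusion framework, which reduces replication while maintaining comparable image quality.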

Link: https://arxiv.org/abs/2412.01118
Authors: Chenghao Li, Yuke Zhang, Dake Chen, Jingqi Xu, Peter A. Beerel
Keywords: demonstrated significant potential, diffusion model, demonstrated significant, training data, model
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 13 pages, 6 figures, Submission to CVPR 2025

Abstract:Diffusion models have demonstrated significant potential in image generation. However, their ability to replicate training data presents a privacy risk, particularly when the training data includes confidential information. Existing mitigation strategies primarily focus on augmenting the training dataset, leaving the impact of diffusion model architecture under explored. In this paper, we address this gap by examining and mitigating the impact of the model structure, specifically the skip connections in the diffusion model’s U-Net model. We first present our observation on a trade-off in the skip connections. While they enhance image generation quality, they also reinforce the memorization of training data, increasing the risk of replication. To address this, we propose a replication-aware U-Net (RAU-Net) architecture that incorporates information transfer blocks into skip connections that are less essential for image quality. Recognizing the potential impact of RAU-Net on generation quality, we further investigate and identify specific timesteps during which the impact on memorization is most pronounced. By applying RAU-Net selectively at these critical timesteps, we couple our novel diffusion model with a targeted training and inference strategy, forming a framework we refer to as LoyalDiffusion. Extensive experiments demonstrate that LoyalDiffusion outperforms the state-of-the-art replication mitigation method achieving a 48.63% reduction in replication while maintaining comparable image quality.
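
The selective use of RAU-Net at critical timesteps amounts to a small dispatch rule inside the denoising loop. The sketch below shows only that rule; the range bounds are hypothetical placeholders (the abstract does not state which timesteps are critical), and the returned strings stand in for the two networks.

```python
def pick_denoiser(t, critical_range):
    """Use the replication-aware network only inside the critical timestep range."""
    start, end = critical_range
    return "rau_net" if start <= t <= end else "standard_unet"


critical_range = (300, 700)  # hypothetical bounds, not taken from the paper
schedule = [pick_denoiser(t, critical_range) for t in (900, 500, 100)]
```

Restricting the modified network to a subset of timesteps is how the framework limits the quality cost that the abstract attributes to altering the skip connections.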

[CV-126] Look Ma No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM

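[Quick Read]: This paper addresses the reliance on high-quality geometric ground truth when developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems. Such ground truth is costly and time-consuming to obtain and in many cases unobtainable, which restricts these systems across diverse environments and limits scalability to real-world scenarios. The key to the solution is a ground-truth-free (GTF) evaluation methodology that performs sensitivity estimation by sampling from both original and noisy versions of the input images, replacing traditional ground-truth-based benchmarking. The approach correlates strongly with traditional benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up many more dataset sources as well as self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to the one seen in generative AI.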

Link: https://arxiv.org/abs/2412.01116
Authors: Alejandro Fontan,Javier Civera,Tobias Fischer,Michael Milford
Keywords: Structure from Motion, Visual SLAM, geometric ground truth, high-quality geometric ground, ground truth
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth – a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.

[CV-127] DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

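[Quick Read]: This paper addresses the performance degradation of image captioning models on novel datasets, in particular their limited generalization in out-of-domain scenarios. The key to the solution is Dive Into Retrieval (DIR), which improves both image-to-text retrieval and the use of retrieved text through two innovations: (1) diffusion-guided retrieval enhancement, in which a pretrained diffusion model guides image feature learning by reconstructing noisy images, capturing more comprehensive and fine-grained visual information than standard annotated captions; and (2) a high-quality retrieval database that provides comprehensive semantic information to enhance caption generation, especially out of domain. Experiments show that DIR maintains competitive in-domain performance while significantly improving out-of-domain generalization, without increasing inference cost.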

Link: https://arxiv.org/abs/2412.01115
Authors: Hao Wu,Zhihang Zhong,Xiao Sun
Keywords: degradation when applied, trained on domain-specific, Image captioning models, domain-specific data, Image captioning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image captioning models often suffer from performance degradation when applied to novel datasets, as they are typically trained on domain-specific data. To enhance generalization in out-of-domain scenarios, retrieval-augmented approaches have garnered increasing attention. However, current methods face two key challenges: (1) image features used for retrieval are often optimized based on ground-truth (GT) captions, which represent the image from a specific perspective and are influenced by annotator biases, and (2) they underutilize the full potential of retrieved text, typically relying on raw captions or parsed objects, which fail to capture the full semantic richness of the data. In this paper, we propose Dive Into Retrieval (DIR), a method designed to enhance both the image-to-text retrieval process and the utilization of retrieved text to achieve a more comprehensive understanding of the visual content. Our approach introduces two key innovations: (1) diffusion-guided retrieval enhancement, where a pretrained diffusion model guides image feature learning by reconstructing noisy images, allowing the model to capture more comprehensive and fine-grained visual information beyond standard annotated captions; and (2) a high-quality retrieval database, which provides comprehensive semantic information to enhance caption generation, especially in out-of-domain scenarios. Extensive experiments demonstrate that DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.

[CV-128] One Shot One Talk: Whole-body Talking Avatar from a Single Image

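[Quick Read]: This paper addresses the construction of a whole-body talking avatar from a single image, focusing on complex dynamic modeling and on generalization to novel gestures and expressions. The key to the solution is twofold: (1) pose-guided image-to-video diffusion models generate imperfect video frames that serve as pseudo-labels, which enables seamless generalization; and (2) a tightly coupled 3DGS-mesh hybrid avatar representation, together with several key regularizations, mitigates the dynamic modeling difficulties caused by inconsistent and noisy pseudo-videos. With these components, the paper demonstrates photorealistic, precisely animatable, and expressive whole-body talking avatars created from a single image.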

Link: https://arxiv.org/abs/2412.01106
Authors: Jun Xiang,Yudong Guo,Leipeng Hu,Boyang Guo,Yancheng Yuan,Juyong Zhang
Keywords: lack precise control, Building realistic, monocular self-rotating videos, methods lack precise, requires minutes
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos, and most methods lack precise control over gestures and expressions. To push this boundary, we address the challenge of constructing a whole-body talking avatar from a single image. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. To achieve seamless generalization, we leverage recent pose-guided image-to-video diffusion models to generate imperfect video frames as pseudo-labels. To overcome the dynamic modeling challenge posed by inconsistent and noisy pseudo-videos, we introduce a tightly coupled 3DGS-mesh hybrid avatar representation and apply several key regularizations to mitigate inconsistencies caused by imperfect labels. Extensive experiments on diverse subjects demonstrate that our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.

[CV-129] Hiding Faces in Plain Sight: Defending DeepFakes by Disrupting Face Detection

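[Quick Read]: This paper asks how a proactive defense can prevent individuals from becoming victims of DeepFake videos. The key to the solution is a framework named FacePoison, which blocks DeepFake training or synthesis by sabotaging face detectors. Specifically, FacePoison adapts adversarial attacks with a dedicated design that disrupts face detectors so that the extracted faces are distorted or incorrect, which in turn disrupts the DeepFake generation process. The paper also proposes VideoFacePoison, a strategy that propagates FacePoison across video frames instead of applying it to each frame individually, reducing computational overhead while retaining attack performance. Experiments show that the method is effective across multiple face detectors and DeepFake models and can significantly hinder DeepFake video generation.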

Link: https://arxiv.org/abs/2412.01101
Authors: Delong Zhu,Yuezun Li,Baoyuan Wu,Jiaran Zhou,Zhibo Wang,Siwei Lyu
Keywords: sabotaging face detection, DeepFake defense framework, proactive DeepFake defense, face detectors, defense framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:This paper investigates the feasibility of a proactive DeepFake defense framework, FacePoison, to prevent individuals from becoming victims of DeepFake videos by sabotaging face detection. The motivation stems from the reliance of most DeepFake methods on face detectors to automatically extract victim faces from videos for training or synthesis (testing). Once the face detectors malfunction, the extracted faces will be distorted or incorrect, subsequently disrupting the training or synthesis of the DeepFake model. To achieve this, we adapt various adversarial attacks with a dedicated design for this purpose and thoroughly analyze their feasibility. Based on FacePoison, we introduce VideoFacePoison, a strategy that propagates FacePoison across video frames rather than applying it individually to each frame. This strategy can largely reduce the computational overhead while retaining the favorable attack performance. Our method is validated on five face detectors, and extensive experiments against eleven different DeepFake models demonstrate the effectiveness of disrupting face detectors to hinder DeepFake generation.

[CV-130] VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

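[Quick Read]: This paper addresses a limitation of existing video anomaly detection (VAD) methods built on vision-language models (VLMs): they require either extra reasoning modules or large amounts of annotated data. The key to the solution is VERA, a verbalized learning framework that decomposes the complex reasoning required for VAD into simple guiding questions targeting specific abnormal patterns and treats those questions as learnable parameters. VERA optimizes them with coarsely labeled training data through data-driven verbal interactions between a learner VLM and an optimizer VLM. At inference time, VERA embeds the learned questions into the model prompt to guide the VLM in producing segment-level anomaly scores, which are then refined into frame-level scores by fusing scene and temporal context. The approach significantly improves both detection performance and explainability for VAD without modifying model parameters.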

Link: https://arxiv.org/abs/2412.01095
Authors: Muchao Ye,Weiyang Liu,Pan He
Keywords: simultaneously detect anomalies, provide comprehendible explanations, VAD, rapid advancement, advancement of vision-language
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The rapid advancement of vision-language models (VLMs) has established a new paradigm in video anomaly detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations for the decisions. Existing work in this direction often assumes the complex reasoning required for VAD exceeds the capabilities of pretrained VLMs. Consequently, these approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal interactions between learner and optimizer VLMs, using coarsely labeled training data. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores, which are then refined into frame-level scores via the fusion of scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions of VERA are highly adaptable, significantly improving both detection performance and explainability of VLMs for VAD.

[CV-131] DuoCast: Duo-Probabilistic Meteorology-Aware Model for Extended Precipitation Nowcasting

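[Quick Read]: This paper addresses the declining precision of extended short-term precipitation nowcasting, which the authors attribute to insufficient use of meteorological knowledge such as weather fronts, which strongly influence precipitation intensity, duration, and spatial distribution. The key to the solution is DuoCast, a dual-probabilistic meteorology-aware model in which two diffusion models, PrecipFlow and MicroDynamic, handle broad weather evolution and micro-scale fluctuations respectively. PrecipFlow uses an Extreme Precipitation-Aware Encoder (EPA-Encoder), built from AirConvolution and FrontAttention blocks, to process general and extreme precipitation data; its output conditions a UNet-based diffusion model that produces prediction maps enriched with weather front information. MicroDynamic then refines the results to capture micro-scale variability. On four public benchmarks, DuoCast outperforms state-of-the-art methods.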

Link: https://arxiv.org/abs/2412.01091
Authors: Penghui Wen,Lei Bai,Mengwei He,Patrick Filippi,Feng Zhang,Thomas Francis Bishop,Zhiyong Wang,Kun Hu
Keywords: extended short-term precipitation, influence precipitation intensity, short-term precipitation nowcasting, precipitation nowcasting struggles, significantly influence precipitation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, extended short-term precipitation nowcasting struggles with decreasing precision because of insufficient consideration of meteorological knowledge, such as weather fronts which significantly influence precipitation intensity, duration, and spatial distribution. Therefore, in this paper, we present DuoCast, a novel dual-probabilistic meteorology-aware model designed to address both broad weather evolution and micro-scale fluctuations using two diffusion models, PrecipFlow and MicroDynamic, respectively. Our PrecipFlow model captures evolution trends through an Extreme Precipitation-Aware Encoder (EPA-Encoder), which includes AirConvolution and FrontAttention blocks to process two levels of precipitation data: general and extreme. The output conditions a UNet-based diffusion to produce prediction maps enriched with weather front information. The MicroDynamic model further refines the results to capture micro-scale variability. Extensive experiments on four public benchmarks demonstrate the effectiveness of our DuoCast, achieving superior performance over state-of-the-art methods. Our code is available at this https URL.

[CV-132] STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

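[Quick Read]: This paper addresses depth consistency in video monocular depth estimation, especially in scenes with dynamic or irregular motion. The key to the solution is a new model named STATIC, which learns temporal consistency independently for static and dynamic areas. Specifically, STATIC uses a difference mask computed from surface normals to separate static from dynamic areas, then strengthens temporal consistency with the Masked Static (MS) module for static areas and the Surface Normal Similarity (SNS) module for dynamic areas. A final refinement integrates the two independently learned results, yielding temporal consistency across the whole video sequence and state-of-the-art video depth estimation on the KITTI and NYUv2 datasets.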

Link: https://arxiv.org/abs/2412.01090
Authors: Sunghun Yang,Minhyeok Lee,Suhwan Cho,Jungho Lee,Sangyoun Lee
Keywords: monocular depth estimation, depth estimation, Video monocular depth, video depth estimation, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information like optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information.
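
The abstract does not spell out how the difference mask is computed, so the following is only a rough sketch of one plausible reading: compare per-pixel surface normals of two consecutive frames by cosine similarity and mark a pixel as dynamic when the direction changes too much. The function name `dynamic_mask` and the `threshold` value are hypothetical, not taken from the paper.

```python
import math


def dynamic_mask(normals_prev, normals_next, threshold=0.9):
    """Flag pixels whose surface normal direction changed between two frames.

    normals_prev / normals_next: lists of (x, y, z) normals, one per pixel.
    threshold: hypothetical cosine-similarity cutoff (an assumption, not the
    paper's value); similarity below it marks the pixel as dynamic.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0 or norm_b == 0:
            return 0.0  # treat undefined normals as maximally different
        return dot / (norm_a * norm_b)

    return [cosine(a, b) < threshold for a, b in zip(normals_prev, normals_next)]


prev_frame = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
next_frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
mask = dynamic_mask(prev_frame, next_frame)  # second pixel rotated by 90 degrees
```

Pixels with `False` would then be routed to the static branch and pixels with `True` to the dynamic branch.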

[CV-133] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

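[Quick Read]: This paper addresses two challenges of diffusion-based portrait animation: temporally consistent video generation and fast sampling. The key to the solution is FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. FLOAT moves generative modeling from a pixel-based latent space to a learned motion latent space, which enables efficient design of temporally consistent motion. To do so, the paper introduces a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. The method also supports speech-driven emotion enhancement, allowing expressive motion to be incorporated naturally. In extensive experiments, FLOAT outperforms state-of-the-art audio-driven talking portrait methods in visual quality, motion fidelity, and efficiency.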

Link: https://arxiv.org/abs/2412.01064
Authors: Taekyung Ki,Dongchan Min,Gyoungsu Chae
Keywords: achieved remarkable results, portrait image animation, remarkable results, diffusion-based generative models, rapid advancement
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Project page: this https URL

Abstract:With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
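
To make the sampling side of flow matching concrete, here is a minimal sketch of generating a sample by integrating a vector field from t = 0 to t = 1 with fixed-step Euler updates. In FLOAT the field is predicted by a transformer in a motion latent space; the `vector_field` callable below is a hypothetical stand-in, and `euler_sample` is an illustrative name, not the paper's API.

```python
def euler_sample(x0, vector_field, steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps.

    x0: starting point (e.g. a noise sample) as a list of floats.
    vector_field: hypothetical placeholder for a learned velocity predictor;
        it takes (x, t) and returns a list of the same length as x.
    steps: number of Euler steps; fewer steps means faster sampling.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        velocity = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Toy field with constant velocity: every point moves by (2, -1) over unit time.
sample = euler_sample([0.0, 1.0], lambda x, t: [2.0, -1.0], steps=4)
```

With a constant field the Euler result is exact regardless of the step count, which is the intuition behind preferring straight transport paths when fast sampling matters.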

[CV-134] Classifying Simulated Gait Impairments using Privacy-preserving Explainable Artificial Intelligence and Mobile Phone Videos ALT

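[Quick Read]: This paper addresses the limitations of subjective or costly assessment methods in diagnosing gait impairments by proposing a mobile phone-based, privacy-preserving artificial intelligence (AI) system for classifying them. The key to the solution is using frontal and sagittal videos recorded with standard mobile phone cameras, together with frequency-domain features and entropy measures. Model feature importance analysis shows that lower limb keypoints matter most for classification. The system classifies seven distinct gait patterns with 86.5% accuracy, preserves patient privacy through on-device processing, and shows potential as an accessible, objective gait assessment tool for clinical, community, and tele-rehabilitation settings.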

Link: https://arxiv.org/abs/2412.01056
Authors: Lauhitya Reddy,Ketan Anand,Shoibolina Kaushik,Corey Rodrigo,J. Lucas McKay,Trisha M. Kesar,Hyeokhyen Kwon
Keywords: current solutions requiring, expensive multi-camera equipment, Accurate diagnosis, costly assessment methods, subjective clinical observation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 4 Figures, 4 Tables, Submitted to PLOS Digital Health

Abstract:Accurate diagnosis of gait impairments is often hindered by subjective or costly assessment methods, with current solutions requiring either expensive multi-camera equipment or relying on subjective clinical observation. There is a critical need for accessible, objective tools that can aid in gait assessment while preserving patient privacy. In this work, we present a mobile phone-based, privacy-preserving artificial intelligence (AI) system for classifying gait impairments and introduce a novel dataset of 743 videos capturing seven distinct gait patterns. The dataset consists of frontal and sagittal views of trained subjects simulating normal gait and six types of pathological gait (circumduction, Trendelenburg, antalgic, crouch, Parkinsonian, and vaulting), recorded using standard mobile phone cameras. Our system achieved 86.5% accuracy using combined frontal and sagittal views, with sagittal views generally outperforming frontal views except for specific gait patterns like Circumduction. Model feature importance analysis revealed that frequency-domain features and entropy measures were critical for classification performance; in particular, lower limb keypoints proved most important for classification, aligning with clinical understanding of gait assessment. These findings demonstrate that mobile phone-based systems can effectively classify diverse gait patterns while preserving privacy through on-device processing. The high accuracy achieved using simulated gait data suggests their potential for rapid prototyping of gait analysis systems, though clinical validation with patient data remains necessary. This work represents a significant step toward accessible, objective gait assessment tools for clinical, community, and tele-rehabilitation settings.
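
As an illustration of what a combined frequency-domain and entropy feature can look like, the sketch below computes the spectral entropy of a one-dimensional keypoint trajectory (for example, the vertical position of an ankle over time). This is an assumed example feature, not the paper's documented feature set; `power_spectrum` and `spectral_entropy` are hypothetical helper names.

```python
import cmath
import math


def power_spectrum(signal):
    """One-sided power spectrum of a real signal via a direct DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(signal):
    """Shannon entropy (in bits) of the normalized power spectrum.

    Low values mean the energy sits in few frequencies (regular motion);
    high values mean it is spread across many frequencies.
    """
    power = power_spectrum(signal)
    total = sum(power)
    if total == 0:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log2(p) for p in probs)


flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])     # energy only at frequency 0
impulse = spectral_entropy([1.0, 0.0, 0.0, 0.0])  # energy spread evenly
```

A direct DFT is quadratic in the signal length, which is acceptable for short clips; a real pipeline would more likely use an FFT routine.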

[CV-135] CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

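[Quick Read]: This paper addresses the estimation of object pose and shape from an RGB-D image. The key to the solution is CRISP, a category-agnostic object pose and shape estimation pipeline. CRISP uses an encoder-decoder model for shape estimation, with FiLM conditioning for implicit shape reconstruction, and a DPT-based network that estimates pose-normalized points for pose estimation. The paper also proposes an optimization-based pose and shape corrector that fixes estimation errors caused by a domain gap. By approximating the shape decoder with an active shape model, shape correction reduces to a constrained linear least squares problem that an interior point algorithm solves efficiently. Finally, the paper introduces a self-training pipeline for self-supervised domain adaptation based on a correct-and-certify approach: the corrector generates pseudo-labels at test time, and these are used to self-train CRISP. Experiments show that CRISP performs well on several datasets, bridges a large domain gap, and generalizes to unseen objects.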

Link: https://arxiv.org/abs/2412.01052
Authors: Jingnan Shi,Rajat Talak,Harry Zhang,David Jin,Luca Carlone
Keywords: shape, RGB-D image, CRISP, object pose, pose
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects. Code and pre-trained models will be available on this https URL.
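
For readers unfamiliar with FiLM conditioning, the core operation is a per-channel scale and shift of a feature vector. The sketch below shows only that operation; in a pipeline like CRISP the `gamma` and `beta` values would be produced by a network from the conditioning input, whereas here they are passed in directly as an assumption, and `film` is an illustrative name.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta for each channel.

    gamma and beta are given directly here; in practice they would be
    predicted from a conditioning signal (an assumption for this sketch).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

Setting `gamma` to ones and `beta` to zeros leaves the features unchanged, which makes the conditioning easy to ablate.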

[CV-136] Cerberus: Attribute-based person re-identification using semantic IDs

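[Quick Read]: This paper addresses attribute-based person re-identification (reID): using person attribute labels to learn local and global person representations that encode specific traits such as gender and clothing style. The key to the solution is defining semantic IDs (SIDs) by combining attribute labels and using a semantic guidance loss to align person representations with the prototypical features of the corresponding SIDs, which encourages the representations to encode the relevant semantics. At the same time, representations of the same person are forced to be embedded closely, so that subtle appearance differences can separate persons sharing the same attribute labels. A regularization method that exploits relationships between SID prototypes improves generalization to unseen data. The framework performs attribute-based reID by comparing local and global person representations between query and gallery images, and it can also perform person attribute recognition (PAR) and attribute-based person search (APS).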

Link: https://arxiv.org/abs/2412.01048
Authors: Chanho Eom,Geon Lee,Kyunghwan Cho,Hyeonseok Jung,Moonsub Jin,Bumsub Ham
Keywords: dubbed Cerberus, attribute-based person re-identification, person, global person representations, attribute labels
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a new framework, dubbed Cerberus, for attribute-based person re-identification (reID). Our approach leverages person attribute labels to learn local and global person representations that encode specific traits, such as gender and clothing style. To achieve this, we define semantic IDs (SIDs) by combining attribute labels, and use a semantic guidance loss to align the person representations with the prototypical features of corresponding SIDs, encouraging the representations to encode the relevant semantics. Simultaneously, we enforce the representations of the same person to be embedded closely, enabling recognizing subtle differences in appearance to discriminate persons sharing the same attribute labels. To increase the generalization ability on unseen data, we also propose a regularization method that takes advantage of the relationships between SID prototypes. Our framework performs individual comparisons of local and global person representations between query and gallery images for attribute-based reID. By exploiting the SID prototypes aligned with the corresponding representations, it can also perform person attribute recognition (PAR) and attribute-based person search (APS) without bells and whistles. Experimental results on standard benchmarks on attribute-based person reID, Market-1501 and DukeMTMC, demonstrate the superiority of our model compared to the state of the art.

[CV-137] Improving Detail in Pluralistic Image Inpainting with Feature Dequantization

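[Quick Read]: This paper addresses the loss of detail quality that feature quantization causes in PUT, a VQGAN-based image inpainting model. The key to the solution is the Feature Dequantization Module (FDM), which restores detail quality by compensating for the information lost to feature quantization. The paper also develops an efficient training method for FDM that drastically reduces training cost, so that detail quality improves with negligible training and inference overhead.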

Link: https://arxiv.org/abs/2412.01046
Authors: Kyungri Park,Woohwan Jung
Keywords: Pluralistic Image Inpainting, offers multiple plausible, multiple plausible solutions, restoring missing parts, applications including image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pluralistic Image Inpainting (PII) offers multiple plausible solutions for restoring missing parts of images and has been successfully applied to various applications including image editing and object removal. Recently, VQGAN-based methods have been proposed and have shown that they significantly improve the structural integrity in the generated images. Nevertheless, the state-of-the-art VQGAN-based model PUT faces a critical challenge: degradation of detail quality in output images due to feature quantization. Feature quantization restricts the latent space and causes information loss, which negatively affects the detail quality essential for image inpainting. To tackle the problem, we propose the FDM (Feature Dequantization Module) specifically designed to restore the detail quality of images by compensating for the information loss. Furthermore, we develop an efficient training method for FDM which drastically reduces training costs. We empirically demonstrate that our method significantly enhances the detail quality of the generated images with negligible training and inference overheads.
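
The information loss the abstract refers to can be seen in a few lines: vector quantization replaces each continuous feature with its nearest codebook entry, and the difference is discarded. The sketch below shows that loss only; it is not the FDM, whose design the abstract does not describe. The name `quantize_feature` and the tiny codebook are hypothetical.

```python
def quantize_feature(feature, codebook):
    """Snap a feature vector to its nearest codebook entry.

    Returns (index, quantized, residual), where residual = feature - quantized
    is the detail that quantization throws away.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(feature, codebook[i]),
    )
    quantized = codebook[index]
    residual = [x - y for x, y in zip(feature, quantized)]
    return index, quantized, residual


# Hypothetical two-entry codebook, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0]]
index, quantized, residual = quantize_feature([0.75, 1.0], codebook)
```

A dequantization step would aim to predict something like `residual` from the quantized features and their context, so that fine detail can be restored before decoding.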

[CV-138] Quantization-Aware Imitation-Learning for Resource-Efficient Robotic Control

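[Quick Read]: This paper addresses the sharply rising computational cost of deep neural network (DNN)-based policy models, such as vision-language-action (VLA) models, as they scale up for automating complex decision-making. Deploying these models efficiently on resource-limited hardware is a challenge in fields like robot manipulation and autonomous driving, which require quick and accurate responses. The key to the solution is a new quantization framework that fine-tunes parameters during training to make the model robust to low-bit precision errors, preserving efficiency and reliability under constrained conditions. With 4-bit weight quantization on a real edge GPU, the framework achieves up to 2.5x speedup and 2.5x energy savings while preserving accuracy; for self-driving models with 4-bit weight and activation quantization, it achieves up to 3.7x speedup and 3.1x energy savings.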

Link: https://arxiv.org/abs/2412.01034
Authors: Seongmin Park,Hyungmin Kim,Wonseok Jeon,Juyoung Yang,Byeongwook Jeon,Yoonseon Oh,Jungwook Choi
Keywords: Deep neural network, interpreting multi-modal data, automating complex decision-making, Deep neural, based policy models
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep neural network (DNN)-based policy models like vision-language-action (VLA) models are transformative in automating complex decision-making across applications by interpreting multi-modal data. However, scaling these models greatly increases computational costs, which presents challenges in fields like robot manipulation and autonomous driving that require quick, accurate responses. To address the need for deployment on resource-limited hardware, we propose a new quantization framework for IL-based policy models that fine-tunes parameters to enhance robustness against low-bit precision errors during training, thereby maintaining efficiency and reliability under constrained conditions. Our evaluations with representative robot manipulation for 4-bit weight-quantization on a real edge GPU demonstrate that our framework achieves up to 2.5x speedup and 2.5x energy savings while preserving accuracy. For 4-bit weight and activation quantized self-driving models, the framework achieves up to 3.7x speedup and 3.1x energy saving on a low-end GPU. These results highlight the practical potential of deploying IL-based policy models on resource-constrained devices.
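
To show where "low-bit precision errors" come from, here is a minimal sketch of symmetric 4-bit weight quantization: each weight is mapped to an integer code in [-8, 7] and later scaled back. The per-tensor scale rule (largest magnitude maps to code 7) is a common convention assumed for illustration; the paper's actual quantizer and fine-tuning procedure are not described in the abstract.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes with one shared scale.

    Assumed convention for this sketch: the largest magnitude maps to code 7.
    Returns (codes, scale).
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0] * len(weights), 1.0
    scale = max_abs / 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [7.0, -3.2, 1.4, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

The entries of `errors` are the rounding errors a policy would see at deployment; quantization-aware fine-tuning exposes the model to such errors during training so that its actions stay stable.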

[CV-139] Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

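[Quick Read]: This paper addresses linguistic ambiguity in text-guided image manipulation, especially for operations that are underrepresented in the training set or hard to describe in language alone. The key to the solution is InstaManip, a novel multi-modal autoregressive model that uses in-context learning to instantly learn a new image manipulation from textual and visual guidance and apply it to new query images. Specifically, the paper proposes a group self-attention mechanism that splits in-context learning into two separate stages, learning and applying, which simplifies the problem. It also introduces a relation regularization method that further disentangles image transformation features from irrelevant content in the exemplar images. Experiments show that the method surpasses previous few-shot image manipulation models by a notable margin (at least 19% in human evaluation) and improves further as the number or diversity of exemplar images increases.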

Link: https://arxiv.org/abs/2412.01027
Authors: Bolin Lai,Felix Juefei-Xu,Miao Liu,Xiaoliang Dai,Nikhil Mehta,Chenguang Zhu,Zeyi Huang,James M. Rehg,Sangmin Lee,Ning Zhang,Tong Xiao
Keywords: Text-guided image manipulation, experienced notable advancement, recent years, Text-guided image, advancement in recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 16 figures, 5 tables

Abstract:Text-guided image manipulation has experienced notable advancement in recent years. In order to mitigate linguistic ambiguity, few-shot learning with visual examples has been applied for instructions that are underrepresented in the training set, or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models are struggling with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages, learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant contents in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (≥19% in human evaluation). We also find our model can be further boosted by increasing the number or diversity of exemplar images.

[CV-140] Learning Structured Representations with Hyperbolic Embeddings NEURIPS’24

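[Quick Read]: This paper addresses the fact that existing representation learning methods ignore label hierarchy, and that Euclidean distance may distort the underlying semantic context when natural hierarchical relationships are modeled. The key to the solution is HypStructure, a new approach that exploits the strength of hyperbolic spaces in modeling hierarchies: it combines a hyperbolic tree-based representation loss with a centering loss to embed the label hierarchy accurately into the learned representations. The regularizer can be combined with any standard task loss, and it reduces distortion and improves generalization, especially in low-dimensional settings.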

Link: https://arxiv.org/abs/2412.01023
Authors: Aditya Sinha,Siqi Zeng,Makoto Yamada,Han Zhao
Keywords: inherent label structure, real-world datasets consist, constructed cheaply, real-world datasets, natural hierarchy
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at NeurIPS '24, first two authors contributed equally to the work. 40 pages, 23 figures

Abstract:Most real-world datasets consist of a natural hierarchy between classes or an inherent label structure that is either already available or can be constructed cheaply. However, most existing representation learning methods ignore this hierarchy, treating labels as permutation invariant. Recent work [Zeng et al., 2022] proposes using this structured information explicitly, but the use of Euclidean distance may distort the underlying semantic context [Chen et al., 2013]. In this work, motivated by the advantage of hyperbolic spaces in modeling hierarchical relationships, we propose a novel approach HypStructure: a Hyperbolic Structured regularization approach to accurately embed the label hierarchy into the learned representations. HypStructure is a simple-yet-effective regularizer that consists of a hyperbolic tree-based representation loss along with a centering loss, and can be combined with any standard task loss to learn hierarchy-informed features. Extensive experiments on several large-scale vision benchmarks demonstrate the efficacy of HypStructure in reducing distortion and boosting generalization performance especially under low dimensional scenarios. For a better understanding of structured representation, we perform eigenvalue analysis that links the representation geometry to improved Out-of-Distribution (OOD) detection performance seen empirically. The code is available at this https URL.
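
As background on why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball model with curvature -1. Distances grow rapidly near the boundary of the ball, which leaves room for tree-like branching. Whether HypStructure uses this particular model is an assumption; the abstract only says "hyperbolic". The function name is illustrative.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    norm_u = sum(x * x for x in u)
    norm_v = sum(x * x for x in v)
    if norm_u >= 1 or norm_v >= 1:
        raise ValueError("points must lie strictly inside the unit ball")
    squared_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1 + 2 * squared_diff / ((1 - norm_u) * (1 - norm_v)))


origin = [0.0, 0.0]
near = poincare_distance(origin, [0.5, 0.0])  # equals ln(3)
far = poincare_distance(origin, [0.9, 0.0])   # grows quickly near the boundary
```

A hierarchy-aware loss can then ask that distances between class embeddings track distances between the corresponding nodes in the label tree.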

[CV-141] Adaptive Rank Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA

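[Quick Read]: This paper asks how, during continual learning (CL), the knowledge of pre-trained vision-language models (VLMs) can be retained or even enhanced while absorbing knowledge from a new data stream, without relying on additional reference data or complex distribution prediction. The key to the solution is Dynamic Rank-Selective Low Rank Adaptation (LoRA), which adaptively assigns ranks to LoRA modules according to their relevance to the current data, retaining pre-trained knowledge while continually improving adaptation to new tasks. The approach needs no explicit domain or distribution prediction and no additional reference data, integrates new tasks seamlessly, keeps the original architecture and deployment pipeline of the pre-trained model, and adds no inference overhead.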

Link: https://arxiv.org/abs/2412.01004
Authors: Haodong Lu,Chongyang Zhao,Jason Xue,Lina Yao,Kristen Moore,Dong Gong
Keywords: continual learning, enhanced during continual, pre-trained, knowledge, CLIP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:We investigate whether the pre-trained knowledge of vision-language models (VLMs), such as CLIP, can be retained or even enhanced during continual learning (CL) while absorbing knowledge from a data stream. Existing methods often rely on additional reference data, isolated components for distribution or domain predictions, leading to high training costs, increased inference complexity, and limited improvement potential for pre-trained models. To address these challenges, we first comprehensively analyze the effects of parameter update locations and ranks on downstream adaptation and knowledge retention. Based on these insights, we propose Dynamic Rank-Selective Low Rank Adaptation (LoRA), a universal and efficient CL approach that adaptively assigns ranks to LoRA modules based on their relevance to the current data. Unlike prior methods, our approach continually enhances the pre-trained VLM by retaining both the pre-trained knowledge and the knowledge acquired during CL. Our approach eliminates the need for explicit domain or distribution prediction and additional reference data, enabling seamless integration of new tasks while preserving pre-trained capabilities. It also maintains the original architecture and deployment pipeline of the pre-trained model without incurring any additional inference overhead. Extensive experiments and analyses demonstrate that our method outperforms state-of-the-art approaches in continually absorbing knowledge of downstream tasks while retaining pre-trained knowledge.
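
The claim of no extra inference overhead follows from how LoRA updates can be merged into the base weights. The sketch below merges a low-rank update W + B·diag(m)·A, where the 0/1 mask `m` switches individual ranks on or off. The mask is a hypothetical stand-in for the paper's relevance-based rank selection, which the abstract does not detail; `merge_lora` and `matmul` are illustrative names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, b, a, rank_mask):
    """Return W + B @ diag(rank_mask) @ A.

    w: base weight (d_out x d_in); b: (d_out x r); a: (r x d_in).
    rank_mask: one 0/1 entry per rank; a hypothetical stand-in for
    learned rank selection.
    """
    # Zero out the rows of A that belong to switched-off ranks.
    a_masked = [[m * value for value in row] for m, row in zip(rank_mask, a)]
    delta = matmul(b, a_masked)
    return [
        [w[i][j] + delta[i][j] for j in range(len(w[0]))] for i in range(len(w))
    ]


w = [[1, 0], [0, 1]]
b = [[1, 2], [3, 4]]
a = [[1, 1], [10, 10]]
merged = merge_lora(w, b, a, rank_mask=[1, 0])  # only the first rank is active
```

Because the result has the same shape as `w`, the adapted model runs with the original architecture; a fully zero mask returns the pre-trained weights unchanged.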

[CV-142] Token Cropr: Faster ViTs for Quite a Few Tasks

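Quick read: This paper addresses how to raise the inference throughput of Vision Transformers (ViTs) in resource-constrained applications. The key to the solution is a token pruning method built on auxiliary prediction heads, which learn end-to-end to select task-relevant tokens. After training, the auxiliary heads can be removed, giving throughput close to that of a random pruner. Across image classification, semantic segmentation, object detection, and instance segmentation, the method achieves speedups of 1.5x to 4x with small drops in performance. On the ADE20k semantic segmentation benchmark it is 2x faster than the no-pruning baseline at a cost of only 0.1 mIoU (median across 5 seeds).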

Link: https://arxiv.org/abs/2412.00965
Authors: Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
Keywords: resource-constrained applications necessitates, applications necessitates improvements, Vision Transformers, resource-constrained applications, applications necessitates
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 11 figures

Abstract:The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
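
As a rough illustration of score-based token pruning, the sketch below keeps the highest-scoring tokens and discards the rest. It assumes per-token relevance scores are already available. How the paper computes them is not described in the abstract, and `prune_tokens` is a hypothetical helper.

```python
import numpy as np

def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    top = np.argsort(scores)[-keep:]   # indices of the largest scores
    top = np.sort(top)                 # restore sequence order
    return tokens[top], top

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 16))     # 10 tokens, 16 dimensions each
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6, 0.4, 0.0])
kept, idx = prune_tokens(tokens, scores, keep=4)
```

Later layers then process only `kept`, which is where the throughput gain of any token pruner comes from.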

[CV-143] WAFFLE: Multimodal Floorplan Understanding in the Wild

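Quick read: This paper addresses two gaps: work on computational building understanding has largely neglected the floorplan, a building's core structural element, and existing floorplan datasets are narrow, typically covering a single semantic category and region. The key to the solution is WAFFLE, a new multimodal floorplan understanding dataset of nearly 20,000 floorplan images with metadata, spanning diverse building types, locations, and data formats. Using a large language model and multimodal foundation models, the authors extract and curate semantic information from the images and their noisy accompanying metadata. WAFFLE supports new building understanding tasks, both discriminative and generative, that were not feasible with earlier datasets. The authors plan to release WAFFLE publicly along with code and trained models, giving the research community a new foundation for learning building semantics.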

Link: https://arxiv.org/abs/2412.00955
Authors: Keren Ganon, Morris Alper, Rachel Mikulinsky, Hadar Averbuch-Elor
Keywords: central feature, feature of human, human culture, increasingly being analyzed, computational methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Buildings are a central feature of human culture and are increasingly being analyzed with computational methods. However, recent works on computational building understanding have largely focused on natural imagery of buildings, neglecting the fundamental element defining a building’s structure – its floorplan. Conversely, existing works on floorplan understanding are extremely limited in scope, often focusing on floorplans of a single semantic category and region (e.g. floorplans of apartments from a single country). In this work, we introduce WAFFLE, a novel multimodal floorplan understanding dataset of nearly 20K floorplan images and metadata curated from Internet data spanning diverse building types, locations, and data formats. By using a large language model and multimodal foundation models, we curate and extract semantic information from these images and their accompanying noisy metadata. We show that WAFFLE enables progress on new building understanding tasks, both discriminative and generative, which were not feasible using prior datasets. We will publicly release WAFFLE along with our code and trained models, providing the research community with a new foundation for learning the semantics of buildings.

[CV-144] ESCAPE: Equivariant Shape Completion via Anchor Point Encoding

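Quick read: This paper addresses shape completion in 3D computer vision, in particular the poor performance of existing methods under varying rotations, since they expect a known pose or canonical coordinates. The key to the solution is ESCAPE (Equivariant Shape Completion via Anchor Point Encoding), a new framework that selects anchor points from the shape and represents every point by its distances to all anchors, which yields rotation-equivariant shape completion. ESCAPE uses a transformer architecture to encode and decode these distance representations so that completions stay accurate and equivariant under rotation, and then recovers the predicted shape from the encodings through optimization. Experiments show robust, high-quality reconstruction under arbitrary rotations and translations, without an additional pose estimation module.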

Link: https://arxiv.org/abs/2412.00952
Authors: Burak Bekci, Nassir Navab, Federico Tombari, Mahdi Saleh
Keywords: partially observed objects, computer vision, involves predicting, crucial task, predicting and filling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Shape completion, a crucial task in 3D computer vision, involves predicting and filling the missing regions of scanned or partially observed objects. Current methods expect known pose or canonical coordinates and do not perform well under varying rotations, limiting their real-world applicability. We introduce ESCAPE (Equivariant Shape Completion via Anchor Point Encoding), a novel framework designed to achieve rotation-equivariant shape completion. Our approach employs a distinctive encoding strategy by selecting anchor points from a shape and representing all points as a distance to all anchor points. This enables the model to capture a consistent, rotation-equivariant understanding of the object’s geometry. ESCAPE leverages a transformer architecture to encode and decode the distance transformations, ensuring that generated shape completions remain accurate and equivariant under rotational transformations. Subsequently, we perform optimization to calculate the predicted shapes from the encodings. Experimental evaluations demonstrate that ESCAPE achieves robust, high-quality reconstructions across arbitrary rotations and translations, showcasing its effectiveness in real-world applications without additional pose estimation modules.
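
The reason an anchor-distance encoding is unaffected by pose can be checked numerically: Euclidean distances between points do not change under a rigid motion. The sketch below is only an illustration of that property. Picking anchors by fixed index is an assumption made here for simplicity, and `anchor_distance_encoding` is a hypothetical name. The paper's anchor selection is not specified in the abstract.

```python
import numpy as np

def anchor_distance_encoding(points, anchor_idx):
    """Encode each point as its Euclidean distances to the chosen anchor points."""
    anchors = points[anchor_idx]                        # (K, 3)
    diff = points[:, None, :] - anchors[None, :, :]     # (N, K, 3)
    return np.linalg.norm(diff, axis=-1)                # (N, K)

def random_rotation(rng):
    """Random proper rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # ensure determinant +1
    return q

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))
anchor_idx = np.array([0, 10, 20, 30])   # hypothetical anchor choice
R = random_rotation(rng)
t = np.array([0.5, -1.0, 2.0])

enc = anchor_distance_encoding(points, anchor_idx)
enc_moved = anchor_distance_encoding(points @ R.T + t, anchor_idx)
```

The two encodings agree up to floating-point error, so a network that consumes `enc` sees the same input regardless of the object's pose.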

[CV-145] FIction: 4D Future Interaction Prediction from Video

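Quick read: This paper addresses predicting a person's future interactions with objects in an environment from video, in particular predicting the 'where' and 'how' of an interaction in 3D space rather than only the 'what' in 2D video frames. The key to the solution is a new model, FIction, which fuses past video observations of the person's actions with information about their environment to predict the 'where' and 'how' of future interactions. In extensive experiments on the Ego-Exo4D dataset, the method substantially outperforms prior autoregressive and (lifted) 2D video models, with relative gains of more than 30%.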

Link: https://arxiv.org/abs/2412.00932
Authors: Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman
Keywords: frames-capturing physically ungrounded, physically ungrounded predictions, video frames-capturing physically, existing methods, methods are limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report

Abstract:Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of ‘what’ and ignoring the ‘where’ and ‘how’. We introduce 4D future interaction prediction from videos. Given an input video of a human activity, the goal is to predict what objects at what 3D locations the person will interact with in the next time period (e.g., cabinet, fridge), and how they will execute that interaction (e.g., poses for bending, reaching, pulling). We propose a novel model FIction that fuses the past video observation of the person’s actions and their environment to predict both the ‘where’ and ‘how’ of future interactions. Through comprehensive experiments on a variety of activities and real-world environments in Ego-Exo4D, we show that our proposed approach outperforms prior autoregressive and (lifted) 2D video models substantially, with more than 30% relative gains.

[CV-146] VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

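Quick read: This paper addresses the difficulty current large multimodal models (LMMs) have in processing and understanding long-duration or high-resolution videos, which stems mainly from a lack of high-quality datasets. The key to the solution is VISTA, a video spatiotemporal augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA combines videos spatially and temporally to create new synthetic videos and then generates question-answer pairs about them. On this basis the authors develop seven video augmentation methods and build the VISTA-400K dataset, aimed at improving long-duration and high-resolution video understanding. Fine-tuning several video LMMs on this data improves performance by an average of 3.3% across four long-video understanding benchmarks and by 6.5% on HRVideoBench, which the authors introduce as the first comprehensive high-resolution video understanding benchmark.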

Link: https://arxiv.org/abs/2412.00927
Authors: Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
Keywords: Current large multimodal, face significant challenges, Current large, large multimodal models, Video Spatiotemporal Augmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
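
To make "spatially and temporally combines videos" concrete, here is a small sketch of two possible combinations on arrays of shape (frames, height, width, channels). The abstract mentions seven augmentation methods but does not list them, so the two helpers below (`concat_temporal`, `grid_spatial`) are illustrative assumptions rather than the paper's methods.

```python
import numpy as np

def concat_temporal(clips):
    """Join clips of equal frame size along the time axis to get a longer video."""
    return np.concatenate(clips, axis=0)

def grid_spatial(clips):
    """Tile four clips of equal shape (T, H, W, C) into a 2x2 grid of higher resolution."""
    top = np.concatenate(clips[:2], axis=2)        # side by side along width
    bottom = np.concatenate(clips[2:4], axis=2)
    return np.concatenate([top, bottom], axis=1)   # stacked along height

# Four dummy clips, each filled with its own index so the layout is easy to inspect.
clips = [np.full((8, 32, 32, 3), i, dtype=np.uint8) for i in range(4)]
long_video = concat_temporal(clips)
hires_video = grid_spatial(clips)
```

A question about such a synthetic video can then target one source clip (for example, a specific time span or grid cell), which is one plausible way to obtain question-answer pairs from existing captions.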

[CV-147] Ref-GS: Directional Factorization for 2D Gaussian Splatting

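Quick read: This paper addresses directional light factorization in 2D Gaussian splatting, aiming at photorealistic view-dependent appearance rendering and precise geometry recovery. The key to the solution is Ref-GS, which builds on deferred rendering of Gaussian splatting and applies directional encoding to the deferred-rendered surface, effectively reducing the ambiguity between orientation and viewing angle. The paper also introduces a spherical Mip-grid that captures varying levels of surface roughness, enabling roughness-aware Gaussian shading. Finally, a geometry-lighting factorization connects geometry and lighting through the vector outer product, significantly reducing renderer overhead when integrating volumetric attributes. Together these give superior photorealistic rendering on open-world scenes along with accurate geometry recovery.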

Link: https://arxiv.org/abs/2412.00905
Authors: Youjia Zhang, Anpei Chen, Yumin Wan, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang
Keywords: precise geometry recovery, view-dependent appearance rendering, enables photorealistic view-dependent, photorealistic view-dependent appearance, Gaussian splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:In this paper, we introduce Ref-GS, a novel approach for directional light factorization in 2D Gaussian splatting, which enables photorealistic view-dependent appearance rendering and precise geometry recovery. Ref-GS builds upon the deferred rendering of Gaussian splatting and applies directional encoding to the deferred-rendered surface, effectively reducing the ambiguity between orientation and viewing angle. Next, we introduce a spherical Mip-grid to capture varying levels of surface roughness, enabling roughness-aware Gaussian shading. Additionally, we propose a simple yet efficient geometry-lighting factorization that connects geometry and lighting via the vector outer product, significantly reducing renderer overhead when integrating volumetric attributes. Our method achieves superior photorealistic rendering for a range of open-world scenes while also accurately recovering geometry.

[CV-148] Tomographic SAR Reconstruction for Forest Height Estimation

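Quick read: This paper addresses forest canopy height estimation at global scale, where traditional methods such as photogrammetry and LiDAR are costly and hard to deploy widely. The key to the solution is to estimate canopy height with deep learning directly from 2D Single Look Complex (SLC) images derived from Synthetic Aperture Radar (SAR), bypassing traditional tomographic signal processing and potentially reducing the latency from SAR capture to end product. The paper also examines how the number of SLC images affects height estimation accuracy, to inform future satellite data collection strategies. The approach has a 16-21% higher error than the full pipeline that combines tomographic processing with deep learning, which the authors take as evidence that geometric signal processing remains relevant.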

Link: https://arxiv.org/abs/2412.00903
Authors: Grace Colverd, Jumpei Takami, Laura Schade, Karol Bot, Joseph A. Gallego-Mejia
Keywords: Tree height estimation, Synthetic Aperture Radar, Tree height, important proxy, proxy for biomass
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tree height estimation serves as an important proxy for biomass estimation in ecological and forestry applications. While traditional methods such as photogrammetry and Light Detection and Ranging (LiDAR) offer accurate height measurements, their application on a global scale is often cost-prohibitive and logistically challenging. In contrast, remote sensing techniques, particularly 3D tomographic reconstruction from Synthetic Aperture Radar (SAR) imagery, provide a scalable solution for global height estimation. SAR images have been used in earth observation contexts due to their ability to work in all weathers, unobscured by clouds. In this study, we use deep learning to estimate forest canopy height directly from 2D Single Look Complex (SLC) images, a derivative of SAR. Our method attempts to bypass traditional tomographic signal processing, potentially reducing latency from SAR capture to end product. We also quantify the impact of varying numbers of SLC images on height estimation accuracy, aiming to inform future satellite operations and optimize data collection strategies. Compared to full tomographic processing combined with deep learning, our minimal method (partial processing + deep learning) falls short, with an error 16-21% higher, highlighting the continuing relevance of geometric signal processing.

[CV-149] Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

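Quick read: This paper addresses industrial anomaly detection (IAD), which matters for maintenance and quality control in manufacturing. The key to the solution is a new method, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which uses large vision-language models (LVLMs) to improve anomaly detection and localization in industrial settings. CLAD uses contrastive learning to align visual and textual features in a shared embedding space, so that normal instances cluster together while anomalies are pushed apart. The method performs well on both image-level anomaly detection and pixel-level anomaly localization, and its accurate localization improves interpretability, making it a promising option for real industrial applications.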

Link: https://arxiv.org/abs/2412.00890
Authors: Kun Qian, Tianyu Sun, Wenhong Wang
Keywords: anomaly detection, Industrial anomaly detection, plays a crucial, manufacturing processes, Vision-Language Anomaly Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:Industrial anomaly detection (IAD) plays a crucial role in the maintenance and quality control of manufacturing processes. In this paper, we propose a novel approach, Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD), which leverages large vision-language models (LVLMs) to improve both anomaly detection and localization in industrial settings. CLAD aligns visual and textual features into a shared embedding space using contrastive learning, ensuring that normal instances are grouped together while anomalies are pushed apart. Through extensive experiments on two benchmark industrial datasets, MVTec-AD and VisA, we demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization. Additionally, we provide ablation studies and human evaluation to validate the importance of key components in our method. Our approach not only achieves superior performance but also enhances interpretability by accurately localizing anomalies, making it a promising solution for real-world industrial applications.
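
The abstract says only that contrastive learning aligns visual and textual features in a shared space. As one common way to do this, the sketch below uses a symmetric InfoNCE loss of the kind popularized by CLIP. This choice, the temperature value, and the function names are assumptions, not the paper's stated objective.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: matching image/text pairs sit on the diagonal."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[diag, diag].mean()
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 32))
# Image features that nearly match their text give a low loss; unrelated ones do not.
aligned = contrastive_loss(txt + 0.01 * rng.normal(size=(4, 32)), txt)
unaligned = contrastive_loss(rng.normal(size=(4, 32)), txt)
```

Minimizing such a loss pulls matching pairs together and pushes mismatched pairs apart, which is the behavior the abstract describes for normal versus anomalous instances.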

[CV-150] SyncVIS: Synchronized Video Instance Segmentation

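Quick read: This paper addresses the difficulty existing DETR-based Video Instance Segmentation (VIS) methods have with complex and challenging video scenarios. The difficulty stems from asynchronous designs that model video sequences either through video-level queries alone or through query-sensitive cascade structures. The proposed solution is SyncVIS, a synchronized modeling framework. Its key points are: 1) explicitly introducing video-level query embeddings; 2) two key modules, a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes mutual learning between frame-level and video-level embeddings, and the latter divides long video sequences into small clips to ease optimization. Experiments show that SyncVIS achieves state-of-the-art results on several benchmarks, demonstrating its effectiveness and generality.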

Link: https://arxiv.org/abs/2412.00882
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
Keywords: Video Instance Segmentation, Recent DETR-based methods, Instance Segmentation, Recent DETR-based, Video Instance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:Recent DETR-based methods have advanced the development of Video Instance Segmentation (VIS) through transformers’ efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs, which model video sequences either via video-level queries only or via query-sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video-level query embeddings and designs two key modules to synchronize video-level query embeddings with frame-level query embeddings: a synchronized video-frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes the mutual learning of frame-level and video-level embeddings, and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube-VIS 2019, 2021, and 2022 benchmarks and the OVIS benchmark, where SyncVIS achieves state-of-the-art results, demonstrating the effectiveness and generality of the proposed approach. The code is available at this https URL.

[CV-151] Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration

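Quick read: This paper addresses generalization in real-world image restoration, in particular the "generative capability deactivation" that diffusion-based restoration methods encounter on out-of-distribution real-world data. The key to the solution is to use text as an auxiliary invariant representation that reactivates the generative capabilities of these models. The paper identifies two key properties of the text input, richness and relevance, and examines their influence on model performance. Building on these insights, it introduces Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation level, effectively mitigating response failures. It also presents RealIR, a new benchmark designed to capture diverse real-world scenarios. Experiments show that Res-Captioner significantly improves the generalization of diffusion-based restoration models while remaining plug-and-play.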

Link: https://arxiv.org/abs/2412.00878
Authors: Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang
Keywords: central challenge, Abstract, diffusion-based restoration, restoration, diffusion-based restoration methods
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter “generative capability deactivation” when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.

[CV-152] Thermal Vision: Pioneering Non-Invasive Temperature Tracking in Congested Spaces

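Quick read: This paper addresses non-invasive body temperature monitoring in dense settings such as movie theaters or classrooms. Existing research focuses mainly on sparse settings, yet the risk of disease transmission is higher in dense ones, so a temperature estimation method designed for dense settings is needed. The key to the solution is to combine a thermal camera with an edge device, using YOLO models for face detection and a regression framework for temperature estimation. Evaluated on a dataset collected in dense and sparse settings, the face detection model reaches an mAP above 84 in both in-dataset and cross-dataset evaluation, and the regression framework achieves a mean square error of 0.18 °C and an R² score of 0.96, indicating that the system is effective for continuous temperature monitoring in practice.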

Link: https://arxiv.org/abs/2412.00863
Authors: Arijit Samal, Haroon R Lone
Keywords: isolating symptomatic individuals, individuals plays, symptomatic individuals, Non-invasive temperature, temperature monitoring
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Non-invasive temperature monitoring of individuals plays a crucial role in identifying and isolating symptomatic individuals. Temperature monitoring becomes particularly vital in settings characterized by close human proximity, often referred to as dense settings. However, existing research on non-invasive temperature estimation using thermal cameras has predominantly focused on sparse settings. Unfortunately, the risk of disease transmission is significantly higher in dense settings like movie theaters or classrooms. Consequently, there is an urgent need to develop robust temperature estimation methods tailored explicitly for dense settings. Our study proposes a non-invasive temperature estimation system that combines a thermal camera with an edge device. Our system employs YOLO models for face detection and utilizes a regression framework for temperature estimation. We evaluated the system on a diverse dataset collected in dense and sparse settings. Our proposed face detection model achieves an impressive mAP score of over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the regression framework demonstrates remarkable performance with a mean square error of 0.18 °C and an impressive R² score of 0.96. Our experiments’ results highlight the developed system’s effectiveness, positioning it as a promising solution for continuous temperature monitoring in real-world applications. With this paper, we release our dataset and programming code publicly.
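
For readers unfamiliar with the two regression metrics quoted above, the sketch below computes mean squared error and the R² score from their standard definitions. The temperature readings are made-up illustrative values, not data from the paper.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical body-temperature readings in degrees Celsius (not from the paper).
y_true = np.array([36.5, 37.0, 38.2, 36.8])
y_pred = np.array([36.6, 36.9, 38.0, 36.9])
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

An R² close to 1 means the predictions explain most of the variance in the reference temperatures, which is how the 0.96 reported in the abstract should be read.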

[CV-153] Toward Real-Time Edge AI: Model-Agnostic Task-Oriented Communication with Visual Feature Alignment

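Quick read: This paper addresses the inconsistent feature spaces that task-oriented communication faces when edge inference systems must communicate across models. The key to the solution is a new framework that uses shared anchor data and performs feature alignment on the server side and on the device side, so that different systems can collaborate effectively. On the server side, it exploits the linear invariance of visual features and estimates a linear transformation from the encoded features of the anchor data. On the device side, it exploits the angle-preserving nature of visual features and encodes relative representations with respect to the anchor data, which streamlines cross-model communication without additional alignment procedures during inference.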

Link: https://arxiv.org/abs/2412.00862
Authors: Songjie Xie, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Keywords: relevant task information, optimizing learning-based modules, transmit relevant task, Task-oriented communication presents, task information
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Task-oriented communication presents a promising approach to improve the communication efficiency of edge inference systems by optimizing learning-based modules to extract and transmit relevant task information. However, real-time applications face practical challenges, such as incomplete coverage and potential malfunctions of edge servers. This situation necessitates cross-model communication between different inference systems, enabling edge devices from one service provider to collaborate effectively with edge servers from another. Independent optimization of diverse edge systems often leads to incoherent feature spaces, which hinders the cross-model inference for existing task-oriented communication. To facilitate and achieve effective cross-model task-oriented communication, this study introduces a novel framework that utilizes shared anchor data across diverse systems. This approach addresses the challenge of feature alignment in both server-based and on-device scenarios. In particular, by leveraging the linear invariance of visual features, we propose efficient server-based feature alignment techniques to estimate linear transformations using encoded anchor data features. For on-device alignment, we exploit the angle-preserving nature of visual features and propose to encode relative representations with anchor data to streamline cross-model communication without additional alignment procedures during the inference. The experimental results on computer vision benchmarks demonstrate the superior performance of the proposed feature alignment approaches in cross-model task-oriented communications. The runtime and computation overhead analysis further confirm the effectiveness of the proposed feature alignment approaches in real-time applications.
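
Both alignment ideas can be illustrated on toy features. The sketch below assumes, purely for illustration, that two encoders differ by an orthogonal transform `Q`. It estimates a linear map from shared anchor features by least squares (the server-based case) and builds cosine-similarity relative representations (the on-device case). The function names and the orthogonal-mismatch assumption are hypothetical and not taken from the paper.

```python
import numpy as np

def estimate_linear_map(src_anchor, dst_anchor):
    """Least-squares M such that src_anchor @ M approximates dst_anchor."""
    M, *_ = np.linalg.lstsq(src_anchor, dst_anchor, rcond=None)
    return M

def relative_representation(feats, anchor_feats):
    """Cosine similarity of each feature to every anchor feature."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    return f @ a.T

rng = np.random.default_rng(0)
d, n_anchor = 16, 64
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # assumed mismatch between two encoders

anchors_a = rng.normal(size=(n_anchor, d))     # anchor data encoded by system A
anchors_b = anchors_a @ Q                      # same anchors encoded by system B
x_a = rng.normal(size=(5, d))                  # new inputs encoded by system A
x_b = x_a @ Q                                  # the same inputs as system B would encode them

# Server-based alignment: map A-features into B's feature space.
M = estimate_linear_map(anchors_a, anchors_b)
# On-device alignment: angles to the anchors agree without any explicit transform.
rel_a = relative_representation(x_a, anchors_a)
rel_b = relative_representation(x_b, anchors_b)
```

In this toy setting `x_a @ M` reproduces `x_b`, and `rel_a` equals `rel_b`, because an orthogonal transform preserves angles. Real encoders will only satisfy these properties approximately.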

[CV-154] Advanced Video Inpainting Using Optical Flow-Guided Efficient Diffusion

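Quick read: This paper addresses two main problems of diffusion-based video inpainting: maintaining temporal consistency and low computational efficiency. The key to the solution is FloED, an optical-flow-guided efficient diffusion framework. FloED uses a dual-branch architecture in which a flow branch first restores the corrupted flow and a multi-scale flow adapter then provides motion guidance to the main inpainting branch. The paper also proposes a training-free latent interpolation method that uses flow warping to accelerate multi-step denoising, and a flow attention cache mechanism that reduces the computational cost of incorporating optical flow. Experiments show that FloED outperforms state-of-the-art methods in both performance and efficiency on background restoration and object removal.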

Link: https://arxiv.org/abs/2412.00857
Authors: Bohai Gu, Hao Luo, Song Guo, Peiran Dong
Keywords: achieved great improvements, Flow-guided Efficient Diffusion, achieved great, great improvements, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, diffusion-based methods have achieved great improvements in the video inpainting task. However, these methods still face many challenges, such as maintaining temporal consistency and high time cost. This paper proposes an advanced video inpainting framework using optical Flow-guided Efficient Diffusion, called FloED. Specifically, FloED employs a dual-branch architecture, where a flow branch first restores corrupted flow and a multi-scale flow adapter provides motion guidance to the main inpainting branch. Additionally, a training-free latent interpolation method is proposed to accelerate the multi-step denoising process using flow warping. By further introducing a flow attention cache mechanism, FloED efficiently reduces the computational cost brought by incorporating optical flow. Comprehensive experiments in both background restoration and object removal tasks demonstrate that FloED outperforms state-of-the-art methods from the perspective of both performance and efficiency.

[CV-155] DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair

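Quick read: This paper addresses fitting 3D Gaussians in dynamic environments from only two images without prior poses. The solution rests on two technical contributions. First, an object-level two-view bundle adjustment decomposes the dynamic scene into piece-wise rigid components and jointly estimates the camera pose and the motions of dynamic objects. Second, an SE(3) field-driven Gaussian training method enables fine-grained motion modeling through learnable per-Gaussian transformations. Together these yield high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion.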

Link: https://arxiv.org/abs/2412.00851
Authors: Weihang Li, Weirong Chen, Shenhan Qian, Jiajie Chen, Daniel Cremers, Haoang Li
Keywords: shown promising results, Recent advances, Splatting have shown, Gaussian Splatting, promising results
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in 3D Gaussian Splatting have shown promising results. Existing methods typically assume static scenes and/or multiple images with prior poses. Dynamics, sparse views, and unknown poses significantly increase the problem complexity due to insufficient geometric constraints. To overcome this challenge, we propose a method that can use only two images without prior poses to fit Gaussians in dynamic environments. To achieve this, we introduce two technical contributions. First, we propose an object-level two-view bundle adjustment. This strategy decomposes dynamic scenes into piece-wise rigid components, and jointly estimates the camera pose and motions of dynamic objects. Second, we design an SE(3) field-driven Gaussian training method. It enables fine-grained motion modeling through learnable per-Gaussian transformations. Our method leads to high-fidelity novel view synthesis of dynamic scenes while accurately preserving temporal consistency and object motion. Experiments on both synthetic and real-world datasets demonstrate that our method significantly outperforms state-of-the-art approaches designed for the cases of static environments, multiple images, and/or known poses. Our project page is available at this https URL.
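
A per-Gaussian SE(3) transformation acts on a 3D Gaussian in a standard way: the mean becomes R·μ + t and the covariance becomes R·Σ·Rᵀ. The sketch below shows only this transformation step with hand-picked values. How the paper parameterizes or learns the SE(3) field is not covered, and the helper names are hypothetical.

```python
import numpy as np

def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def transform_gaussians(means, covs, rotations, translations):
    """Apply one rigid transform (R_i, t_i) to each Gaussian (mean_i, cov_i)."""
    new_means = np.einsum("nij,nj->ni", rotations, means) + translations
    # R @ cov @ R^T for every Gaussian in the batch.
    new_covs = np.einsum("nij,njk,nlk->nil", rotations, covs, rotations)
    return new_means, new_covs

means = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
covs = np.stack([np.diag([1.0, 4.0, 9.0]), np.diag([2.0, 2.0, 2.0])])
rotations = np.stack([rotation_z(np.pi / 2), np.eye(3)])   # illustrative, not learned
translations = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
new_means, new_covs = transform_gaussians(means, covs, rotations, translations)
```

In a training setup, `rotations` and `translations` would be the learnable quantities, one pair per Gaussian, which is what allows fine-grained motion rather than a single rigid motion for the whole scene.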

[CV-156] SAGA: Surface-Aligned Gaussian Avatar

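Quick read: This paper addresses dynamic human reconstruction from monocular video. In novel view and novel pose synthesis, Gaussians tend to overfit constantly changing regions such as clothing wrinkles or shadows, which produces noisy geometry and abrupt deformation that generalize poorly to novel views. The key to the solution is the Surface-Aligned Gaussian Avatar (SAGA), which aligns the Gaussians with a mesh to enforce well-defined geometry and consistent deformation, improving generalization to novel views and poses. SAGA uses a two-stage alignment strategy. In the first, Adhered Stage, the Gaussians are attached to the mesh but allowed to flow along its surface for greater flexibility. In the second, Detached Stage, a Gaussian-Mesh Alignment regularization minimizes the location and orientation offsets between the Gaussians and their bound triangles, keeping geometric alignment while releasing expressivity. A Walking-on-Mesh strategy dynamically updates the bound triangles so that Gaussians do not drift away from them during optimization.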

Link: https://arxiv.org/abs/2412.00845
Authors: Ronghan Chen, Yang Cong, Jiayue Liu
Keywords: ensuring fast training, Surface-Aligned Gaussian representation, creating animatable human, animatable human avatars, pose synthesis performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Submitted to TPAMI. Major Revision. Project page: this https URL

Abstract:This paper presents a Surface-Aligned Gaussian representation for creating animatable human avatars from monocular videos, aiming at improving the novel view and pose synthesis performance while ensuring fast training and real-time rendering. Recently, 3DGS has emerged as a more efficient and expressive alternative to NeRF, and has been used for creating dynamic human avatars. However, when applied to the severely ill-posed task of monocular dynamic reconstruction, the Gaussians tend to overfit the constantly changing regions such as clothes wrinkles or shadows since these regions cannot provide consistent supervision, resulting in noisy geometry and abrupt deformation that typically fail to generalize under novel views and poses. To address these limitations, we present SAGA, i.e., Surface-Aligned Gaussian Avatar, which aligns the Gaussians with a mesh to enforce well-defined geometry and consistent deformation, thereby improving generalization under novel views and poses. Unlike existing strict alignment methods that suffer from limited expressive power and low realism, SAGA employs a two-stage alignment strategy where the Gaussians are first adhered to and then detached from the mesh, thus facilitating both good geometry and high expressivity. In the Adhered Stage, we improve the flexibility of Adhered-on-Mesh Gaussians by allowing them to flow on the mesh, in contrast to existing methods that rigidly bind Gaussians to fixed locations. In the second Detached Stage, we introduce a Gaussian-Mesh Alignment regularization, which allows us to unleash the expressivity by detaching the Gaussians but maintain the geometric alignment by minimizing their location and orientation offsets from the bound triangles. Finally, since the Gaussians may drift outside the bound triangles during optimization, an efficient Walking-on-Mesh strategy is proposed to dynamically update the bound triangles.
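
The abstract describes the Gaussian-Mesh Alignment regularization only as minimizing location and orientation offsets from the bound triangles. One possible form of such a penalty is sketched below: a squared distance to a barycentric anchor point plus one minus the cosine between the Gaussian's normal and the triangle normal. Both terms and the name `alignment_penalty` are assumptions for illustration, not the paper's loss.

```python
import numpy as np

def triangle_normal(tri):
    """Unit normal of a triangle given as a (3, 3) array of vertices."""
    n = np.cross(tri[1] - tri[0], tri[2] - tri[0])
    return n / np.linalg.norm(n)

def alignment_penalty(center, normal, tri, bary):
    """Squared position offset plus (1 - cosine) orientation offset (assumed form)."""
    anchor = bary @ tri                       # point on the triangle at the barycentric coords
    position_term = np.sum((center - anchor) ** 2)
    orientation_term = 1.0 - np.dot(normal, triangle_normal(tri))
    return float(position_term + orientation_term)

tri = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bary = np.array([1 / 3, 1 / 3, 1 / 3])
# A Gaussian sitting on the triangle and facing along its normal is not penalized.
on_surface = alignment_penalty(bary @ tri, np.array([0.0, 0.0, 1.0]), tri, bary)
# A Gaussian lifted off the surface and tilted by 90 degrees is penalized.
drifted = alignment_penalty(bary @ tri + np.array([0.0, 0.0, 0.2]),
                            np.array([0.0, 1.0, 0.0]), tri, bary)
```

A soft penalty of this kind lets Gaussians leave the surface when the image evidence demands it, in contrast to a hard binding that fixes them to the mesh.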

[CV-157] AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer

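Quick read: This paper addresses the accuracy of animal pose and shape estimation, which quantitative analysis of animal behavior and biomechanics depends on, especially across multi-species datasets. The key to the solution is AniMer, which combines a high-capacity Transformer backbone with an animal-family supervised contrastive learning scheme, so that a single framework learns to discriminate the shapes of different quadrupeds. The paper also introduces CtrlAni3D, a large-scale synthetic dataset created with a diffusion-based conditional image generation pipeline to increase the diversity of 3D-labeled data. Combined with several open-source datasets, this gives 41.3k annotated images for training and validation, and AniMer outperforms existing methods on multiple datasets, including the out-of-distribution Animal Kingdom dataset.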

Link: https://arxiv.org/abs/2412.00837
Authors: Jin Lyu, Tianyi Zhu, Yi Gu, Li Lin, Pujin Cheng, Yebin Liu, Xiaoying Tang, Liang An
Keywords: biomechanics requires accurate, Quantitative analysis, requires accurate animal, accurate animal pose, estimation across species
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quantitative analysis of animal behavior and biomechanics requires accurate animal pose and shape estimation across species, and is important for animal welfare and biological research. However, the small network capacity of previous methods and limited multi-species datasets leave this problem underexplored. To this end, this paper presents AniMer to estimate animal pose and shape using a family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. A key insight of AniMer is its integration of a high-capacity Transformer-based backbone and an animal family supervised contrastive learning scheme, unifying the discriminative understanding of various quadrupedal shapes within a single framework. For effective training, we aggregate most available open-sourced quadrupedal datasets, either with 3D or 2D labels. To improve the diversity of 3D labeled data, we introduce CtrlAni3D, a novel large-scale synthetic dataset created through a new diffusion-based conditional image generation pipeline. CtrlAni3D consists of about 10k images with pixel-aligned SMAL labels. In total, we obtain 41.3k annotated images for training and validation. Consequently, the combination of a family-aware Transformer network and an expansive dataset enables AniMer to outperform existing methods not only on 3D datasets like Animal3D and CtrlAni3D, but also on the out-of-distribution Animal Kingdom dataset. Ablation studies further demonstrate the effectiveness of our network design and CtrlAni3D in enhancing the performance of AniMer for in-the-wild applications. The project page of AniMer is this https URL.

[CV-158] Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models

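[Quick read]: This paper addresses object pose estimation from a single view, in particular the pose ambiguity caused by partial observability, occlusions, and object symmetries. The key to the solution is training a diffusion-based generative model for 6D object pose estimation. During inference the model can sample multiple pose hypotheses, which two novel and effective pose selection strategies distill into a single pose estimate. Unlike existing methods, which rely mainly on the image domain and use depth information only for final pose refinement, the model operates solely on point cloud data. It builds on recent advances in point cloud processing and operates on an SE(3)-equivariant latent space, which improves inference time; the experiments also show the effectiveness of the design choices.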

Link: https://arxiv.org/abs/2412.00835
Authors: Christian Möller, Niklas Funk, Jan Peters
Keywords: single view remains, Object pose estimation, challenging problem, view remains, remains a challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at this https URL .

[CV-159] AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

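[Quick read]: This paper addresses cross-modal alignment in multimodal representation fusion: the quadratic computational complexity of Transformer-based methods limits their use on long sequences or large-scale data, while Mamba-based methods struggle to model cross-modal relationships comprehensively. The key to the solution is AlignMamba, which introduces a local cross-modal alignment module grounded in Optimal Transport to explicitly learn token-level correspondences between modalities, together with a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce consistency between the modal distributions. The unimodal representations, after local and global alignment, are then passed to the Mamba backbone for further cross-modal interaction and multimodal fusion.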

Link: https://arxiv.org/abs/2412.00833
Authors: Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang
Keywords: inherent heterogeneity, Cross-modal, representation fusion due, Cross-modal alignment, multimodal fusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Cross-modal alignment is crucial for multimodal representation fusion due to the inherent heterogeneity between modalities. While Transformer-based methods have shown promising results in modeling inter-modal relationships, their quadratic computational complexity limits their applicability to long-sequence or large-scale data. Although recent Mamba-based approaches achieve linear complexity, their sequential scanning mechanism poses fundamental challenges in comprehensively modeling cross-modal relationships. To address this limitation, we propose AlignMamba, an efficient and effective method for multimodal fusion. Specifically, grounded in Optimal Transport, we introduce a local cross-modal alignment module that explicitly learns token-level correspondences between different modalities. Moreover, we propose a global cross-modal alignment loss based on Maximum Mean Discrepancy to implicitly enforce the consistency between different modal distributions. Finally, the unimodal representations after local and global alignment are passed to the Mamba backbone for further cross-modal interaction and multimodal fusion. Extensive experiments on complete and incomplete multimodal fusion tasks demonstrate the effectiveness and efficiency of the proposed method.
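
To make the global alignment idea concrete, the following minimal sketch computes a biased estimate of squared Maximum Mean Discrepancy between two sets of feature vectors. The Gaussian kernel, the bandwidth `sigma`, and the toy feature lists are assumptions chosen for illustration; the abstract does not state which kernel or implementation AlignMamba uses.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""

    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, sigma) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D token features standing in for two modalities.
audio = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
video_near = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.2]]  # similar distribution
video_far = [[3.0, 3.1], [3.2, 2.9], [3.1, 3.0]]   # shifted distribution

print(mmd_squared(audio, video_near))  # small value
print(mmd_squared(audio, video_far))   # much larger value
```

Used as a loss term, a quantity like this shrinks as the two feature distributions move closer, which is the sense in which it enforces consistency implicitly rather than through token-level matches.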

[CV-160] EventGPT: Event Stream Understanding with Multimodal Large Language Models

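[Quick read]: This paper addresses the shortcomings of existing multimodal large language models (MLLMs) in handling event camera data, especially in scenes with poor lighting or high dynamic range. The key to the solution is EventGPT, which the authors describe as the first MLLM for event stream understanding. To bridge the large domain gap between event data and language models, the paper proposes a three-stage optimization paradigm. First, GPT-generated RGB image-text pairs warm up the linear projector, following LLaVA. Second, a large synthetic dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts, is built to train the spatio-temporal aggregator and the event-language adapter, aligning event features more closely with the language space. Finally, an instruction dataset, Event-Chat, containing extensive real-world data, is collected to fine-tune the whole model and improve its generalization.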

Link: https://arxiv.org/abs/2412.00832
Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, Ming Li
Keywords: cameras record visual, record visual information, asynchronous pixel change, pixel change streams, Event cameras record
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. To mitigate the huge domain gaps, we develop a three-stage optimization paradigm to gradually equip a pre-trained LLM with the capability of understanding event-based scenes. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, referring to LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.

[CV-161] Categorical Keypoint Positional Embedding for Robust Animal Re-Identification

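[Quick read]: This paper addresses a key problem in animal re-identification (ReID): high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data make identification across species complex and costly. The key to the solution is a keypoint propagation mechanism that uses a single annotated image and a pre-trained diffusion model to propagate keypoints across an entire dataset, significantly reducing manual annotation cost. In addition, Keypoint Positional Encoding (KPE) and Categorical Keypoint Positional Embedding (CKPE) are implemented in the Vision Transformer (ViT), enabling it to learn more robust, semantically aware representations. This yields more comprehensive and detailed keypoint representations and, in turn, more accurate and efficient animal re-identification.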

Link: https://arxiv.org/abs/2412.00818
Authors: Yuhao Lin, Lingqiao Liu, Javen Shi
Keywords: analyzing behavioral patterns, assessing ecological impacts, tracking population dynamics, informed conservation strategies, ecological research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: In review

Abstract:Animal re-identification (ReID) has become an indispensable tool in ecological research, playing a critical role in tracking population dynamics, analyzing behavioral patterns, and assessing ecological impacts, all of which are vital for informed conservation strategies. Unlike human ReID, animal ReID faces significant challenges due to the high variability in animal poses, diverse environmental conditions, and the inability to directly apply pre-trained models to animal data, making the identification process across species more complex. This work introduces an innovative keypoint propagation mechanism, which utilizes a single annotated image and a pre-trained diffusion model to propagate keypoints across an entire dataset, significantly reducing the cost of manual annotation. Additionally, we enhance the Vision Transformer (ViT) by implementing Keypoint Positional Encoding (KPE) and Categorical Keypoint Positional Embedding (CKPE), enabling the ViT to learn more robust and semantically-aware representations. This provides more comprehensive and detailed keypoint representations, leading to more accurate and efficient re-identification. Our extensive experimental evaluations demonstrate that this approach significantly outperforms existing state-of-the-art methods across four wildlife datasets. The code will be publicly released.

[CV-162] Motion-Aware Optical Camera Communication with Event Cameras

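[Quick read]: This paper addresses the performance degradation of optical camera communication systems under dynamic conditions such as screen refreshing and rapid camera motion. The key to the solution is to use an event camera as the receiver and to design event-based tracking algorithms. The event camera's capabilities mitigate the problems caused by screen refresh rates and camera motion, enabling a throughput of up to 114 Kbps in static conditions, and 1 cm localization accuracy with a 1% bit error rate under various camera motions.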

Link: https://arxiv.org/abs/2412.00816
Authors: Hang Su, Ling Gao, Tao Liu, Laurent Kneip
Keywords: Optical Camera Communication, smart mobile devices, mobile devices continues, Camera Communication systems, Communication systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Abstract:As the ubiquity of smart mobile devices continues to rise, Optical Camera Communication systems have gained more attention as a solution for efficient and private data streaming. This system utilizes optical cameras to receive data from digital screens via visible light. Despite their promise, most of them are hindered by dynamic factors such as screen refreshing and rapid camera motion. CMOS cameras, often serving as the receivers, suffer from limited frame rates and motion-induced image blur, which degrade overall performance. To address these challenges, this paper unveils a novel system that utilizes event cameras. We introduce a dynamic visual marker and design event-based tracking algorithms to achieve fast localization and data streaming. Remarkably, the event camera’s unique capabilities mitigate issues related to screen refresh rates and camera motion, enabling a high throughput of up to 114 Kbps in static conditions, and a 1 cm localization accuracy with 1% bit error rate under various camera motions.

[CV-163] Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

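[Quick read]: This paper addresses the heavy dependence of video moment retrieval on manual annotation and proposes a new pretraining paradigm to reduce annotation cost. The key to the solution is Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset of unlabeled real-world videos with pseudo annotations, together with the ReCorrect algorithm, which handles the problems caused by imperfect pseudo annotations during pretraining. ReCorrect has two main phases: semantics-guided refinement and memory-consensus correction. The first uses semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. The second uses a memory bank to track model predictions and progressively corrects the temporal boundaries based on consensus within the memory. Experiments show that ReCorrect generalizes well across multiple downstream settings: zero-shot ReCorrect achieves over 75% and 80% of the best fully supervised performance on two benchmarks, and unsupervised ReCorrect reaches about 85% on both.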

Link: https://arxiv.org/abs/2412.00811
Authors: Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot
Keywords: natural language query, video moment retrieval, moment retrieval aims, Moment Retrieval Pretraining, language query
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect’s strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at this https URL.
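
The memory-consensus idea can be pictured with a small sketch. Everything here is a hypothetical illustration, not the authors' algorithm: the class name `BoundaryMemory`, the use of the median as the consensus, and the `momentum` blending factor are all assumptions, since the abstract only says that a memory bank tracks predictions and that boundaries are corrected progressively from consensus.

```python
from statistics import median


class BoundaryMemory:
    """Hypothetical memory bank of predicted (start, end) times per sample."""

    def __init__(self, momentum=0.5):
        self.history = {}         # sample_id -> list of (start, end) predictions
        self.momentum = momentum  # how far to move a label toward the consensus

    def record(self, sample_id, start, end):
        """Store one model prediction for a sample."""
        self.history.setdefault(sample_id, []).append((start, end))

    def corrected(self, sample_id, pseudo_start, pseudo_end):
        """Blend a pseudo label with the median of the stored predictions."""
        preds = self.history.get(sample_id)
        if not preds:
            return pseudo_start, pseudo_end  # nothing recorded yet
        consensus_start = median(s for s, _ in preds)
        consensus_end = median(e for _, e in preds)
        m = self.momentum
        return ((1 - m) * pseudo_start + m * consensus_start,
                (1 - m) * pseudo_end + m * consensus_end)


memory = BoundaryMemory(momentum=0.5)
for start, end in [(4.0, 9.0), (5.0, 10.0), (6.0, 11.0)]:
    memory.record("clip-1", start, end)

# A loose pseudo label (2 s to 14 s) is pulled toward the consensus (5 s to 10 s).
print(memory.corrected("clip-1", 2.0, 14.0))
```

Applying a correction of this kind repeatedly moves each pseudo boundary a little at a time, which matches the progressive character described in the abstract.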

[CV-164] EDTformer: An Efficient Decoder Transformer for Visual Place Recognition

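[Quick read]: This paper addresses global feature representation for visual place recognition (VPR). The key to the solution is an Efficient Decoder Transformer (EDTformer) for feature aggregation. EDTformer stacks several simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations. Specifically, it treats deep features as the keys and values and a set of independent learnable parameters as the queries, so that it can fully use the contextual information in the deep features and gradually decode and aggregate the effective features into the final global representation. The paper also uses the foundation model DINOv2 as the backbone and enhances its feature extraction with a Low-Rank Parallel Adaptation (LoPA) method to further improve robustness.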

Link: https://arxiv.org/abs/2412.00784
Authors: Tong Jin, Feng Lu, Shuyu Hu, Chun Yuan, Yunpeng Liu
Keywords: Visual place recognition, large geo-tagged database, general geographical location, retrieving visually similar, visually similar images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 6 figures

Abstract:Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches typically focus on the aggregation of deep features extracted from a backbone through using current prominent architectures (e.g., CNNs, MLPs, pooling layer and transformer encoder), giving little attention to the transformer decoder. However, we argue that its strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly generate robust and discriminative global representations for VPR. Specifically, we do this by formulating deep features as the keys and values, as well as a set of independent learnable parameters as the queries. EDTformer can fully utilize the contextual information within deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representations. Moreover, to provide powerful deep features for EDTformer and further facilitate the robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which can refine the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods which add a re-ranking with considerable cost. Code will be available at this https URL.
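
The query/key/value arrangement described above can be sketched as a single cross-attention step. This is a simplified illustration under stated assumptions: the queries are fixed numbers standing in for learned parameters, there are no projection matrices, and the residual connections, feed-forward parts, and final linear layers of a real decoder block are omitted. It is not the EDTformer implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query attends over the features, which act as keys and values."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in features]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, features))
                        for d in range(dim)])
    return outputs


def aggregate(queries, features):
    """Flatten the attended queries into one global descriptor."""
    return [x for row in cross_attention(queries, features) for x in row]


# Hypothetical deep features for three image patches (2-D for readability).
patch_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Stand-ins for learnable queries; a trained model would learn these values.
learned_queries = [[10.0, 0.0], [0.0, 10.0]]

descriptor = aggregate(learned_queries, patch_features)
print(len(descriptor))  # number of queries times the feature dimension
```

The descriptor length depends only on the number of queries and the feature dimension, not on how many patches the backbone produces, which is one reason a query-based aggregator is convenient for retrieval.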

[CV-165] Memories of Forgotten Concepts

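[Quick read]: Diffusion models dominate text-to-image generation, yet they may produce undesirable outputs that include explicit content or private data. Concept ablation techniques have been explored to limit the generation of certain concepts. This paper reveals that information about an erased concept persists in the model and that images of the erased concept can still be generated from the right latent. The key is the use of inversion methods to show that there exist latent seeds capable of generating high-quality images of erased concepts, and that the likelihoods of these latents overlap with those of images outside the erased concept. The paper further shows that, for every image in the erased concept set, many seeds can be generated that reproduce the erased concept. Given the vast space of latents capable of generating ablated-concept images, the results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.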

Link: https://arxiv.org/abs/2412.00782
Authors: Matan Rusanovsky, Shimon Malnick, Amir Jevnisek, Ohad Fried, Shai Avidan
Keywords: Diffusion models dominate, produce undesirable outputs, including explicit content, Diffusion models, erased concept
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: The first three authors contributed equally to this work. Project page: this https URL

Abstract:Diffusion models dominate the space of text-to-image generation, yet they may produce undesirable outputs, including explicit content or private data. To mitigate this, concept ablation techniques have been explored to limit the generation of certain concepts. In this paper, we reveal that the erased concept information persists in the model and that erased concept images can be generated using the right latent. Utilizing inversion methods, we show that there exist latent seeds capable of generating high quality images of erased concepts. Moreover, we show that these latents have likelihoods that overlap with those of images outside the erased concept. We extend this to demonstrate that for every image from the erased concept set, we can generate many seeds that generate the erased concept. Given the vast space of latents capable of generating ablated concept images, our results suggest that fully erasing concept information may be intractable, highlighting possible vulnerabilities in current concept ablation techniques.

[CV-166] Local vs. Global: Local Land-Use and Land-Cover Models Deliver Higher Quality Maps

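[Quick read]: This paper addresses the low accuracy and inconsistency of land-use and land-cover maps in Africa, which stem partly from the lack of representative training data. The key to the solution is a data-centric framework with a teacher-student model setup that uses diverse sources of satellite images and label examples to produce local land-cover maps. Specifically, the method trains a high-resolution teacher model (0.331 m/pixel) and a low-resolution student model (10 m/pixel), and the student uses the teacher's output as weak label examples through knowledge distillation. In a case study of Murang'a County, Kenya, the framework improves the F1 score by 0.14 and Intersection-over-Union by 0.21 compared with the best global map, and the evaluation reveals inconsistencies among existing global maps, with a maximum agreement rate of 0.30.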

Link: https://arxiv.org/abs/2412.00777
Authors: Girmaw Abebe Tadesse, Caleb Robinson, Charles Mwangi, Esther Maina, Joshua Nyakundi, Luana Marotti, Gilles Quentin Hacheme, Hamed Alemohammad, Rahul Dodhia, Juan M. Lavista Ferres
Keywords: million people experienced, people experienced moderate, Africa population suffered, severe food insecurity, suffered from undernourishment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Approximately 20% of Africa’s population suffered from undernourishment, and 868 million people experienced moderate to severe food insecurity in 2022. Land-use and land-cover maps provide crucial insights for addressing food insecurity, e.g., by mapping croplands. The development of global land-cover maps has been facilitated by the increasing availability of earth observation data and advancements in geospatial machine learning. However, these global maps exhibit lower accuracy and inconsistencies in Africa, partly due to the lack of representative training data. To address this issue, we propose a data-centric framework with a teacher-student model setup, which uses diverse data sources of satellite images and label examples to produce local land-cover maps. Our method trains a high-resolution teacher model on images with a resolution of 0.331 m/pixel and a low-resolution student model on publicly available images with a resolution of 10 m/pixel. The student model also utilizes the teacher model’s output as its weak label examples through knowledge distillation. We evaluated our framework using Murang’a County, Kenya, as a use case and achieved significant improvements, i.e., 0.14 in the F1 score and 0.21 in Intersection-over-Union, compared to the best global map. Our evaluation also revealed inconsistencies in existing global maps, with a maximum agreement rate of 0.30 among themselves. Insights obtained from our cross-collaborative work can provide valuable guidance to local and national policymakers in making informed decisions to improve resource utilization and food security.
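
For readers less familiar with the two reported metrics, the sketch below computes F1 and Intersection-over-Union for a binary map. The flattened "cropland" masks are made-up values for illustration only and have no connection to the paper's data.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def f1_score(pred, truth):
    """F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(pred, truth):
    """Intersection-over-Union = TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


# Hypothetical flattened masks: 1 = cropland pixel, 0 = other.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 0, 1, 0]

print(f1_score(pred, truth))  # 0.75
print(iou(pred, truth))       # 0.6
```

For a single class the two metrics are tied by F1 = 2·IoU / (1 + IoU), so IoU is never higher than F1 on the same prediction.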

[CV-167] DIVD: Deblurring with Improved Video Diffusion Model

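[Quick read]: This paper addresses video deblurring, where blur is complex because it results from a combination of camera shake and object motion, and where conventional distortion-based metrics such as PSNR correlate only weakly with human perception. The key to the solution is to introduce a video diffusion model and improve it to exploit the highly correlated information between adjacent frames and to handle temporal misalignment. With these improvements, the model outperforms existing models on a range of perceptual metrics, achieving state-of-the-art results while preserving image detail and keeping distortion metrics competitive. According to the authors, this is the first time a diffusion model has been applied to video deblurring to overcome these limitations.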

Link: https://arxiv.org/abs/2412.00773
Authors: Haoyang Long, Yan Wang, Wendong Wang
Keywords: video diffusion models, video diffusion, Video deblurring, Video, Diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Video deblurring presents a considerable challenge owing to the complexity of blur, which frequently results from a combination of camera shake and object motion. In the field of video deblurring, many previous works have primarily concentrated on distortion-based metrics, such as PSNR. However, this approach often results in a weak correlation with human perception and yields reconstructions that lack realism. Diffusion models and video diffusion models have respectively excelled in the fields of image and video generation, particularly achieving remarkable results in terms of image authenticity and realistic perception. However, due to the computational complexity and challenges inherent in adapting diffusion models, there is still uncertainty regarding the potential of video diffusion models in video deblurring tasks. To explore the viability of video diffusion models in the task of video deblurring, we introduce a diffusion model specifically for this purpose. In this field, leveraging highly correlated information between adjacent frames and addressing the challenge of temporal misalignment are crucial research directions. To tackle these challenges, many improvements based on the video diffusion model are introduced in this work. As a result, our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics. Our model preserves a significant amount of detail in the images while maintaining competitive distortion metrics. Furthermore, to the best of our knowledge, this is the first time a diffusion model has been applied in video deblurring to overcome the limitations mentioned above.
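
A small sketch may help show why a distortion metric such as PSNR can diverge from perception. PSNR depends only on the mean squared error, so two restorations with very different error patterns can score identically. The pixel values below are made up for illustration, and an 8-bit range (`max_value=255`) is assumed.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel "images".
reference = [100, 150, 200, 240]
uniform_shift = [110, 160, 210, 250]  # every pixel brighter by 10
alternating = [110, 140, 210, 230]    # errors of +10 and -10 in turn

# Both restorations have the same MSE (100), hence the same PSNR.
print(psnr(reference, uniform_shift))
print(psnr(reference, alternating))
```

A uniform brightness shift and a flickering error pattern look quite different to a viewer, yet this metric cannot tell them apart, which is the gap that perceptual metrics are meant to cover.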

[CV-168] Explorations in Self-Supervised Learning: Dataset Composition Testing for Object Classification

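[Quick read]: This paper investigates how sampling and pretraining on datasets with different image characteristics affect the object classification performance of self-supervised learning (SSL) models. The approach samples two apartment datasets from the Omnidata platform according to modality, luminosity, image size, and camera field of view, and uses them to pretrain a SimCLR model. The encodings produced by the pretrained model are then transferred to a supervised ResNet-50 model for object classification. The experiments show that depth-pretrained models are more effective on low-resolution images, while RGB-pretrained models perform better on higher-resolution images. Increasing the luminosity of training images can also improve performance on low-resolution images without harming performance on higher-resolution images.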

Link: https://arxiv.org/abs/2412.00770
Authors: Raynor Kirkson E. Chavez, Kyle Gabriel M. Reynoso
Keywords: self-supervised learning, object classification, paper investigates, investigates the impact, impact of sampling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:This paper investigates the impact of sampling and pretraining using datasets with different image characteristics on the performance of self-supervised learning (SSL) models for object classification. To do this, we sample two apartment datasets from the Omnidata platform based on modality, luminosity, image size, and camera field of view and use them to pretrain a SimCLR model. The encodings generated from the pretrained model are then transferred to a supervised ResNet-50 model for object classification. Through A/B testing, we find that depth-pretrained models are more effective on low-resolution images, while RGB-pretrained models perform better on higher-resolution images. We also discover that increasing the luminosity of training images can improve the performance of models on low-resolution images without negatively affecting their performance on higher-resolution images.
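
SimCLR pretraining is driven by a contrastive objective over pairs of augmented views, commonly called NT-Xent. The sketch below is a plain-Python illustration of that loss on toy embeddings. The temperature value and the 2-D vectors are assumptions for readability, and nothing here reflects the specific training setup used in this paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss: views_a[i] and views_b[i] are two views of one image."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = {k: cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)


# Toy embeddings: in `matched`, each view pair points the same way.
views_a = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
mismatched = [[0.1, 0.9], [0.9, 0.1]]

print(nt_xent(views_a, matched))     # lower loss
print(nt_xent(views_a, mismatched))  # higher loss
```

The loss falls when the two views of an image embed close together relative to every other view in the batch, so what counts as a "view" (for example depth versus RGB input) shapes what the encoder learns.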

[CV-169] DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

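[Quick read]: This paper addresses the alignment of images generated by text-to-image diffusion models with human preferences. Training-based methods are constrained by high computational cost and dataset requirements, while training-free alignment methods remain underexplored and are often limited by inaccurate guidance. The key to the solution is DyMO (Dynamic Multi-Objective), a plug-and-play, training-free alignment method. It introduces a semantic alignment objective that strengthens semantic alignment in the early stages of diffusion, and it dynamically schedules multiple objectives and intermediate recurrent steps, so that generated images align better with human preferences during inference.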

Link: https://arxiv.org/abs/2412.00759
Authors: Xin Xie, Dong Gong
Keywords: critical for improving, training-free alignment methods, training-free alignment, generated images, alignment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Text-to-image diffusion model alignment is critical for improving the alignment between the generated images and human preferences. While training-based methods are constrained by high computational costs and dataset requirements, training-free alignment methods remain underexplored and are often limited by inaccurate guidance. We propose a plug-and-play training-free alignment method, DyMO, for aligning the generated images and human preferences during inference. Apart from text-aware human preference scores, we introduce a semantic alignment objective for enhancing the semantic alignment in the early stages of diffusion, relying on the fact that the attention maps are effective reflections of the semantics in noisy images. We propose dynamic scheduling of multiple objectives and intermediate recurrent steps to reflect the requirements at different steps. Experiments with diverse pre-trained diffusion models and metrics demonstrate the effectiveness and robustness of the proposed method.

[CV-170] CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images

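[Quick read]: This paper addresses how to represent multiple scenes effectively in a generative neural radiance field (GRAF) while precisely controlling 3D geometry in terms of shape and appearance. The key to the solution is a controllable generative model, CtrlNeRF, which uses a single multilayer perceptron (MLP) with shared weights to represent multiple scenes. By manipulating the shape code (z_s) and the appearance code (z_a), CtrlNeRF generates high-fidelity images with 3D consistency, and it can synthesize novel views that do not exist in the training sets through camera pose alteration and feature interpolation. The experiments show its advantage over existing methods in 3D-aware image generation.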

Link: https://arxiv.org/abs/2412.00754
Authors: Jian Liu, Zhen Yu
Keywords: neural radiance field, radiance field, generative neural radiance, neural radiance, advocates learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:The neural radiance field (NeRF) advocates learning the continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this into a generative model, the generative neural radiance field (GRAF) is capable of producing images from random noise z without 3D supervision. In practice, the shape and appearance are modeled by z_s and z_a, respectively, to manipulate them separately during inference. However, it is challenging to represent multiple scenes using a solitary MLP and precisely control the generation of 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model (i.e., CtrlNeRF) that uses a single MLP network to represent multiple scenes with shared weights. Consequently, we manipulated the shape and appearance codes to realize the controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation compared to its counterparts.
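
Feature interpolation between latent codes can be pictured as a straight-line walk from one code to another. The sketch below is a generic linear interpolation under that assumption; the abstract does not say which interpolation scheme CtrlNeRF uses, and the 2-D codes standing in for z_a are made up.

```python
def lerp(code_a, code_b, t):
    """Linear interpolation between two latent codes for t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(code_a, code_b)]


def interpolation_path(code_a, code_b, steps):
    """Evenly spaced codes from code_a to code_b, endpoints included (steps >= 2)."""
    return [lerp(code_a, code_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical appearance codes for two objects.
z_a_first = [0.0, 2.0]
z_a_second = [1.0, 0.0]

path = interpolation_path(z_a_first, z_a_second, steps=3)
print(path)  # [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```

Feeding each intermediate code to a generator while holding the shape code fixed would, in principle, vary appearance smoothly without changing geometry, which is the kind of separate control the abstract describes.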

[CV-171] Precise Facial Landmark Detection by Dynamic Semantic Aggregation Transformer

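[Quick read]: This paper addresses the mediocre performance of deep neural networks on face alignment, especially for faces with large poses or occlusions. The key to the solution is a Dynamic Semantic-Aggregation Transformer (DSAT), which combines a Dynamic Semantic-Aware (DSA) model and a Dynamic Semantic Specialization (DSS) model to learn more discriminative and representative features. The DSA model estimates the semantic correlations of feature channels, partitions samples into subsets, and activates specific pathways for them, so that specialized features can be learned from each subset. The DSS model mines the homogeneous information from features at different scales to eliminate semantic gaps and ambiguities and to enhance representation ability. Finally, DSAT integrates the two models in both dynamic-architecture and dynamic-parameter manners to achieve more precise face alignment.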

Link: https://arxiv.org/abs/2412.00740
Authors: Jun Wan, He Liu, Yujia Wu, Zhihui Lai, Wenwen Min, Jun Liu
Keywords: deep neural network, neural network methods, deep neural, methods have played, played a dominant
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:At present, deep neural network methods have played a dominant role in the face alignment field. However, they generally use predefined network structures to predict landmarks, which tends to learn general features and leads to mediocre performance, e.g., they perform well on neutral samples but struggle with faces exhibiting large poses or occlusions. Moreover, they cannot effectively deal with semantic gaps and ambiguities among features at different scales, which may hinder them from learning efficient features. To address the above issues, in this paper, we propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative feature (i.e., specialized feature) learning. Specifically, a Dynamic Semantic-Aware (DSA) model is first proposed to partition samples into subsets and activate the specific pathways for them by estimating the semantic correlations of feature channels, making it possible to learn specialized features from each subset. Then, a novel Dynamic Semantic Specialization (DSS) model is designed to mine the homogeneous information from features at different scales for eliminating the semantic gap and ambiguities and enhancing the representation ability. Finally, by integrating the DSA model and DSS model into our proposed DSAT in both dynamic architecture and dynamic parameter manners, more specialized features can be learned for achieving more precise face alignment. It is interesting to show that harder samples can be handled by activating more feature channels. Extensive experiments on popular face alignment datasets demonstrate that our proposed DSAT outperforms state-of-the-art models. The code is available at this https URL.

[CV-172] ChatSplat: 3D Conversational Gaussian Splatting

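[Quick read]: This paper addresses multi-level language interaction in 3D environments, in particular rich chat-based interaction at the object, view, and scene levels. The key to the solution is ChatSplat, a system that constructs a 3D language field supporting interaction at all three levels. At the view level, an encoder encodes the rendered feature map of each view into tokens, which a large language model (LLM) processes for conversation. At the scene level, multi-view tokens are combined so that interactions can consider the entire scene. At the object level, a patch-wise language embedding is used, and the language embedding is explicitly decoupled into separate mask and feature map representations, allowing more flexible object-level interaction. To address the difficulty of learning 3D Gaussians caused by the complex and diverse distribution of the language embeddings used in the LLM, the paper also introduces a learnable normalization technique that standardizes these embeddings and facilitates effective learning.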

Link: https://arxiv.org/abs/2412.00734
Authors: Hanlin Chen, Fangyin Wei, Gim Hee Lee
Keywords: Humans naturally interact, gained growing interest, Humans naturally, growing interest, language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Humans naturally interact with their 3D surroundings using language, and modeling 3D language fields for scene understanding and interaction has gained growing interest. This paper introduces ChatSplat, a system that constructs a 3D language field, enabling rich chat-based interaction within 3D space. Unlike existing methods that primarily use CLIP-derived language features focused solely on segmentation, ChatSplat facilitates interaction on three levels: objects, views, and the entire 3D scene. For view-level interaction, we designed an encoder that encodes the rendered feature map of each view into tokens, which are then processed by a large language model (LLM) for conversation. At the scene level, ChatSplat combines multi-view tokens, enabling interactions that consider the entire scene. For object-level interaction, ChatSplat uses a patch-wise language embedding, unlike LangSplat’s pixel-wise language embedding that implicitly includes mask and embedding. Here, we explicitly decouple the language embedding into separate mask and feature map representations, allowing more flexible object-level interaction. To address the challenge of learning 3D Gaussians posed by the complex and diverse distribution of language embeddings used in the LLM, we introduce a learnable normalization technique to standardize these embeddings, facilitating effective learning. Extensive experimental results demonstrate that ChatSplat supports multi-level interactions – object, view, and scene – within 3D space, enhancing both understanding and engagement.

[CV-173] Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

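[Quick read]: This paper addresses the significant challenges that existing portrait animation methods face with non-frontal perspectives, rendering of dynamic objects around the portrait, and generation of immersive, realistic backgrounds. The key to the solution is what the authors present as the first application of a pretrained transformer-based video generative model to this task; the model shows strong generalization and generates highly dynamic, realistic portrait animation videos. Specifically, the paper designs an identity reference network consisting of a causal 3D VAE (Variational Autoencoder) combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. It also investigates several speech audio conditioning and motion frame mechanisms to enable continuous video generation driven by speech audio. Experiments show substantial improvements over prior methods in generating realistic portrait videos with diverse orientations.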

Link: https://arxiv.org/abs/2412.00733
Authors: Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Keywords (EN): handling non-frontal perspectives, images face significant, face significant challenges, Existing methodologies, rendering dynamic objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: this https URL.

[CV-174] Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention

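[Quick read]: This paper addresses the generation of 3D models from multi-view 2D RGB images, a capability that extends technologies such as virtual reality, robotic vision, and human-machine interaction. The key to the solution is a hybrid strategy that combines convolutional neural networks (CNNs) and transformers. It comprises a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained with a novel Joint Train Separate Optimization (JTSO) algorithm. The self-attention layer transforms the encoded features of unordered inputs into an enhanced feature map, which is decoded into an initial 3D volume and then refined, so the network can generate 3D voxels from one or more 2D images taken from arbitrary viewpoints. Performance evaluations show that the method outperforms existing techniques in both single-view and multi-view 3D reconstruction. In single-view reconstruction in particular, its mean intersection over union (IoU) score surpasses other models by 4.2%.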

Link: https://arxiv.org/abs/2412.00731
Authors: Ajith Balakrishnan, Sreeja S, Linu Shine
Keywords (EN): Robotic Vision, gained significant attention, Virtual Reality, Train Separate Optimization, Joint Train Separate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICVGIP-2024, 8 pages

Abstract:Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like Virtual Reality, Robotic Vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images from arbitrary viewpoints. Performance evaluations using the ShapeNet datasets show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction, achieving the highest mean intersection over union (IOU) scores, surpassing other models by 4.2% in single-view reconstruction.
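
The abstract reports results as mean intersection over union (IoU) on voxel grids. As a minimal sketch of how that metric is commonly computed, the snippet below compares a predicted occupancy grid with a binary ground truth. The function names and the 0.5 binarization threshold are assumptions made for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def voxel_iou(pred, target, threshold=0.5):
    """IoU between a predicted occupancy grid and a binary ground-truth grid.

    Both grids are flat lists of equal length: `pred` holds occupancy
    probabilities and `target` holds 0/1 labels. The threshold is an
    assumed value, not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("grids must have the same number of voxels")
    pred_occ = [p >= threshold for p in pred]
    gt_occ = [t >= 0.5 for t in target]
    intersection = sum(1 for p, t in zip(pred_occ, gt_occ) if p and t)
    union = sum(1 for p, t in zip(pred_occ, gt_occ) if p or t)
    # Two empty grids agree perfectly; this also avoids division by zero.
    return 1.0 if union == 0 else intersection / union


def mean_iou(pairs, threshold=0.5):
    """Average voxel IoU over an iterable of (pred, target) pairs."""
    scores = [voxel_iou(p, t, threshold) for p, t in pairs]
    return sum(scores) / len(scores)
```

A real evaluation would flatten each 3D grid (for example 32x32x32) before calling `voxel_iou`; the flat-list layout here only keeps the sketch dependency-free.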

[CV-175] SEED4D: A Synthetic Ego–Exo Dynamic 4D Data Generator Driving Dataset and Benchmark WACV2025

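[Quick read]: This paper addresses the lack of complex, dynamic, multi-view data for egocentric 3D and 4D reconstruction in the autonomous driving context. The key to the solution is a data generator and dataset named Synthetic Ego-Exo Dynamic 4D (SEED4D). The SEED4D data generator is a customizable, easy-to-use tool for creating spatio-temporal multi-view data, and it supports the camera setups of commonly used datasets such as NuScenes, KITTI360, and Waymo. SEED4D also includes two large-scale multi-view synthetic urban scene datasets: a static (3D) dataset with 212k inward- and outward-facing vehicle images from 2k scenes, and a dynamic (4D) dataset with 16.8M images from 10k trajectories, each sampled at 100 points in time and containing egocentric images, exocentric images, and LiDAR data.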

Link: https://arxiv.org/abs/2412.00730
Authors: Marius Kästingschäfer, Théo Gieruc, Sebastian Bernhard, Dylan Campbell, Eldar Insafutdinov, Eyvaz Najafli, Thomas Brox
Keywords (EN): including few-shot interpolation, including few-shot, extrapolation settings, supervision signals, few-shot interpolation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025. Project page: this https URL. Code: this https URL

Abstract:Models for egocentric 3D and 4D reconstruction, including few-shot interpolation and extrapolation settings, can benefit from having images from exocentric viewpoints as supervision signals. No existing dataset provides the necessary mixture of complex, dynamic, and multi-view data. To facilitate the development of 3D and 4D reconstruction methods in the autonomous driving context, we propose a Synthetic Ego–Exo Dynamic 4D (SEED4D) data generator and dataset. We present a customizable, easy-to-use data generator for spatio-temporal multi-view data creation. Our open-source data generator allows the creation of synthetic data for camera setups commonly used in the NuScenes, KITTI360, and Waymo datasets. Additionally, SEED4D encompasses two large-scale multi-view synthetic urban scene datasets. Our static (3D) dataset encompasses 212k inward- and outward-facing vehicle images from 2k scenes, while our dynamic (4D) dataset contains 16.8M images from 10k trajectories, each sampled at 100 points in time with egocentric images, exocentric images, and LiDAR data. The datasets and the data generator can be found at this https URL.

[CV-176] Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

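[Quick read]: This paper addresses the vulnerability of vision-language models such as CLIP to backdoor attacks. The key to the solution is a simple yet effective mechanism called Perturb and Recover (PAR), which removes potential backdoors through fine-tuning. Extensive experiments across different encoders and types of backdoor attack show that PAR removes backdoors at a high rate while preserving the model's standard performance. The method also remains effective when only synthetic text-image pairs are used, that is, without access to real training data.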

Link: https://arxiv.org/abs/2412.00727
Authors: Naman Deep Singh, Francesco Croce, Matthias Hein
Keywords (EN): including strong retrieval, natural language understanding, linking visual perception, enabling sophisticated image-text, sophisticated image-text capabilities
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, make them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even only with synthetic text-image pairs, i.e. without access to real training data. The code and models are available at this https URL.

[CV-177] Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

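[Quick read]: This paper addresses the challenge of preserving a person's identity while also producing accurate facial details in talking head video generation. The key to the solution is to jointly learn motion and appearance codebooks and to apply multi-scale codebook compensation that refines both the facial motion conditions and the appearance features. Specifically, the paper designs a unified framework that learns multi-scale motion and appearance codebooks simultaneously, storing representative global facial motion flow and appearance patterns. It then introduces a transformer-based codebook retrieval strategy that queries complementary information from the two codebooks for joint motion and appearance compensation. The approach yields more flexible motion flows and appearance features with fewer distortions, resulting in high-quality talking head video generation.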

Link: https://arxiv.org/abs/2412.00719
Authors: Shuling Zhao, Fa-Ting Hong, Xiaoshui Huang, Dan Xu
Keywords (EN): Talking head video, head video generation, video generation aims, realistic talking head, head video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Talking head video generation aims to generate a realistic talking head video that preserves the person’s identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors.

[CV-178] Enhancing the Generalization Capability of Skin Lesion Classification Models with Active Domain Adaptation Methods

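[Quick read]: This paper addresses the generalization of skin lesion classification models across different datasets. The key to the solution is to combine three techniques: self-supervised learning (SSL), unsupervised domain adaptation (UDA), and active domain adaptation (ADA). The steps are: start from a model pretrained with SSL on natural image datasets, retrain it with SSL on all available skin lesion datasets, fine-tune it on labeled source-domain data, apply UDA methods on target-domain data, and finally apply ADA methods. The paper evaluates the approach on ten skin lesion datasets and reports that it improves the performance of skin lesion classification models, which may help medical imaging models gain wider adoption in clinical settings.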

Link: https://arxiv.org/abs/2412.00702
Authors: Jun Ye
Keywords (EN): unsupervised domain adaptation, combining self-supervised learning, active domain adaptation, skin lesion classification, skin lesion datasets
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, 1 table

Abstract:We propose a method to improve the generalization ability of skin lesion classification models by combining self-supervised learning (SSL), unsupervised domain adaptation (UDA), and active domain adaptation (ADA). The main steps of the approach include selection of a SSL pretrained model on natural image datasets, subsequent SSL retraining on all available skin lesion datasets, finetuning of the model on source domain data with labels, application of UDA methods on target domain data, and lastly, implementation of ADA methods. The efficacy of the proposed approach is assessed across ten skin lesion datasets of domains, demonstrating its potential for enhancing the performance of skin lesion classification models. This approach holds promise for facilitating the widespread adoption of medical imaging models in clinical settings, thereby amplifying their impact.

[CV-179] Intermediate Outputs Are More Sensitive Than You Think

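[Quick read]: This paper addresses the risk that intermediate-layer results of deep computer vision models expose private information when the models process sensitive data. Traditional privacy risk assessment techniques focus on protecting the overall model output and overlook vulnerabilities in intermediate representations. The key to the proposed solution is a new way of measuring privacy risk based on Degrees of Freedom (DoF) and the sensitivity of intermediate outputs, without relying on adversarial attack simulations. Specifically, the framework uses DoF to evaluate how much information each layer retains and combines this with the rank of the Jacobian matrix to assess sensitivity to input variations, which enables a systematic measurement of privacy risk at each layer. Experimental validation shows that the approach gives deeper insight into the privacy risks associated with intermediate representations.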

Link: https://arxiv.org/abs/2412.00696
Authors: Tao Huang, Qingyu Huang, Jiayang Meng
Keywords (EN): process sensitive data, significant privacy concerns, raised significant privacy, deep computer vision, increasing reliance
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:The increasing reliance on deep computer vision models that process sensitive data has raised significant privacy concerns, particularly regarding the exposure of intermediate results in hidden layers. While traditional privacy risk assessment techniques focus on protecting overall model outputs, they often overlook vulnerabilities within these intermediate representations. Current privacy risk assessment techniques typically rely on specific attack simulations to assess risk, which can be computationally expensive and incomplete. This paper introduces a novel approach to measuring privacy risks in deep computer vision models based on the Degrees of Freedom (DoF) and sensitivity of intermediate outputs, without requiring adversarial attack simulations. We propose a framework that leverages DoF to evaluate the amount of information retained in each layer and combines this with the rank of the Jacobian matrix to assess sensitivity to input variations. This dual analysis enables systematic measurement of privacy risks at various model layers. Our experimental validation on real-world datasets demonstrates the effectiveness of this approach in providing deeper insights into privacy risks associated with intermediate representations.
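
The sensitivity part of this analysis relies on the rank of a layer's Jacobian. The sketch below estimates that rank numerically for a toy layer, under the assumption that the layer is a plain Python function from a list of inputs to a list of outputs. `toy_layer`, the finite-difference step, and the rank tolerance are all illustrative choices, and the paper's DoF measure is not reproduced here because the abstract does not define it.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, as rows (outputs) by columns (inputs)."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(n_out):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def matrix_rank(rows, tol=1e-6):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def toy_layer(x):
    """Hypothetical layer: the first two outputs depend only on x[0] + x[1]."""
    s = x[0] + x[1]
    return [s, 2.0 * s, x[2]]
```

For `toy_layer`, three inputs collapse into two independent output directions, so the Jacobian rank is 2. A lower rank means that fewer independent input variations survive in the layer's output.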

[CV-180] BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Birds-Eye View

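[Quick read]: This paper addresses the failure of multi-target multi-camera (MTMC) detection and tracking in indoor environments to make full use of the 3D information in multi-view images. The key to the solution is a 3D object detection and tracking framework named BEV-SUSHI. The framework first aggregates multi-view images using camera calibration parameters to produce 3D object detections in bird's-eye view (BEV), and then uses hierarchical graph neural networks (GNNs) to track these 3D detections in BEV, yielding MTMC tracking results. Compared with existing methods, BEV-SUSHI generalizes well across different scenes and camera settings and handles long-term association effectively, reaching a new state of the art on both the AICity'24 and WildTrack datasets.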

Link: https://arxiv.org/abs/2412.00692
Authors: Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, Laura Leal-Taixé
Keywords (EN): object detection, retail stores, Object perception, intelligent systems, indoor environments
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly handling important 3D information by multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named BEV-SUSHI, which first aggregates multi-view images with necessary camera calibration parameters to obtain 3D object detections in bird’s-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, BEV-SUSHI has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed BEV-SUSHI establishes the new state-of-the-art on the AICity’24 dataset with 81.22 HOTA, and 95.6 IDF1 on the WildTrack dataset.
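
To illustrate why camera calibration matters for bird's-eye-view aggregation, the sketch below maps a pixel to ground-plane coordinates with a 3x3 homography and averages the positions reported by several cameras. This is a deliberately simplified stand-in, not the BEV-SUSHI detector: the homography model assumes the point lies on a flat ground plane, and the function names and the averaging step are assumptions made for illustration.

```python
def image_to_ground(h, u, v):
    """Map pixel (u, v) to ground-plane coordinates with a 3x3 homography `h`."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity on the ground plane")
    return x / w, y / w


def fuse_views(detections):
    """Average the ground positions reported by several cameras for one object.

    `detections` is a list of (homography, (u, v)) pairs, one per camera.
    """
    points = [image_to_ground(h, u, v) for h, (u, v) in detections]
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n
```

Once every camera's detections live in the same ground-plane frame, association across views becomes a matter of comparing positions rather than matching appearance between images.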

[CV-181] LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

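[Quick read]: This paper addresses the limitations of large vision-language models (LVLMs) on counting tasks, especially when the number of objects exceeds the range commonly seen in training data. The key to the solution is a divide-and-conquer strategy that breaks a counting problem into sub-counting tasks. Unlike prior methods, the approach performs well on new datasets without any additional training or fine-tuning, and it improves counting ability across a range of datasets and benchmarks.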

Link: https://arxiv.org/abs/2412.00686
Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis
Keywords (EN): real-life applications, fundamental skill, recognition and robust, robust counting capabilities, Counting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 27 Figures, 12 Tables

Abstract:Counting is a fundamental skill for various visual tasks in real-life applications, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) struggle with counting tasks, especially when the number of objects exceeds those commonly encountered during training. We enhance LVLMs’ counting abilities using a divide-and-conquer approach, breaking counting problems into sub-counting tasks. Unlike prior methods, which do not generalize well to counting datasets on which they have not been trained, our method performs well on new datasets without any additional training or fine-tuning. We demonstrate that our approach enhances counting capabilities across various datasets and benchmarks.
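
The divide-and-conquer idea can be sketched as follows: split the image into tiles, count within each tile, and add up the results. This is a naive illustration under stated assumptions. The fixed grid split and `toy_counter` (a stand-in for querying an LVLM on a crop) are hypothetical, and the abstract does not say how the authors partition images or avoid double-counting objects that straddle a boundary.

```python
def split_into_tiles(image, tile_h, tile_w):
    """Yield non-overlapping tiles of a 2D grid stored as a list of rows."""
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            yield [row[left:left + tile_w] for row in image[top:top + tile_h]]


def count_divide_and_conquer(image, count_fn, tile_h, tile_w):
    """Sum the counts that `count_fn` reports on each tile."""
    return sum(count_fn(tile) for tile in split_into_tiles(image, tile_h, tile_w))


def toy_counter(tile):
    """Hypothetical stand-in for an LVLM query: each 1 marks one object centre."""
    return sum(sum(row) for row in tile)
```

Each tile contains fewer objects than the whole image, which keeps every sub-count closer to the small numbers that a model has seen most often.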

[CV-182] Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding

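[Quick read]: This paper addresses how to learn visual grounding in data-scarce settings. The key to the solution is a new framework called POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the bounding box, which resolves a label misalignment issue encountered in previous work. POBF also uses a filtering scheme that combines a hardness score and an overfitting score, balanced by a penalty term, to identify the most effective training data. Experiments show that POBF performs strongly on four datasets, with an average accuracy improvement of 5.83%, and that it outperforms leading baselines by 2.29% to 3.85%.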

Link: https://arxiv.org/abs/2412.00684
Authors: Zilin Du, Haoxin Li, Jianfei Yu, Boyang Li
Keywords (EN): Visual grounding aims, image regions based, Visual grounding, learn visual grounding, textual query
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Visual grounding aims to localize the image regions based on a textual query. Given the difficulty of large-scale data curation, we investigate how to effectively learn visual grounding under data-scarce settings in this paper. To address data scarcity, we propose a novel framework, POBF (Paint Outside the Box, then Filter). POBF synthesizes images by inpainting outside the box, tackling a label misalignment issue encountered in previous works. Furthermore, POBF leverages an innovative filtering scheme to identify the most effective training data. This scheme combines a hardness score and an overfitting score, balanced by a penalty term. Experimental results show that POBF achieves superior performance across four datasets, delivering an average improvement of 5.83% and outperforming leading baselines by 2.29% to 3.85% in accuracy. Additionally, we validate the robustness and generalizability of POBF across various generative models, data ratios, and model architectures.

[CV-183] DMFourLLIE: Dual-Stage and Multi-Branch Fourier Network for Low-Light Image Enhancement

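[Quick read]: This paper addresses a weakness of existing Fourier-domain low-light image enhancement techniques: they focus mainly on the amplitude component and neglect the phase component, which leads to color distortion and noise in the enhanced image. The key to the solution is a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement framework (DMFourLLIE), which overcomes these limitations by emphasizing the role of the phase component in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component, and it uses a luminance-attention mechanism in the luminance-chrominance color space to control amplitude enhancement precisely. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization retains complex image information and overcomes the limitation of earlier methods that neglected the interplay between amplitude and phase.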

Link: https://arxiv.org/abs/2412.00683
Authors: Tongshun Zhang, Pingping Liu, Ming Zhao, Haotian Lv
Keywords (EN): Fourier frequency domain, phase component, low-light image enhancement, Fourier Low-Light Image, frequency domain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM Multimedia 2024

Abstract:In the Fourier frequency domain, luminance information is primarily encoded in the amplitude component, while spatial structure information is significantly contained within the phase component. Existing low-light image enhancement techniques using Fourier transform have mainly focused on amplifying the amplitude component and simply replicating the phase component, an approach that often leads to color distortions and noise issues. In this paper, we propose a Dual-Stage Multi-Branch Fourier Low-Light Image Enhancement (DMFourLLIE) framework to address these limitations by emphasizing the phase component’s role in preserving image structure and detail. The first stage integrates structural information from infrared images to enhance the phase component and employs a luminance-attention mechanism in the luminance-chrominance color space to precisely control amplitude enhancement. The second stage combines multi-scale and Fourier convolutional branches for robust image reconstruction, effectively recovering spatial structures and textures. This dual-branch joint optimization process ensures that complex image information is retained, overcoming the limitations of previous methods that neglected the interplay between amplitude and phase. Extensive experiments across multiple datasets demonstrate that DMFourLLIE outperforms current state-of-the-art methods in low-light image enhancement. Our code is available at this https URL.
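
The claim that amplitude carries brightness while phase carries structure can be checked on a small signal. The sketch below uses a plain 1D discrete Fourier transform, scales every amplitude by a constant gain, and leaves each phase untouched. The uniform gain is an assumption chosen to keep the example short; DMFourLLIE learns its amplitude enhancement and also refines the phase, and neither of those steps is reproduced here.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def scale_amplitude(signal, gain):
    """Multiply every Fourier amplitude by `gain` while keeping each phase."""
    enhanced = []
    for coeff in dft(signal):
        amplitude, phase = cmath.polar(coeff)
        enhanced.append(cmath.rect(amplitude * gain, phase))
    return idft(enhanced)
```

Because the transform is linear, a uniform amplitude gain brightens every sample by the same factor, and the relative shape of the signal, which is encoded in the phase, stays the same.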

[CV-184] FlashSLAM: Accelerated RGB-D SLAM for Real-Time 3D Scene Reconstruction with Gaussian Splatting

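[Quick read]: This paper addresses the poor performance of existing simultaneous localization and mapping (SLAM) methods based on 3D Gaussian Splatting in sparse-view settings and under large camera movements. The key to the solution is to combine 3D Gaussian Splatting with a fast vision-based camera tracking technique that uses a pretrained feature matching model and point cloud registration for precise pose estimation. Tracking completes in under 80 ms, a 90% reduction in tracking time compared with SplaTAM, and it requires no costly iterative rendering. The method also accounts for noise in depth sensors, which improves robustness on unspecialized devices such as smartphones.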

Link: https://arxiv.org/abs/2412.00682
Authors: Phu Pham, Damon Conover, Aniket Bera
Keywords (EN): Gaussian Splatting, Splatting for efficient, approach that leverages, efficient and robust, SLAM approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures, 13 tables

Abstract:We present FlashSLAM, a novel SLAM approach that leverages 3D Gaussian Splatting for efficient and robust 3D scene reconstruction. Existing 3DGS-based SLAM methods often fall short in sparse view settings and during large camera movements due to their reliance on gradient descent-based optimization, which is both slow and inaccurate. FlashSLAM addresses these limitations by combining 3DGS with a fast vision-based camera tracking technique, utilizing a pretrained feature matching model and point cloud registration for precise pose estimation in under 80 ms - a 90% reduction in tracking time compared to SplaTAM - without costly iterative rendering. In sparse settings, our method achieves up to a 92% improvement in average tracking accuracy over previous methods. Additionally, it accounts for noise in depth sensors, enhancing robustness when using unspecialized devices such as smartphones. Extensive experiments show that FlashSLAM performs reliably across both sparse and dense settings, in synthetic and real-world environments. Evaluations on benchmark datasets highlight its superior accuracy and efficiency, establishing FlashSLAM as a versatile and high-performance solution for SLAM, advancing the state-of-the-art in 3D reconstruction across diverse applications.
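
Pose estimation from matched points, as opposed to gradient descent through a renderer, has a closed-form solution once correspondences are known. The sketch below shows the 2D analogue of rigid point cloud registration. It is an illustrative simplification: FlashSLAM works with 3D RGB-D data, and its actual registration algorithm and feature matcher are not specified in the abstract. The function name is hypothetical.

```python
import math


def register_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping `src` points onto `dst`.

    Both arguments are lists of (x, y) tuples with known one-to-one
    correspondences. Returns (theta, (tx, ty)).
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centred source point
        bx, by = qx - dx, qy - dy  # centred target point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation aligns the rotated source centroid with the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)
```

A single pass over the matched points gives the pose, which is why this family of methods can be much faster than optimizing the pose iteratively against rendered images.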

[CV-185] MIMIC: Multimodal Islamophobic Meme Identification and Classification NEURIPS2024

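[Quick read]: This paper addresses the identification of anti-Muslim hate speech in memes. The key to the solution is a classifier based on the Vision-and-Language Transformer (ViLT), which integrates the visual and textual representations of a meme to capture the nuanced Islamophobic narratives specific to meme culture, providing both high detection accuracy and interoperability.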

Link: https://arxiv.org/abs/2412.00681
Authors: S M Jishanul Islam, Sahid Hossain Mustakim, Sadia Ahmmed, Md. Faiyaz Abdullah Sayeedi, Swapnil Khandoker, Syed Tasdid Azam Dhrubo, Nahid Hossain
Keywords (EN): seemingly mimic humor, convey Islamophobic sentiments, Anti-Muslim hate speech, characterized by context-dependent, identify anti-Muslim hate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted (Poster) - NeurIPS 2024 Workshop MusIML

Abstract:Anti-Muslim hate speech has emerged within memes, characterized by context-dependent and rhetorical messages using text and images that seemingly mimic humor but convey Islamophobic sentiments. This work presents a novel dataset and proposes a classifier based on the Vision-and-Language Transformer (ViLT) specifically tailored to identify anti-Muslim hate within memes by integrating both visual and textual representations. Our model leverages joint modal embeddings between meme images and incorporated text to capture nuanced Islamophobic narratives that are unique to meme culture, providing both high detection accuracy and interoperability.

[CV-186] 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

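[Quick read]: This paper addresses the quadratic complexity that Transformer models face when handling large 2D contexts such as Giga-Pixel Whole Slide Imaging (WSI) and remote sensing imagery. The key to the solution is 2DMamba, a novel 2D selective State Space Model (SSM) framework. 2DMamba incorporates the 2D spatial structure of images into the Mamba model and uses a highly optimized hardware-aware operator, combining spatial continuity with computational efficiency. The method improves performance on several tasks, including WSI classification and survival analysis, and it also performs well on semantic segmentation and classification of natural images.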

Link: https://arxiv.org/abs/2412.00678
Authors: Jingwei Zhang, Anh Tien Nguyen, Xi Han, Vincent Quoc-Huy Trinh, Hong Qin, Dimitris Samaras, Mahdi S. Hosseini
Keywords (EN): Efficiently modeling large, fields including Giga-Pixel, Giga-Pixel Whole Slide, Efficiently modeling, Slide Imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submission under review

Abstract: Efficiently modeling large 2D contexts is essential for various fields including Giga-Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer-based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but they suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, adopting both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves up to 2.48% in AUC, 3.11% in F1 score, 2.47% in accuracy and 5.52% in C-index. Additionally, integrating our method with VMamba for natural imaging yields 0.5 to 0.7 improvements in mIoU on the ADE20k semantic segmentation dataset, and 0.2% accuracy improvement on ImageNet-1K classification dataset. Our code is available at this https URL.

[CV-187] FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

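[Quick read]: This paper addresses the low efficiency, reduced accuracy, and lack of detail that Monocular Depth Estimation (MDE) suffers from when dealing with noisy real-world data and distribution gaps in synthetic datasets. The key to the solution is an efficient way of leveraging diffusion priors, together with the FiffDepth framework, which transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models such as dinov2, FiffDepth achieves higher accuracy, better stability, and finer detail, improving MDE performance across diverse real-world scenarios.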

Link: https://arxiv.org/abs/2412.00671
Authors: Yunpeng Bai, Qixing Huang
Keywords (EN): Monocular Depth Estimation, Monocular Depth, scene reconstruction, autonomous navigation, content creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures

Abstract:Monocular Depth Estimation (MDE) is essential for applications like 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust MDE remains challenging due to noisy real-world data and distribution gaps in synthetic datasets. Existing methods often struggle with low efficiency, reduced accuracy, and lack of detail. To address this, we propose an efficient approach for leveraging diffusion priors and introduce FiffDepth, a framework that transforms diffusion-based image generators into a feedforward architecture for detailed depth estimation. By preserving key generative features and integrating the strong generalization capabilities of models like dinov2, FiffDepth achieves enhanced accuracy, stability, and fine-grained detail, offering a significant improvement in MDE performance across diverse real-world scenarios.

[CV-188] Explaining Object Detectors via Collective Contribution of Pixels

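[Quick read]: This paper addresses the fact that existing explanation methods for object detectors consider only the contributions of individual pixels and ignore the collective contribution of multiple pixels. The key to the solution is to use game-theoretic concepts, specifically Shapley values and interactions, to generate explanations that account for collective pixel contributions. The explanations cover both bounding box generation and class determination, and they identify the important regions in detection results more accurately.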

Link: https://arxiv.org/abs/2412.00666
Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
Keywords (EN): Visual explanations, object detectors, enhancing their reliability, crucial for enhancing, Visual
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11+14 pages, 15 figures, 8 tables

Abstract:Visual explanations for object detectors are crucial for enhancing their reliability. Since object detectors identify and localize instances by assessing multiple features collectively, generating explanations that capture these collective contributions is critical. However, existing methods focus solely on individual pixel contributions, ignoring the collective contribution of multiple pixels. To address this, we proposed a method for object detectors that considers the collective contribution of multiple pixels. Our approach leverages game-theoretic concepts, specifically Shapley values and interactions, to provide explanations. These explanations cover both bounding box generation and class determination, considering both individual and collective pixel contributions. Extensive quantitative and qualitative experiments demonstrate that the proposed method more accurately identifies important regions in detection results compared to current state-of-the-art methods. The code will be publicly available soon.
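
Shapley values assign each player its average marginal contribution over all orderings, which is what lets them credit features that only matter jointly. The sketch below computes exact Shapley values for a tiny game in which the players are image patches. `toy_detector_score` is a hypothetical scoring function, and exact enumeration is only feasible for a handful of players; the approximation that the authors use for real images, and their treatment of interaction terms, are not described in the abstract.

```python
from itertools import permutations


def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_detector_score(coalition):
    """Hypothetical detection score for a set of visible patches.

    Patch 'c' helps on its own; patches 'a' and 'b' help only when both
    are visible, which models a collective contribution.
    """
    score = 0.0
    if "c" in coalition:
        score += 0.2
    if {"a", "b"} <= coalition:
        score += 0.6
    return score
```

Calling `shapley_values(["a", "b", "c"], toy_detector_score)` splits the joint 0.6 gain evenly between `a` and `b`, whereas revealing either patch alone would show no effect at all.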

[CV-189] Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection

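[Quick read]: This paper addresses the limited ability of existing methods to generalize when detecting images generated by unseen diffusion models. The key to the solution is a training method called Learning on Less (LoL), which uses a random masking mechanism to restrict the model from learning patterns unique to a particular diffusion model, so that it focuses on less image content. The method exploits the inherent strengths of pretrained weights while generalizing more stably, and it extracts a universal feature that distinguishes real images from images generated by different diffusion models. Experiments show that, with only 1% of the training data, LoL significantly outperforms the current state of the art, improving average accuracy by 13.6%.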

Link: https://arxiv.org/abs/2412.00665
Authors: Yingjian Chen, Lei Zhang, Yakun Niu, Lei Tan, Pei Chen
Keywords (EN): eroding public trust, Models enable realistic, Diffusion Models enable, realistic image generation, enable realistic image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust. Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods. To address this issue, we rethink the effectiveness of pre-trained models trained on large-scale, real-world images. Our findings indicate that: 1) Pre-trained models can cluster the features of real images effectively. 2) Models with pre-trained weights can approximate an optimal generalization solution at a specific training step, but it is extremely unstable. Based on these facts, we propose a simple yet effective training method called Learning on Less (LoL). LoL utilizes a random masking mechanism to constrain the model’s learning of the unique patterns specific to a certain type of diffusion model, allowing it to focus on less image content. This leverages the inherent strengths of pre-trained weights while enabling a more stable approach to optimal generalization, which results in the extraction of a universal feature that differentiates various diffusion-generated images from real images. Extensive experiments on the GenImage benchmark demonstrate the remarkable generalization capability of our proposed LoL. With just 1% training data, LoL significantly outperforms the current state-of-the-art, achieving a 13.6% improvement in average ACC across images generated by eight different models.
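
A random masking mechanism of the kind mentioned above can be sketched as keeping a random subset of image patches and zeroing the rest. The patch-wise layout, the `keep_ratio` parameter, and the function name are assumptions made for illustration, since the abstract states only that LoL masks content at random so that the model sees less of each image.

```python
import random


def random_patch_mask(image, patch, keep_ratio, rng=None):
    """Zero out all but a random `keep_ratio` fraction of patch x patch blocks.

    `image` is a 2D list of pixel values. At least one block is always kept.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    n_keep = max(1, round(keep_ratio * len(blocks)))
    kept = rng.sample(blocks, n_keep)
    masked = [[0] * w for _ in range(h)]
    for r, c in kept:
        for i in range(r, min(r + patch, h)):
            for j in range(c, min(c + patch, w)):
                masked[i][j] = image[i][j]
    return masked
```

Passing a seeded `random.Random` makes the mask reproducible, which helps when comparing training runs.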

[CV-190] Improving Decoupled Posterior Sampling for Inverse Problems using Data Consistency Constraint

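[Quick read]: This paper addresses the errors that diffusion models incur in the early steps when solving inverse problems. The key to the solution is Guided Decoupled Posterior Sampling (GDPS), which introduces a data consistency constraint into the reverse process so that optimization proceeds more smoothly and converges more effectively toward the target distribution. The method is also extended to latent diffusion models and Tweedie's formula, demonstrating its scalability. Experiments show that GDPS achieves state-of-the-art performance on a variety of linear and nonlinear tasks on the FFHQ and ImageNet datasets, improving accuracy over existing methods.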

Link: https://arxiv.org/abs/2412.00664
Authors: Zhi Qi,Shihong Yuan,Yuyin Yuan,Linling Kuang,Yoshiyuki Kabashima,Xiangming Meng
Keywords: Decoupled Posterior Sampling, posterior sampling, Guided Decoupled Posterior, Decoupled Posterior, solving inverse problems
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models have shown strong performances in solving inverse problems through posterior sampling while they suffer from errors during earlier steps. To mitigate this issue, several Decoupled Posterior Sampling methods have been recently proposed. However, the reverse process in these methods ignores measurement information, leading to errors that impede effective optimization in subsequent steps. To solve this problem, we propose Guided Decoupled Posterior Sampling (GDPS) by integrating a data consistency constraint in the reverse process. The constraint performs a smoother transition within the optimization process, facilitating a more effective convergence toward the target distribution. Furthermore, we extend our method to latent diffusion models and Tweedie’s formula, demonstrating its scalability. We evaluate GDPS on the FFHQ and ImageNet datasets across various linear and nonlinear tasks under both standard and challenging conditions. Experimental results demonstrate that GDPS achieves state-of-the-art performance, improving accuracy over existing methods.
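
To make the notion of a data consistency constraint concrete, here is a generic sketch of one gradient step that pulls an estimate `x` toward agreement with a measurement `y = A x`. This is a textbook least-squares step for a linear operator, shown under assumed names (`data_consistency_step`, `step`); it is not the GDPS update itself, which the abstract does not spell out.

```python
def data_consistency_step(x, y, A, step=0.1):
    """One gradient step on 0.5 * ||A x - y||^2 for a linear operator A.

    A is a list of rows; x and y are flat lists. Generic illustration only.
    """
    # Residual r = A x - y
    r = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    # Gradient g = A^T r
    g = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]
    # Move x against the gradient so that A x gets closer to y.
    return [xi - step * gi for xi, gi in zip(x, g)]
```

In a posterior sampling loop, a step like this would be interleaved with the denoising updates so that intermediate samples do not drift away from the measurement.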

[CV-191] Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics

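Quick read: This paper addresses the limitations and performance bottlenecks at the molecular level that arise because current multimodal pre-training models in computational pathology rely mainly on vision-language models. The key to the solution is the Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE), which uses complementary information from gene expression profiles to guide multimodal pre-training and strengthen the molecular awareness of pathology image representation learning. The framework trains a gene encoder on roughly 4 million spatial transcriptomics gene expression entries and, leveraging powerful pre-trained encoders, aligns the encoders across more than 697K pathology image-gene expression pairs. UMPIRE is evaluated on molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images, underscoring the effectiveness of multimodal data integration for molecularly informed computational pathology.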

Link: https://arxiv.org/abs/2412.00651
Authors: Minghao Han,Dingkang Yang,Jiabei Cheng,Xukun Zhang,Linhao Qu,Zizhi Chen,Lihua Zhang
Keywords: Recent advancements, significantly advanced computational, Unified Molecule-enhanced Pathology, significantly advanced, Pathology Image
Categories: Computer Vision and Pattern Recognition (cs.CV); Genomics (q-bio.GN)
Comments: 21 pages, 11 figures, 7 tables

Abstract:Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentation Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at this https URL.

[CV-192] Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis

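Quick read: This paper addresses the difficulty of customizing complex, expressive flow motions when designing stylized cinemagraphs. The key to the solution is Sketch2Cinemagraph, a sketch-guided framework that uses freehand sketches to convey personalized design requirements, giving intuitive and detailed control over the generated cinemagraphs. It uses text prompts for initial content generation and hand-drawn sketches for spatial and motion cues. Its core components are a latent diffusion model that generates the target stylized landscape images and a pre-trained object detection model that produces masks for the flow regions. The paper also proposes a novel latent motion diffusion model that estimates the motion field in the fluid regions of the generated landscape image, with the input motion sketches controlling the generated vector field. Finally, a frame generator warps the pixels in the fluid regions to their target locations at each timestep to synthesize the cinemagraph frames. Experiments show that Sketch2Cinemagraph generates high-fidelity, aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs.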

Link: https://arxiv.org/abs/2412.00638
Authors: Hao Jin,Hengyuan Chang,Xiaoxuan Xie,Zhengyang Wang,Xusheng Du,Shaojun Hu,Haoran Xie
Keywords: Designing stylized cinemagraphs, Designing stylized, stylized cinemagraphs, challenging due, difficulty in customizing
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 14 pages, 20 figures

Abstract:Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow motions. To achieve intuitive and detailed control of the generated cinemagraphs, freehand sketches can provide a better solution to convey personalized design requirements than only text inputs. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial content generation and provides hand-drawn sketch controls for both spatial and motion cues. The latent diffusion model is adopted to generate target stylized landscape images along with realistic versions. Then, a pre-trained object detection model is utilized to segment and obtain masks for the flow regions. We propose a novel latent motion diffusion model to estimate the motion field in the fluid regions of the generated landscape images. The input motion sketches serve as the conditions to control the generated vector fields in the masked fluid regions with the prompt. To synthesize the cinemagraph frames, the pixels within fluid regions are subsequently warped to the target locations for each timestep using a frame generator. The results verify that Sketch2Cinemagraph can generate high-fidelity and aesthetically appealing stylized cinemagraphs with continuous temporal flow from intuitive sketch inputs. We showcase the advantages of Sketch2Cinemagraph through quantitative comparisons against the state-of-the-art generation approaches.

[CV-193] MambaNUT: Nighttime UAV Tracking via Mamba and Adaptive Curriculum Learning

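Quick read: This paper addresses three problems in nighttime UAV tracking: over-reliance on image enhancement, scarcity of high-quality nighttime data, and neglect of the relationship between daytime and nighttime trackers. The key to the solution is MambaNUT, a novel Mamba-based tracking framework that uses a state space model with linear complexity as its backbone and a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. The paper also introduces adaptive curriculum learning (ACL), which dynamically adjusts sampling strategies and loss weights to improve generalization. ACL has two levels of curriculum schedulers: (1) a sampling scheduler that shifts the data distribution from imbalanced to balanced and from easier (daytime) to harder (nighttime) samples; (2) a loss scheduler that dynamically assigns weights based on data frequency and IoU. Experiments show that MambaNUT achieves state-of-the-art performance on multiple nighttime UAV tracking benchmarks at lower computational cost.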

Link: https://arxiv.org/abs/2412.00626
Authors: You Wu,Xiangyang Yang,Xucheng Wang,Hengzhou Ye,Dan Zeng,Shuiwang Li
Keywords: Harnessing low-light enhancement, made substantial strides, Harnessing low-light, domain adaptation, substantial strides
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Harnessing low-light enhancement and domain adaptation, nighttime UAV tracking has made substantial strides. However, over-reliance on image enhancement, scarcity of high-quality nighttime data, and neglect of the relationship between daytime and nighttime trackers hinder the development of an end-to-end trainable framework. Moreover, current CNN-based trackers have limited receptive fields, leading to suboptimal performance, while ViT-based trackers demand heavy computational resources due to their reliance on the self-attention mechanism. In this paper, we propose a novel pure Mamba-based tracking framework (MambaNUT) that employs a state space model with linear complexity as its backbone, incorporating a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. We introduce an adaptive curriculum learning (ACL) approach that dynamically adjusts sampling strategies and loss weights, thereby improving the model’s generalization ability. Our ACL is composed of two levels of curriculum schedulers: (1) a sampling scheduler that transforms the data distribution from imbalanced to balanced, as well as from easier (daytime) to harder (nighttime) samples; (2) a loss scheduler that dynamically assigns weights based on data frequency and the IoU. Exhaustive experiments on multiple nighttime UAV tracking benchmarks demonstrate that the proposed MambaNUT achieves state-of-the-art performance while requiring lower computational costs. The code will be available.
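
The sampling scheduler described above (imbalanced to balanced, daytime to nighttime) can be pictured with a small sketch. The linear schedule, the start and end probabilities, and the name `night_sampling_prob` are all assumptions for illustration; the abstract does not give the actual schedule used by ACL.

```python
def night_sampling_prob(epoch, total_epochs, start_prob=0.1, end_prob=0.5):
    """Probability of drawing a nighttime sample at a given epoch.

    Hypothetical linear curriculum: starts with mostly daytime (easier) data
    and ends at a balanced 50/50 mix. Values are illustrative, not from the paper.
    """
    # Training progress in [0, 1]; guard against a single-epoch run.
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start_prob + t * (end_prob - start_prob)
```

A data loader could call this once per epoch and draw each sample from the nighttime pool with the returned probability, and from the daytime pool otherwise.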

[CV-194] VideoSAVi: Self-Aligned Video Language Models without Human Supervision

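Quick read: This paper addresses the limited diversity of instruction-tuning datasets for vision-language models (VLMs) on video understanding tasks, which stems from high annotation costs and the complexity of temporal information in videos. The key to the solution is VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that generates training data without extensive manual annotation in three stages: generating diverse video-specific questions, producing multiple candidate answers, and evaluating how well those answers align with the video content. The self-generated data is then used for direct preference optimization (DPO), letting the model refine its own outputs and improve alignment with the video content. Experiments show that even smaller models (0.5B and 7B parameters) can use this self-training approach effectively, outperforming previous methods and achieving significant gains on multiple benchmarks.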

Link: https://arxiv.org/abs/2412.00624
Authors: Yogesh Kulkarni,Pooyan Fazli
Keywords: Recent advances, significantly enhanced video, advances in vision-language, significantly enhanced, video understanding tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in vision-language models (VLMs) have significantly enhanced video understanding tasks. Instruction tuning (i.e., fine-tuning models on datasets of instructions paired with desired outputs) has been key to improving model performance. However, creating diverse instruction-tuning datasets is challenging due to high annotation costs and the complexity of capturing temporal information in videos. Existing approaches often rely on large language models to generate instruction-output pairs, which can limit diversity and lead to responses that lack grounding in the video content. To address this, we propose VideoSAVi (Self-Aligned Video Language Model), a novel self-training pipeline that enables VLMs to generate their own training data without extensive manual annotation. The process involves three stages: (1) generating diverse video-specific questions, (2) producing multiple candidate answers, and (3) evaluating these responses for alignment with the video content. This self-generated data is then used for direct preference optimization (DPO), allowing the model to refine its own high-quality outputs and improve alignment with video content. Our experiments demonstrate that even smaller models (0.5B and 7B parameters) can effectively use this self-training approach, outperforming previous methods and achieving results comparable to those trained on proprietary preference data. VideoSAVi shows significant improvements across multiple benchmarks: up to 28% on multi-choice QA, 8% on zero-shot open-ended QA, and 12% on temporal reasoning benchmarks. These results demonstrate the effectiveness of our self-training approach in enhancing video understanding while reducing dependence on proprietary models.
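
For readers unfamiliar with direct preference optimization, the standard per-pair DPO loss can be sketched as follows. The function and argument names are illustrative, and the default `beta` is an assumption; how VideoSAVi builds its preference pairs and sets its hyperparameters is not stated in the abstract.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a frozen
    reference model. Returns -log(sigmoid(beta * margin)).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy raises the chosen answer's likelihood relative to the rejected one.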

[CV-195] A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

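Quick read: This paper addresses the generation of three-dimensional structures from a single 2D image, a problem that is ill-posed because the 2D-to-3D mapping is inherently ambiguous. The key to the solution is SplatDiffusion, a diffusion model designed for Gaussian Splats. Unlike existing methods that rely on deterministic feed-forward predictions, SplatDiffusion uses a novel training strategy that decouples the denoised modality from the supervision modality, which overcomes the scarcity of 3D data. Specifically, a deterministic model serves as a noisy teacher that creates the noised signal, and training moves from single-step to multi-step denoising supervised by an image rendering loss, which significantly improves performance. The method is also flexible: it can learn from various 3D Gaussian Splat teacher models with minimal adaptation. A guidance mechanism that aggregates multi-view information further improves reconstruction quality. Experiments show that the framework performs well on both object-level and scene-level datasets.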

Link: https://arxiv.org/abs/2412.00623
Authors: Chensheng Peng,Ido Sobol,Masayoshi Tomizuka,Kurt Keutzer,Chenfeng Xu,Or Litany
Keywords: Gaussian Splats, including Gaussian splats, addressing the ill-posed, nature of lifting, enable generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images, addressing the ill-posed nature of lifting 2D inputs to 3D. Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data. Diffusion models have recently shown promise as powerful generative models for 3D data, including Gaussian splats; however, standard diffusion frameworks typically require the target signal and denoised signal to be in the same modality, which is challenging given the scarcity of 3D data. To overcome this, we propose a novel training strategy that decouples the denoised modality from the supervision modality. By using a deterministic model as a noisy teacher to create the noised signal and transitioning from single-step to multi-step denoising supervised by an image rendering loss, our approach significantly enhances performance compared to the deterministic teacher. Additionally, our method is flexible, as it can learn from various 3D Gaussian Splat (3DGS) teachers with minimal adaptation; we demonstrate this by surpassing the performance of two different deterministic models as teachers, highlighting the potential generalizability of our framework. Our approach further incorporates a guidance mechanism to aggregate information from multiple views, enhancing reconstruction quality when more than one view is available. Experimental results on object-level and scene-level datasets demonstrate the effectiveness of our framework.

[CV-196] Visual Modality Prompt for Adapting Vision-Language Object Detectors

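Quick read: This paper addresses the drop in zero-shot performance of object detectors on other visual modalities such as infrared and depth. The key to the solution is ModPrompt, a visual prompt strategy that generates visual prompts with an encoder-decoder structure and combines them with inference-friendly task residuals, adapting vision-language detectors to new visual modalities without harming their zero-shot capability. Experiments on the YOLO-World and Grounding DINO detectors show that, on infrared and depth datasets, the method achieves modality adaptation performance close to full fine-tuning while preserving the model's zero-shot capability.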

Link: https://arxiv.org/abs/2412.00622
Authors: Heitor R. Medeiros,Atif Belal,Srikanth Muralidharan,Eric Granger,Marco Pedersoli
Keywords: object detectors degrades, Grounding DINO, degrades when tested, detectors, vision-language detectors
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches tend to compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of inference-friendly task residuals, facilitating more robust adaptation. Empirically, we benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) data, achieving performance comparable to full fine-tuning while preserving the model’s zero-shot capability. Our code is available at: this https URL

[CV-197] PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

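Quick read: This paper addresses the shortcomings of current text-to-video (T2V) generation models in adhering to real-world common knowledge and physical rules, in particular their limited grasp of physical realism and temporal modeling. The key to the solution is PhyT2V, a new data-independent T2V technique that enables chain-of-thought and step-back reasoning in T2V prompting, extending the video generation capability of existing T2V models to out-of-distribution domains. Experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x and achieves a 35% improvement over T2V prompt enhancers.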

Link: https://arxiv.org/abs/2412.00596
Authors: Qiyao Xue,Xiangyu Yin,Boyuan Yang,Wei Gao
Keywords: transformer-based diffusion models, models lack capabilities, real-world common knowledge, temporal modeling, recently enabled
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages

Abstract:Text-to-video (T2V) generation has been recently enabled by transformer-based diffusion models, but current T2V models lack capabilities in adhering to the real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficiency in temporal modeling. Existing solutions are either data-driven or require extra model inputs, but cannot be generalizable to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands the current T2V model’s capability of video generation to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models’ adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers. The source codes are available at: this https URL.

[CV-198] Generative LiDAR Editing with Controllable Novel Object Layouts ICRA

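Quick read: This paper addresses how to edit real-world LiDAR scans to produce scenes with novel object layouts while preserving a realistic background environment. The key to the solution is a framework that supports object removal and insertion through generative background inpainting and object point cloud completion, with the entire pipeline built on spherical voxelization so that the LiDAR projective geometry is correct by construction. The approach keeps the generated data relevant to the specific environment and provides labels for it, which helps in developing and evaluating algorithms for LiDAR-based self-driving systems in real-world scenarios.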

Link: https://arxiv.org/abs/2412.00592
Authors: Shing-Hei Ho,Bao Thach,Minghan Zhu
Keywords: edit real-world Lidar, background environment, realistic background environment, real-world Lidar scans, Lidar scans
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to IEEE International Conference on Robotics and Automation (ICRA). 6 pages, 7 figures

Abstract:We propose a framework to edit real-world Lidar scans with novel object layouts while preserving a realistic background environment. Compared to the synthetic data generation frameworks where Lidar point clouds are generated from scratch, our framework focuses on new scenario generation in a given background environment, and our method also provides labels for the generated data. This approach ensures the generated data remains relevant to the specific environment, aiding both the development and the evaluation of algorithms in real-world scenarios. Compared with novel view synthesis, our framework allows the creation of counterfactual scenarios with significant changes in the object layout and does not rely on multi-frame optimization. In our framework, the object removal and insertion are supported by generative background inpainting and object point cloud completion, and the entire pipeline is built upon spherical voxelization, which realizes the correct Lidar projective geometry by construction. Experiments show that our framework generates realistic Lidar scans with object layout changes and benefits the development of Lidar-based self-driving systems.

[CV-199] Continuous Concepts Removal in Text-to-image Diffusion Models

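Quick read: This paper addresses the degradation of alignment between text prompts and generated images when specific concepts are removed from text-to-image diffusion models continuously. The key to the solution is CCRT, a novel approach built on a designed knowledge distillation paradigm that constrains text-image alignment behavior during continuous concept removal, using a set of text prompts generated by a genetic algorithm with a designed fuzzing strategy. Experiments show that CCRT effectively removes the targeted concepts while maintaining the model's high generation quality (e.g., text-image alignment).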

Link: https://arxiv.org/abs/2412.00580
Authors: Tingxu Han,Weisong Sun,Yanrong Hu,Chunrong Fang,Yonglong Zhang,Shiqing Ma,Tao Zheng,Zhenyu Chen,Zhenting Wang
Keywords: input textual descriptions, generate high-quality images, textual descriptions, shown an impressive, impressive ability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image diffusion models have shown an impressive ability to generate high-quality images from input textual descriptions. However, concerns have been raised about the potential for these models to create content that infringes on copyrights or depicts disturbing subject matter. Removing specific concepts from these models is a promising potential solution to this problem. However, existing methods for concept removal do not work well in practical but challenging scenarios where concepts need to be continuously removed. Specifically, these methods lead to poor alignment between the text prompts and the generated image after the continuous removal process. To address this issue, we propose a novel approach called CCRT that includes a designed knowledge distillation paradigm. It constrains the text-image alignment behavior during the continuous concept removal process by using a set of text prompts generated through our genetic algorithm, which employs a designed fuzzing strategy. We conduct extensive experiments involving the removal of various concepts. The results evaluated through both algorithmic metrics and human studies demonstrate that our CCRT can effectively remove the targeted concepts in a continuous manner while maintaining the high generation quality (e.g., text-image alignment) of the model.

[CV-200] Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives

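Quick read: This paper addresses the rendering speed and model size bottlenecks of 3D Gaussian Splatting (3D-GS) for real-time novel view rendering, especially in resource-constrained settings. The solution has two key parts. First, the rendering pipeline is optimized to localize the Gaussians in the scene precisely, which raises rendering speed without altering visual fidelity. Second, a new pruning technique is integrated into the training pipeline, which significantly reduces model size and training time while further raising rendering speed. With these improvements, Speedy-Splat accelerates average rendering speed by 6.71× on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets while using 10.6× fewer primitives than 3D-GS.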

Link: https://arxiv.org/abs/2412.00578
Authors: Alex Hanson,Allen Tu,Geng Lin,Vasu Singla,Matthias Zwicker,Tom Goldstein
Keywords: parametric point clouds, Gaussian Splatting, enables real-time rendering, clouds of differentiable, rendering speed
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:3D Gaussian Splatting (3D-GS) is a recent 3D scene reconstruction technique that enables real-time rendering of novel views by modeling scenes as parametric point clouds of differentiable 3D Gaussians. However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS, achieving substantial improvements in rendering speed, model size, and training time. First, we optimize the rendering pipeline to precisely localize Gaussians in the scene, boosting rendering speed without altering visual fidelity. Second, we introduce a novel pruning technique and integrate it into the training pipeline, significantly reducing model size and training time while further raising rendering speed. Our Speedy-Splat approach combines these techniques to accelerate average rendering speed by a drastic 6.71× across scenes from the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets with 10.6× fewer primitives than 3D-GS.

[CV-201] Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion

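Quick read: This paper addresses blind inverse problems, in which both the target data and the forward operator are unknown and which are crucial to many computer vision applications. The key to the solution is LADiBI, a training-free framework that uses large-scale text-to-image diffusion models and natural language prompts to jointly model priors for both the target image and the operator, allowing flexible adaptation to a variety of tasks under minimal assumptions. The paper also proposes a new posterior sampling approach that combines effective operator initialization with iterative refinement, so that LADiBI can operate without a predefined operator form. Experiments demonstrate its effectiveness on a broad range of image restoration tasks, including both linear and nonlinear problems.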

Link: https://arxiv.org/abs/2412.00557
Authors: Michail Dontas,Yutong He,Naoki Murata,Yuki Mitsufuji,J. Zico Kolter,Ruslan Salakhutdinov
Keywords: computer vision applications, Blind inverse problems, vision applications, Blind inverse, data and forward
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.

[CV-202] Accelerating Multimodel Large Language Models by Searching Optimal Vision Token Reduction

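Quick read: This paper addresses the huge computational cost that multimodal large language models (MLLMs) incur on high-resolution images, where the number of vision tokens grows quadratically with image resolution. It considers two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance within a given budget. The starting observation is that the ranking of vision tokens by attention score is similar in every layer except the first, which motivates the assumption that the number of essential vision tokens does not increase with depth. Based on this, the paper designs two strategies: (I) a greedy search algorithm (G-Search) that reduces the number of vision tokens layer by layer, from shallow to deep, to reach an optimal reduction strategy; and (II) a parametric sigmoid function (P-Sigmoid), with parameters tuned by Bayesian Optimization, that guides vision token reduction at each layer to improve performance within a limited budget. Experiments show that the method significantly accelerates MLLMs such as LLaVA and InternVL2 without performance drops, and outperforms other token reduction methods when budgets are limited.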

Link: https://arxiv.org/abs/2412.00556
Authors: Shiyu Zhao,Zhenting Wang,Felix Juefei-Xu,Xide Xia,Miao Liu,Xiaofang Wang,Mingfu Liang,Ning Zhang,Dimitris N. Metaxas,Licheng Yu
Keywords: Multimodal Large Language, Prevailing Multimodal Large, Large Language Models, Large Language, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report, 18 pages

Abstract:Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, similar to how Large Language Models (LLMs) process the text tokens. However, the number of vision tokens increases quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance under a given budget. We start with our main finding that the ranking of each vision token sorted by attention scores is similar in each layer except the first layer. Based on it, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at each layer from the shallow to the deep. Interestingly, G-Search is able to reach the optimal reduction strategy based on our assumption. For Scenario II, based on the reduction strategy from G-Search, we design a parametric sigmoid function (P-Sigmoid) to guide the reduction at each layer of the MLLM, whose parameters are optimized by Bayesian Optimization. Extensive experiments demonstrate that our approach can significantly accelerate popular MLLMs, such as LLaVA and InternVL2, by more than 2× without performance drops. Our approach also far outperforms other token reduction methods when budgets are limited, achieving a better trade-off between efficiency and effectiveness.
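
The layer-by-layer greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions: `is_acceptable` is a hypothetical oracle that reports whether a per-layer token budget preserves performance, and the linear scan over candidates is a simplification. The paper's G-Search may differ in both respects.

```python
def greedy_token_search(num_layers, num_tokens, is_acceptable):
    """Find per-layer token counts, shallow to deep, that stay acceptable.

    `is_acceptable(keep)` is a hypothetical callback taking a list of
    per-layer token counts and returning True if performance is preserved.
    """
    keep = [num_tokens] * num_layers
    for layer in range(num_layers):
        best = keep[layer]
        for candidate in range(keep[layer] - 1, 0, -1):
            # Tokens dropped at this layer are unavailable to deeper layers,
            # so the trial applies the candidate count from here onward.
            trial = keep[:layer] + [candidate] * (num_layers - layer)
            if not is_acceptable(trial):
                break
            best = candidate
        keep[layer:] = [best] * (num_layers - layer)
    return keep
```

Because each layer inherits the budget fixed by the layers before it, the resulting counts are non-increasing with depth, which matches the assumption stated in the abstract.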

[CV-203] Motion Dreamer: Realizing Physically Coherent Video Generation through Scene-Aware Motion Reasoning

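Quick read: This paper addresses the lack of logical and physical coherence in videos produced by existing video generation models (world models). The key to the solution is Motion Dreamer, a two-stage video generation framework. In the first stage, the model generates an intermediate motion representation (such as a segmentation map or depth map) from the input image and motion conditions, focusing on the motion itself. In the second stage, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, the method produces more accurate and physically plausible motion.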

Link: https://arxiv.org/abs/2412.00547
Authors: Tianshuo Xu,Zhifei Chen,Leyi Wu,Hao Lu,Yuying Chen,Lihui Jiang,Bingbing Liu,Yingcong Chen
Keywords: Recent numerous video, Recent numerous, motion, demonstrated the ability, numerous video generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Numerous recent video generation models, also known as world models, have demonstrated the ability to generate plausible real-world videos. However, many studies have shown that these models often produce motion results lacking logical or physical coherence. In this paper, we revisit video generation models and find that single-stage approaches struggle to produce high-quality results while maintaining coherent motion reasoning. To address this issue, we propose Motion Dreamer, a two-stage video generation framework. In Stage I, the model generates an intermediate motion representation, such as a segmentation map or depth map, based on the input image and motion conditions, focusing solely on the motion itself. In Stage II, the model uses this intermediate motion representation as a condition to generate a high-detail video. By decoupling motion reasoning from high-fidelity video synthesis, our approach allows for more accurate and physically plausible motion generation. We validate the effectiveness of our approach on the Physion dataset and in autonomous driving scenarios. For example, given a single push, our model can synthesize the sequential toppling of a set of dominoes. Similarly, by varying the movements of ego-cars, our model can produce different effects on other vehicles. Our work opens new avenues in creating models that can reason about physical interactions in a more coherent and realistic manner.

[CV-204] RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification NEURIPS2024

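Quick read: This paper addresses the lack of a comprehensive benchmark for rocket body classification. The key to the solution is RoBo6, a light-curve-based dataset for rocket body classification. Derived from the Mini Mega Tortora database, it contains light curves for six rocket body classes (CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, Delta 4) and handles data inconsistencies through resampling, normalization, and filtering. The paper evaluates several machine learning models, including convolutional neural networks (CNNs) and transformer-based models, with Astroconformer performing best, and provides a standardized benchmark for future rocket body classification work.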

Link: https://arxiv.org/abs/2412.00544
Authors: Daniel Kyselica,Marek Šuppa,Jiří Šilha,Roman Ďurikovič
Keywords: Space debris presents, standardized identification methods, future space missions, rocket body classification, space missions
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 6 pages, 1 figure, 5 tables, Accepted at the Machine Learning and the Physical Sciences Workshop, NeurIPS 2024

Abstract:Space debris presents a critical challenge for the sustainability of future space missions, emphasizing the need for robust and standardized identification methods. However, a comprehensive benchmark for rocket body classification remains absent. This paper addresses this gap by introducing the RoBo6 dataset for rocket body classification based on light curves. The dataset, derived from the Mini Mega Tortora database, includes light curves for six rocket body classes: CZ-3B, Atlas 5 Centaur, Falcon 9, H-2A, Ariane 5, and Delta 4. With 5,676 training and 1,404 test samples, it addresses data inconsistencies using resampling, normalization, and filtering techniques. Several machine learning models were evaluated, including CNN and transformer-based approaches, with Astroconformer reporting the best performance. The dataset establishes a common benchmark for future comparisons and advancements in rocket body classification tasks.
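
The resampling and normalization steps mentioned above can be pictured with a minimal sketch for a single light curve. Linear interpolation onto an even grid and min-max scaling are assumptions chosen for illustration, as is the name `resample_and_normalize`; the abstract does not say which resampling or normalization scheme RoBo6 uses.

```python
def resample_and_normalize(times, values, n_points):
    """Resample an irregular light curve to n_points and scale it to [0, 1].

    Assumes `times` is strictly increasing with at least two samples and
    n_points >= 2. Illustrative preprocessing, not the dataset's pipeline.
    """
    t0, t1 = times[0], times[-1]
    grid = [t0 + (t1 - t0) * i / (n_points - 1) for i in range(n_points)]
    resampled, j = [], 0
    for t in grid:
        # Advance to the segment [times[j], times[j + 1]] containing t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        w = (t - times[j]) / (times[j + 1] - times[j])
        resampled.append(values[j] + w * (values[j + 1] - values[j]))
    lo, hi = min(resampled), max(resampled)
    scale = (hi - lo) or 1.0  # avoid division by zero for a flat curve
    return [(v - lo) / scale for v in resampled]
```

Fixed-length, unit-range inputs of this kind are what CNN and transformer classifiers typically expect, which is presumably why such preprocessing is part of the benchmark.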

[CV-205] Rethinking Generalizability and Discriminability of Self-Supervised Learning from Evolutionary Game Theory Perspective

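Quick read: This paper addresses the mutual-exclusion relationship between generalizability and discriminability that is common in self-supervised learning. The key to the solution is a new self-supervised learning method that combines the strengths of evolutionary game theory (EGT) and reinforcement learning. The method uses EGT's dynamic system modeling to analyze and optimize the trade-off point between generalizability and discriminability, and uses reinforcement learning to optimize the model sequentially during pre-training, so that both properties keep improving for specific target domains. Theoretical analysis shows that the method tightens the upper bound on the generalization error of self-supervised learning, and it achieves state-of-the-art performance on multiple benchmarks.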

Link: https://arxiv.org/abs/2412.00542
Authors: Jiangmeng Li,Zehua Zang,Qirui Ji,Chuxiong Sun,Wenwen Qiang,Junge Zhang,Changwen Zheng,Fuchun Sun,Hui Xiong
Keywords: self-supervised learning, self-supervised, learning, approaches are generally, generally considered
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCV, 2024

Abstract:Representations learned by self-supervised approaches are generally considered to possess sufficient generalizability and discriminability. However, we disclose a nontrivial mutual-exclusion relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or discriminability but not both simultaneously. Thus, learning representations jointly possessing strong generalizability and discriminability presents a specific challenge for self-supervised learning. To this end, we revisit the learning paradigm of self-supervised learning from the perspective of evolutionary game theory (EGT) and outline the theoretical roadmap to achieve a desired trade-off between these representation properties. EGT performs well in analyzing the trade-off point in a two-player game by utilizing dynamic system modeling. However, the EGT analysis requires sufficient annotated data, which contradicts the principle of self-supervised learning, i.e., the EGT analysis cannot be conducted without the annotations of the specific target domain for self-supervised learning. Thus, to enhance the methodological generalization, we propose a novel self-supervised learning method that leverages advancements in reinforcement learning to jointly benefit from the general guidance of EGT and sequentially optimize the model to chase the consistent improvement of generalizability and discriminability for specific target domains during pre-training. Theoretically, we establish that the proposed method tightens the generalization error upper bound of self-supervised learning. Empirically, our method achieves state-of-the-art performance on various benchmarks.

[CV-206] Human Action CLIPS: Detecting AI-generated Human Motion

Quick read: This paper addresses how to reliably distinguish real human-motion video from AI-generated video. The key is a multi-modal semantic embedding, which withstands the kinds of laundering that typically defeat low- to mid-level approaches and so makes the distinction more robust. The method is evaluated on a custom-built dataset that pairs clips generated by seven text-to-video AI models with matching real footage.

Link: https://arxiv.org/abs/2412.00526
Authors: Matyas Bohacek,Hany Farid
Keywords: Full-blown AI-generated video, video generation continues, Full-blown AI-generated, indistinguishable from reality, generation continues
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Full-blown AI-generated video generation continues its journey through the uncanny valley to produce content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound more low- to mid-level approaches. This method is evaluated against a custom-built dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.

[CV-207] Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects

Quick read: This paper addresses the long runtimes of 3D shape editing, in particular the time required when editing relies on an SDS type of optimization. The key is to cast 3D editing as a multiview image inpainting problem and to use existing Large Reconstruction Models to map the inpainted images back to any 3D representation (meshes, NeRFs, or Gaussian Splats). By designing different fine-tuning strategies and inpainting masks, the paper edits 3D shapes in approximately 3 seconds, taking editing time from hours to seconds while producing higher-quality results than previous work.

Link: https://arxiv.org/abs/2412.00518
Authors: Amir Barda,Matheus Gadelha,Vladimir G. Kim,Noam Aigerman,Amit H. Bermano,Thibault Groueix
Keywords: Gaussian Splats, running an SDS, Large Reconstruction Models, represented as meshes, SDS type
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page: this https URL

Abstract:We propose a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor of training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.

[CV-208] Good Cheap and Fast: Overfitted Image Compression with Wasserstein Distortion CVPR2025

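Quick read: This paper addresses the high computational complexity that generative image compression models incur in pursuit of high image quality. The key is to optimize the C3 image codec to model visual perception rather than the data distribution, which keeps visual quality and bit rate similar to generative compression models such as HiFiC while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. Using Wasserstein Distortion (WD) as the optimization objective together with a human rater study, the paper shows that WD predicts human ratings better than other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, with a Pearson correlation above 94% against Elo scores.

Link: https://arxiv.org/abs/2412.00505
Authors: Jona Ballé,Luca Versari,Emilien Dupont,Hyunjik Kim,Matthias Bauer
Keywords: compression increasingly focuses, recent work, leading to excellent, natural image distribution, work on learned
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 13 pages, 9 figures. Submitted to CVPR 2025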

Abstract:Inspired by the success of generative image models, recent work on learned image compression increasingly focuses on better probabilistic models of the natural image distribution, leading to excellent image quality. This, however, comes at the expense of a computational complexity that is several orders of magnitude higher than today’s commercial codecs, and thus prohibitive for most practical applications. With this paper, we demonstrate that by focusing on modeling visual perception rather than the data distribution, we can achieve a very good trade-off between visual quality and bit rate similar to “generative” compression models such as HiFiC, while requiring less than 1% of the multiply-accumulate operations (MACs) for decompression. We do this by optimizing C3, an overfitted image codec, for Wasserstein Distortion (WD), and evaluating the image reconstructions with a human rater study. The study also reveals that WD outperforms other perceptual quality metrics such as LPIPS, DISTS, and MS-SSIM, both as an optimization objective and as a predictor of human ratings, achieving over 94% Pearson correlation with Elo scores.
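
The abstract reports a Pearson correlation between a quality metric and Elo scores from the rater study. The sketch below shows how that statistic is computed. The numbers are hypothetical and are not data from the paper. It also assumes that the metric is a distortion, where lower is better, so a strong relationship shows up as a correlation near -1 and its strength is the magnitude of r.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only; not data from the paper.
metric_scores = [0.12, 0.30, 0.45, 0.61, 0.80]         # distortion per codec setting
elo_scores = [1620.0, 1540.0, 1495.0, 1410.0, 1330.0]  # human preference rating
r = pearson(metric_scores, elo_scores)
```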

[CV-209] Density-aware Global-Local Attention Network for Point Cloud Segmentation

Link: https://arxiv.org/abs/2412.00489
Authors: Chade Li,Pengju Zhang,Yihong Wu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

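[CV-210] LineGS: 3D Line Segment Representation on 3D Gaussian Splatting

Quick read: This paper addresses the instability and noise that existing 3D reconstruction methods face when handling line structures. The key is LineGS, a method that combines geometry-guided 3D line reconstruction with a 3D Gaussian splatting model. By using Gaussian point densities along scene edges, LineGS refines the initial line segments so that they align more closely with the scene's geometric features, improving the fit to 3D structures and yielding an efficient, reliable abstract representation of the 3D scene.

Link: https://arxiv.org/abs/2412.00477
Authors: Chenggang Yang,Yuang Shi,Wei Tsang Ooi
Keywords: computer vision, supporting tasks, tasks like mapping, essential in computer, surface reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: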

Abstract:Abstract representations of 3D scenes are essential in computer vision, supporting tasks like mapping, localization, and surface reconstruction. Line segments are commonly used to capture scene structure, but existing 3D reconstruction methods often face limitations, either from instability in 2D projections or noise in direct 3D data. This paper introduces LineGS, a method that integrates geometry-guided 3D line reconstruction with a 3D Gaussian splatting model to improve accuracy. By leveraging Gaussian point densities along scene edges, LineGS refines initial line segments, aligning them more closely with the scene’s geometric features. Experiments confirm that this approach enhances the fit to 3D structures, providing an efficient and reliable abstract representation of 3D scenes.

[CV-211] Jailbreak Large Visual Language Models Through Multi-Modal Linkage

Link: https://arxiv.org/abs/2412.00473
Authors: Yu Wang,Xiaofei Zhou,Yichen Wang,Geyuan Zhang,Tianxing He
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-212] Enhancing Skin Cancer Diagnosis (SCD) Using Late Discrete Wavelet Transform (DWT) and New Swarm-Based Optimizers

Link: https://arxiv.org/abs/2412.00472
Authors: Ramin Mousa,Saeed Chamani,Mohammad Morsali,Mohammad Kazzazi,Parsa Hatami,Soroush Sarabi
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Comments:

[CV-213] AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models ECCV2024

Link: https://arxiv.org/abs/2412.00465
Authors: Yutong Zhou,Masahiro Ryo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPPA @ECCV2024. Dataset: this https URL

[CV-214] BGM: Background Mixup for X-ray Prohibited Items Detection

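Quick read: This paper addresses the lack of comprehensive data-driven exploration in X-ray image-based prohibited item detection. The key is a new data augmentation technique, Background Mixup (BGM), which exploits physical properties unique to X-ray imagery: transmitted X-ray pixels carry composite information from the materials along the imaging path, and pseudo-color rendering reflects material properties. By applying Mixup at patch level, BGM introduces variation in baggage contour information and material information, which pushes the model to attend more to foreground objects. BGM is plug-and-play, parameter-free, and highly generalizable; it overcomes the limitations of classical visual augmentations on non-reflected-light imagery and delivers consistent gains on X-ray datasets from different devices and environments.

Link: https://arxiv.org/abs/2412.00460
Authors: Weizhe Liu,Renshuai Tao,Hongguang Zhu,Yunda Sun,Yao Zhao,Yunchao Wei
Keywords: Prohibited item detection, ensuring public safety, comprehensive data-driven exploration, lack comprehensive data-driven, current X-ray image-based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: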

Abstract:Prohibited item detection is crucial for ensuring public safety, yet current X-ray image-based detection methods often lack comprehensive data-driven exploration. This paper introduces a novel data augmentation approach tailored for prohibited item detection, leveraging unique characteristics inherent to X-ray imagery. Our method is motivated by observations of physical properties including: 1) X-ray Transmission Imagery: Unlike reflected light images, transmitted X-ray pixels represent composite information from multiple materials along the imaging path. 2) Material-based Pseudo-coloring: Pseudo-color rendering in X-ray images correlates directly with material properties, aiding in material distinction. Building on a novel perspective from physical properties, we propose a simple yet effective X-ray image augmentation technique, Background Mixup (BGM), for prohibited item detection in security screening contexts. The essence is the rich background simulation of X-ray images to induce the model to increase its attention to the foreground. The approach introduces 1) contour information of baggage and 2) variation of material information into the original image by Mixup at patch level. Background Mixup is plug-and-play, parameter-free, highly generalizable and provides an effective solution to the limitations of classical visual augmentations in non-reflected light imagery. When implemented with different high-performance detectors, our augmentation method consistently boosts performance across diverse X-ray datasets from various devices and environments. Extensive experimental results demonstrate that our approach surpasses strong baselines while maintaining similar training resources.
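
To make "Mixup at patch level" concrete, here is a minimal sketch of a generic patch-wise mixup between an image and a background image. It is an illustration under assumptions, not the BGM algorithm itself: the function name `patch_mixup`, the fixed blend weight `lam`, and the 50% patch selection rule are all hypothetical, and the paper describes its own method as parameter-free.

```python
import random


def patch_mixup(image, background, patch, lam, rng):
    """Blend randomly chosen square patches of `image` with the co-located
    patches of `background`.

    Images are 2D lists of floats with identical shape. Each patch is mixed
    with probability 0.5 (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            if rng.random() < 0.5:
                continue  # leave this patch as in the original image
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    out[y][x] = lam * image[y][x] + (1.0 - lam) * background[y][x]
    return out


# Toy 4x4 example: a uniform "baggage" image and an empty background.
image = [[1.0] * 4 for _ in range(4)]
background = [[0.0] * 4 for _ in range(4)]
mixed = patch_mixup(image, background, patch=2, lam=0.5, rng=random.Random(0))
```

Blending pixel values, rather than pasting one patch over another, loosely mirrors the observation that a transmitted X-ray pixel is a composite of every material along the imaging path.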

[CV-215] Learning Locally Revising Globally: Global Reviser for Federated Learning with Noisy Labels

Link: https://arxiv.org/abs/2412.00452
Authors: Yuxin Tian,Mouxing Yang,Yuhao Zhou,Jian Wang,Qing Ye,Tongliang Liu,Gang Niu,Jiancheng Lv
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

[CV-216] A conditional Generative Adversarial network model for the Weather4Cast 2024 Challenge

Link: https://arxiv.org/abs/2412.00451
Authors: Atharva Deshpande,Kaushik Gopalan,Jeet Shah,Hrishikesh Simu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-217] ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Link: https://arxiv.org/abs/2412.00447
Authors: Xubing Ye,Yukang Gan,Yixiao Ge,Xiao-Ping Zhang,Yansong Tang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

[CV-218] Hybrid Local-Global Context Learning for Neural Video Compression

Link: https://arxiv.org/abs/2412.00446
Authors: Yongqi Zhai,Jiayu Yang,Wei Jiang,Chunhui Yang,Luyang Tang,Ronggang Wang
Keywords:
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to DCC 2024

[CV-219] Two Models for Surface Segmentation using the Total Variation of the Normal Vector

Link: https://arxiv.org/abs/2412.00445
Authors: Lukas Baumgärtner,Ronny Bergmann,Roland Herzog,Stephan Schmidt,Manuel Weiß
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:

[CV-220] Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Link: https://arxiv.org/abs/2412.00440
Authors: Haicheng Wang,Chen Ju,Weixiong Lin,Shuai Xiao,Mengting Chen,Yixuan Huang,Chang Liu,Mingshuai Yao,Jinsong Lan,Ying Chen,Qingwen Liu,Yanfeng Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-221] Dynamic Token Selection for Aerial-Ground Person Re-Identification

Link: https://arxiv.org/abs/2412.00433
Authors: Yuhai Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-222] Learner Attentiveness and Engagement Analysis in Online Education Using Computer Vision

Link: https://arxiv.org/abs/2412.00429
Authors: Sharva Gogawale,Madhura Deshpande,Parteek Kumar,Irad Ben-Gal
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-223] FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Link: https://arxiv.org/abs/2412.00427
Authors: Teng-Fang Hsiao,Bo-Kai Ruan,Sung-Lin Tsai,Yi-Lun Wu,Hong-Han Shuai
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-224] TAROT: Targeted Data Selection via Optimal Transport

Link: https://arxiv.org/abs/2412.00420
Authors: Lan Feng,Fan Nie,Yuejiang Liu,Alexandre Alahi
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

[CV-225] Hard-Label Black-Box Attacks on 3D Point Clouds

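Quick read: This paper addresses adversarial attacks on 3D point cloud models, in particular the fact that existing white-box and black-box attacks are hard to deploy in practice. The key is a new hard-label black-box attack, in which the attacker can access only the predicted label of a 3D input. Specifically, the paper introduces a 3D attack method based on a spectrum-aware decision boundary algorithm: it constructs a class-aware model decision boundary and uses a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes, producing intermediate samples without distorting the original geometry. It then moves the intermediate sample along the decision boundary using iterative coordinate-spectrum optimization with curvature-aware boundary search, producing adversarial point clouds with small perturbations. Experiments show the method outperforms existing white-box and black-box attacks in attack performance and adversarial sample quality.

Link: https://arxiv.org/abs/2412.00404
Authors: Daizong Liu,Yunbo Tao,Pan Zhou,Wei Hu
Keywords: safety-critical applications, maturity of depth, depth sensors, decision boundary, point
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: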

Abstract:With the maturity of depth sensors in various 3D safety-critical applications, 3D point cloud models have been shown to be vulnerable to adversarial attacks. Almost all existing 3D attackers simply follow the white-box or black-box setting to iteratively update coordinate perturbations based on back-propagated or estimated gradients. However, these methods are hard to deploy in real-world scenarios (no model details are provided) as they severely rely on parameters or output logits of victim models. To this end, we propose point cloud attacks from a more practical setting, i.e., hard-label black-box attack, in which attackers can only access the prediction label of 3D input. We introduce a novel 3D attack method based on a new spectrum-aware decision boundary algorithm to generate high-quality adversarial samples. In particular, we first construct a class-aware model decision boundary, by developing a learnable spectrum-fusion strategy to adaptively fuse point clouds of different classes in the spectral domain, aiming to craft their intermediate samples without distorting the original geometry. Then, we devise an iterative coordinate-spectrum optimization method with curvature-aware boundary search to move the intermediate sample along the decision boundary for generating adversarial point clouds with trivial perturbations. Experiments demonstrate that our attack competitively outperforms existing white/black-box attackers in terms of attack performance and adversary quality.
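
In the hard-label setting the attacker sees only the predicted class, so locating the decision boundary has to rely on label queries alone. The sketch below shows the generic binary search used by many decision-based attacks, applied in plain coordinate space. It is not the paper's spectrum-aware or curvature-aware procedure; `boundary_search`, `mix`, and the toy classifier are hypothetical names introduced for illustration.

```python
def mix(source, target, alpha):
    """Per-point linear interpolation between two point clouds of equal size."""
    return [
        tuple((1.0 - alpha) * s + alpha * t for s, t in zip(ps, pt))
        for ps, pt in zip(source, target)
    ]


def boundary_search(predict_label, source, target, steps=20):
    """Find a point cloud near the decision boundary using label queries only.

    `predict_label` is a black box returning a class label. `target` must be
    classified differently from `source`.
    """
    source_label = predict_label(source)
    if predict_label(target) == source_label:
        raise ValueError("target must have a different label than source")
    lo, hi = 0.0, 1.0  # mix at lo keeps the source label, mix at hi does not
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if predict_label(mix(source, target, mid)) == source_label:
            lo = mid
        else:
            hi = mid
    return mix(source, target, hi), hi


# Toy stand-in for a victim model: class 1 if the mean x coordinate exceeds 0.5.
def toy_label(cloud):
    return 1 if sum(p[0] for p in cloud) / len(cloud) > 0.5 else 0


source = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
adversarial, alpha = boundary_search(toy_label, source, target)
```

Each iteration costs one query and halves the uncertainty about where the label flips, so 20 steps locate the boundary along this line to within about one millionth of the source-to-target distance.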

[CV-226] DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses

Link: https://arxiv.org/abs/2412.00397
Authors: Yatian Pang,Bin Zhu,Bin Lin,Mingzhe Zheng,Francis E. H. Tay,Ser-Nam Lim,Harry Yang,Li Yuan
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-227] GradiSeg: Gradient-Guided Gaussian Segmentation with Enhanced 3D Boundary Precision

Link: https://arxiv.org/abs/2412.00392
Authors: Zehao Li,Wenwei Han,Yujun Cai,Hao Jiang,Baolong Bi,Shuqin Gao,Honglong Zhao,Zhaoqi Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-228] DogLayout: Denoising Diffusion GAN for Discrete and Continuous Layout Generation

Link: https://arxiv.org/abs/2412.00381
Authors: Zhaoxing Gan,Guangnan Ye
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

[CV-229] Bi-Band ECoGNet for ECoG Decoding on Classification Task

Link: https://arxiv.org/abs/2412.00378
Authors: Changqing Ji,Keisuke Kawasaki,Isao Hasegwa,Takayuki Okatani
Keywords:
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-230] Implementation of neural network operators with applications to remote sensing data

Link: https://arxiv.org/abs/2412.00375
Authors: Danilo Costarelli,Michele Piconi
Keywords:
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-231] LQ-Adapter: ViT-Adapter with Learnable Queries for Gallbladder Cancer Detection from Ultrasound Image WACV2025

Link: https://arxiv.org/abs/2412.00374
Authors: Chetan Madan,Mayuna Gupta,Soumen Basu,Pankaj Gupta,Chetan Arora
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025

[CV-232] LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2412.00364
Authors: Huadong Tang,Youpeng Zhao,Yan Huang,Min Xu,Jun Wang,Qiang Wu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-233] Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2412.00357
Authors: Sanghyun Kim,Moonseok Choi,Jinwoo Shin,Juho Lee
Keywords:
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 18 figures

[CV-234] Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey

Link: https://arxiv.org/abs/2412.00348
Authors: Wei Zhou,Lei Zhao,Runyu Zhang,Yifan Cui,Hongpu Huang,Kun Qie,Chen Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-235] Fusing Physics-Driven Strategies and Cross-Modal Adversarial Learning: Toward Multi-Domain Applications

Link: https://arxiv.org/abs/2412.00341
Authors: Hana Satou,Alan Mitkiy
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

[CV-236] EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices

Link: https://arxiv.org/abs/2412.00334
Authors: Meihan Wu,Tao Chang,Cui Miao,Jie Zhou,Chun Li,Xiangyu Xu,Ming Li,Xiaodong Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-237] Gaussians on their Way: Wasserstein-Constrained 4D Gaussian Splatting with State-Space Modeling

Link: https://arxiv.org/abs/2412.00333
Authors: Junli Deng,Yihao Luo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-238] Towards Pixel-Level Prediction for Gaze Following: Benchmark and Approach

Link: https://arxiv.org/abs/2412.00309
Authors: Feiyang Liu,Dan Guo,Jingyuan Xu,Zihao He,Shengeng Tang,Kun Li,Meng Wang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-239] Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Link: https://arxiv.org/abs/2412.00306
Authors: Yizhi Song,Liu He,Zhifei Zhang,Soo Ye Kim,He Zhang,Wei Xiong,Zhe Lin,Brian Price,Scott Cohen,Jianming Zhang,Daniel Aliaga
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-240] HSLiNets: Hyperspectral Image and LiDAR Data Fusion Using Efficient Dual Linear Feature Learning Networks

Link: https://arxiv.org/abs/2412.00302
Authors: Judy X Yang,Jing Wang,Chen Hong Sui,Zekun Long,Jun Zhou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 2 figures

[CV-241] Real-Time Metric-Semantic Mapping for Autonomous Navigation in Outdoor Environments

Link: https://arxiv.org/abs/2412.00291
Authors: Jianhao Jiao,Ruoyu Geng,Yuanhang Li,Ren Xin,Bowen Yang,Jin Wu,Lujia Wang,Ming Liu,Rui Fan,Dimitrios Kanoulas
Keywords:
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 9 figures, accepted to IEEE Transactions on Automation Science and Engineering

[CV-242] Adapting the re-ID challenge for static sensors

Link: https://arxiv.org/abs/2412.00290
Authors: Avirath Sundaresan,Jason R. Parham,Jonathan Crall,Rosemary Warungu,Timothy Muthami,Margaret Mwangi,Jackson Miliko,Jason Holmberg,Tanya Y. Berger-Wolf,Daniel Rubenstein,Charles V. Stewart,Sara Beery
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 11 figures. Submitted to the IET Computer Vision Special Issue on Camera Traps, AI, and Ecology. Extended version of a workshop paper presented at Camera Traps, AI, and Ecology 2023

[CV-243] SS Linear Fusion Model: Hyperspectral Imaging Efficient Spatial and Spectral Linear Model with Bidirectional Feature Learning

Link: https://arxiv.org/abs/2412.00283
Authors: Judy X Yang,Jing Wang,Zekun Long,Chenhong Sui,Jun Zhou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures and 10 tables

[CV-244] Facial Expression Recognition with Controlled Privacy Preservation and Feature Compensation WACV2025

Link: https://arxiv.org/abs/2412.00277
Authors: Feng Xu,David Ahmedt-Aristizabal,Peterson Lars,Dadong Wang,Xun Li
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV2025 accepted

[CV-245] Real-to-Sim via End-to-End Differentiable Simulation and Rendering

Link: https://arxiv.org/abs/2412.00259
Authors: Yifan Zhu,Tianyi Xiang,Aaron Dollar,Zherong Pan
Keywords:
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

[CV-246] Excretion Detection in Pigsties Using Convolutional and Transformerbased Deep Neural Networks

Link: https://arxiv.org/abs/2412.00256
Authors: Simon Mielke,Anthony Stein
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Keywords: Artificial Intelligence, Object detection, Pig, Urine puddle, Thermal IR data, CNN vs Transformer, Precision Livestock Farming; Stats: 54 pages, 13 figures, 1 graphical abstract

[CV-247] Uni-SLAM: Uncertainty-Aware Neural Implicit SLAM for Real-Time Dense Indoor Scene Reconstruction WACV2025

Link: https://arxiv.org/abs/2412.00242
Authors: Shaoxiang Wang,Yaxu Xie,Chun-Peng Chang,Christen Millerdurai,Alain Pagani,Didier Stricker
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Winter Conference on Applications of Computer Vision (WACV 2025)

[CV-248] Twisted Convolutional Networks (TCNs): Enhancing Feature Interactions for Non-Spatial Data Classification

Link: https://arxiv.org/abs/2412.00238
Authors: Junbo Jacob Lian
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The source code for the TCNs can be accessed at this https URL

[CV-249] Hybrid Spiking Neural Network – Transformer Video Classification Model

Link: https://arxiv.org/abs/2412.00237
Authors: Aaron Bateni
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 37 pages, 11 figures. BSc Thesis in Computer Science. Code available

[CV-250] Diffusion Model Guided Sampling with Pixel-Wise Aleatoric Uncertainty Estimation WACV2025

Link: https://arxiv.org/abs/2412.00205
Authors: Michele De Vita,Vasileios Belagiannis
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at WACV 2025

[CV-251] LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Link: https://arxiv.org/abs/2412.00177
Authors: Xiaoyan Xing,Konrad Groh,Sezer Karagolu,Theo Gevers,Anand Bhattad
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page: this https URL

[CV-252] Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Link: https://arxiv.org/abs/2412.00176
Authors: Hui Ren,Joanna Materzynska,Rohit Gandikota,David Bau,Antonio Torralba
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-253] Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning

Link: https://arxiv.org/abs/2412.00175
Authors: Dragos-Alexandru Boldisor,Stefan Smeu,Dan Oneata,Elisabeta Oneata
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments:

[CV-254] SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Link: https://arxiv.org/abs/2412.00174
Authors: Jianping Jiang,Weiye Xiao,Zhengyu Lin,Huaizhong Zhang,Tianxiang Ren,Yang Gao,Zhiqian Lin,Zhongang Cai,Lei Yang,Ziwei Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-255] RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World

Link: https://arxiv.org/abs/2412.00171
Authors: Weixin Mao,Weiheng Zhong,Zhou Jiang,Dong Fang,Zhongyue Zhang,Zihan Lan,Fan Jia,Tiancai Wang,Haoqiang Fan,Osamu Yoshie
Keywords:
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 16 figures

[CV-256] STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Link: https://arxiv.org/abs/2412.00161
Authors: Haiyi Qiu,Minghe Gao,Long Qian,Kaihang Pan,Qifan Yu,Juncheng Li,Wenjie Wang,Siliang Tang,Yueting Zhuang,Tat-Seng Chua
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-257] AerialGo: Walking-through City View Generation from Aerial Perspectives

Link: https://arxiv.org/abs/2412.00157
Authors: Fuqiang Zhao,Yijing Guo,Siyuan Yang,Xi Chen,Luo Wang,Lan Xu,Yingliang Zhang,Yujiao Shi,Jingyi Yu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

[CV-258] VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

Link: https://arxiv.org/abs/2412.00156
Authors: Taesung Kwon,Jong Chul Ye
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 9 figures

[CV-259] T-3DGS: Removing Transient Objects for 3D Scene Reconstruction

Link: https://arxiv.org/abs/2412.00155
Authors: Vadim Pryadilshchikov,Alexander Markin,Artem Komarichev,Ruslan Rakhimov,Peter Wonka,Evgeny Burnaev
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-260] ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

Link: https://arxiv.org/abs/2412.00153
Authors: Kunyang Han,Yibo Hu,Mengxue Qu,Hailin Shi,Yao Zhao,Yunchao Wei
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-261] DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Link: https://arxiv.org/abs/2412.00151
Authors: Ahmad Mohammadshirazi,Pinaki Prasad Guha Neogi,Ser-Nam Lim,Rajiv Ramnath
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-262] Curriculum Fine-tuning of Vision Foundation Model for Medical Image Classification Under Label Noise NEURIPS2024

Link: https://arxiv.org/abs/2412.00150
Authors: Yeonguk Yu,Minhwan Ko,Sungho Shin,Kangmin Kim,Kyoobin Lee
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted at NeurIPS 2024

[CV-263] Motion Modes: What Could Happen Next?

Link: https://arxiv.org/abs/2412.00148
Authors: Karran Pandey,Matheus Gadelha,Yannick Hold-Geoffroy,Karan Singh,Niloy J. Mitra,Paul Guerrero
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-264] MPQ-Diff: Mixed Precision Quantization for Diffusion Models

Link: https://arxiv.org/abs/2412.00144
Authors: Rocco Manz Maruzzelli,Basile Lewandowski,Lydia Y. Chen
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-265] Is Oracle Pruning the True Oracle?

Link: https://arxiv.org/abs/2412.00143
Authors: Sicheng Feng,Keda Tao,Huan Wang
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL

[CV-266] Differentiable Topology Estimating from Curvatures for 3D Shapes

Link: https://arxiv.org/abs/2412.00140
Authors: Yihao Luo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

[CV-267] EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval

Link: https://arxiv.org/abs/2412.00139
Authors: Muhammad Huzaifa,Yova Kementchedjhieva
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-268] Unleashing the Power of Data Synthesis in Visual Localization

Link: https://arxiv.org/abs/2412.00138
Authors: Sihang Li,Siqi Tan,Bowen Chang,Jing Zhang,Chen Feng,Yiming Li
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 24 pages, 21 figures

[CV-269] FonTS: Text Rendering with Typography and Style Controls

Link: https://arxiv.org/abs/2412.00136
Authors: Wenda Shi,Yiren Song,Dengming Zhang,Jiaming Liu,Xingxing Zou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-270] PP-SSL: Priority-Perception Self-Supervised Learning for Fine-Grained Recognition

Link: https://arxiv.org/abs/2412.00134
Authors: ShuaiHeng Li,Qing Cai,Fan Zhang,Menghuan Zhang,Yangyang Shu,Zhi Liu,Huafeng Li,Lingqiao Liu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-271] Event-based Tracking of Any Point with Motion-Robust Correlation Features

Link: https://arxiv.org/abs/2412.00133
Authors: Friedhelm Hamann,Daniel Gehrig,Filbert Febryanto,Kostas Daniilidis,Guillermo Gallego
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 12 figures, 7 tables

[CV-272] Open-Sora Plan: Open-Source Large Video Generation Model

Link: https://arxiv.org/abs/2412.00131
Authors: Bin Lin,Yunyang Ge,Xinhua Cheng,Zongjian Li,Bin Zhu,Shaodong Wang,Xianyi He,Yang Ye,Shenghai Yuan,Liuhan Chen,Tanghui Jia,Junwu Zhang,Zhenyu Tang,Yatian Pang,Bin She,Cen Yan,Zhiheng Hu,Xiaoyi Dong,Lin Chen,Zhang Pan,Xing Zhou,Shaoling Dong,Yonghong Tian,Li Yuan
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: v1.3

[CV-273] Auto-Encoded Supervision for Perceptual Image Super-Resolution

Link: https://arxiv.org/abs/2412.00124
Authors: MinKyu Lee, Sangeek Hyun, Woojin Jun, Jae-Pil Heo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Codes are available at this https URL

[CV-274] Bridging the Gap: Aligning Text-to-Image Diffusion Models with Specific Feedback

Link: https://arxiv.org/abs/2412.00122
Authors: Xuexiang Niu, Jinping Tang, Lei Wang, Ge Zhu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-275] Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning

Link: https://arxiv.org/abs/2412.00121
Authors: Yang Liu, Xinshuo Wang, Jiale Du, Xinbo Gao, Jungong Han
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-276] Relation-Aware Meta-Learning for Zero-shot Sketch-Based Image Retrieval

Link: https://arxiv.org/abs/2412.00120
Authors: Yang Liu, Jiale Du, Xinbo Gao, Jungong Han
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-277] Training Multi-Layer Binary Neural Networks With Local Binary Error Signals

Link: https://arxiv.org/abs/2412.00119
Authors: Luca Colombo, Fabrizio Pittorino, Manuel Roveri
Keywords:
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-278] OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Link: https://arxiv.org/abs/2412.00115
Authors: Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, 5 tables

[CV-279] SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Link: https://arxiv.org/abs/2412.00114
Authors: Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-280] BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Link: https://arxiv.org/abs/2412.00112
Authors: Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

[CV-281] Video Set Distillation: Information Diversification and Temporal Densification

Link: https://arxiv.org/abs/2412.00111
Authors: Yinjie Zhao, Heng Zhao, Bihan Wen, Yew-Soon Ong, Joey Tianyi Zhou
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-282] Demographic Predictability in 3D CT Foundation Embeddings

Link: https://arxiv.org/abs/2412.00110
Authors: Guangyao Zheng, Michael A. Jacobs, Vishwa S. Parekh
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: submitted to Radiology Cardiothoracic Imaging

[CV-283] Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

Link: https://arxiv.org/abs/2412.00100
Authors: Maitreya Patel, Song Wen, Dimitris N. Metaxas, Yezhou Yang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Project Page: this https URL

[CV-284] OFCap: Object-aware Fusion for Image Captioning

Link: https://arxiv.org/abs/2412.00095
Authors: Feiyang Huang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-285] A Novel Approach to Image Steganography Using Generative Adversarial Networks

Link: https://arxiv.org/abs/2412.00094
Authors: Waheed Rehman
Keywords:
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-286] Graph Canvas for Controllable 3D Scene Generation

Link: https://arxiv.org/abs/2412.00091
Authors: Libin Liu, Shen Chen, Sen Jia, Jingzhe Shi, Zhongyu Jiang, Can Jin, Wu Zongkai, Jenq-Neng Hwang, Lei Li
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

[CV-287] Residual Attention Single-Head Vision Transformer Network for Rolling Bearing Fault Diagnosis in Noisy Environments

Link: https://arxiv.org/abs/2412.00085
Authors: Songjiang Lai, Tsun-Hin Cheung, Jiayi Zhao, Kaiwen Xue, Ka-Chun Fung, Kin-Man Lam
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 24 pages, 14 figures, 3 tables

[CV-288] Visual Error Patterns in Multi-Modal AI: A Statistical Approach

Link: https://arxiv.org/abs/2412.00083
Authors: Ching-Yi Wang
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments:

[CV-289] Selfish Evolution: Making Discoveries in Extreme Label Noise with the Help of Overfitting Dynamics

Link: https://arxiv.org/abs/2412.00077
Authors: Nima Sedaghat, Tanawan Chatchadanoraset, Colin Orion Chandler, Ashish Mahabal, Maryam Eslami
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[CV-290] Flaws of ImageNet, Computer Vision's Favourite Dataset

Link: https://arxiv.org/abs/2412.00076
Authors: Nikita Kisel, Illia Volkov, Katerina Hanzelkova, Klara Janouskova, Jiri Matas
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-291] Addressing Vulnerabilities in AI-Image Detection: Challenges and Proposed Solutions

Link: https://arxiv.org/abs/2412.00073
Authors: Justin Jiang
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-292] The Muon Space GNSS-R Surface Soil Moisture Product

Link: https://arxiv.org/abs/2412.00072
Authors: Max Roberts, Ian Colwell, Clara Chew, Dallas Masters, Karl Nordstrom
Keywords:
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Space Physics (physics.space-ph)
Comments: 23 pages, 10 figures

[CV-293] Enhanced Lung Cancer Survival Prediction using Semi-Supervised Pseudo-Labeling and Learning from Diverse PET/CT Datasets

Link: https://arxiv.org/abs/2412.00068
Authors: Mohammad R. Salmanpour, Arman Gorji, Amin Mousavi, Ali Fathi Jouzdani, Nima Sanati, Mehdi Maghsudi, Bonnie Leung, Cheryl Ho, Ren Yuan, Arman Rahmim
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 12 pages and 7 figures

[CV-294] Targeted Therapy in Data Removal: Object Unlearning Based on Scene Graphs

Link: https://arxiv.org/abs/2412.00067
Authors: Chenhan Zhang, Benjamin Zi Hao Zhao, Hassan Asghar, Dali Kaafar
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-295] DiffGuard: Text-Based Safety Checker for Diffusion Models

Link: https://arxiv.org/abs/2412.00064
Authors: Massine El Khader, Elias Al Bouzidi, Abdellah Oumida, Mohammed Sbaihi, Eliott Binard, Jean-Philippe Poli, Wassila Ouerdane, Boussad Addad, Katarzyna Kapusta
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-296] MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Link: https://arxiv.org/abs/2412.00060
Authors: Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-297] Improving Medical Diagnostics with Vision-Language Models: Convex Hull-Based Uncertainty Analysis

Link: https://arxiv.org/abs/2412.00056
Authors: Ferhat Ozgur Catak, Murat Kuzlu, Taylor Patrick
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages

[CV-298] Brick Kiln Dataset for Pakistan's IGP Region Using AI

Link: https://arxiv.org/abs/2412.00052
Authors: Muhammad Suleman Ali Hamdani, Khizer Zakir, Neetu Kushwaha, Syeda Eman Fatima, Hassan Aftab Sheikh
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Nature Scientific Data (under review); 25 pages in total, 6 images, 4 tables, and 1 supplementary document

[CV-299] TransFair: Transferring Fairness from Ocular Disease Classification to Progression Prediction

Link: https://arxiv.org/abs/2412.00051
Authors: Leila Gheisi, Henry Chu, Raju Gottumukkala, Xingquan Zhu, Mengyu Wang, Min Shi
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 12 pages

[CV-300] Mapping waterways worldwide with deep learning

Link: https://arxiv.org/abs/2412.00050
Authors: Matthew Pierson, Zia Mehrabi
Keywords:
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages, 6 figures

[CV-301] A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning

Link: https://arxiv.org/abs/2412.00049
Authors: Luis Vilaca, Yi Yu, Paula Vinan
Keywords:
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: text overlap with arXiv:2202.13673

[CV-302] Disparity-based Stereo Image Compression with Aligned Cross-View Priors

Link: https://arxiv.org/abs/2212.00459
Authors: Yongqi Zhai, Luyang Tang, Yi Ma, Rui Peng, Ronggang Wang
Keywords:
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures, published at ACM Multimedia 2022

[CV-303] Take Your Steps: Hierarchically Efficient Pulmonary Disease Screening via CT Volume Compression

Link: https://arxiv.org/abs/2412.01525
Authors: Qian Shao, Kai Zhang, Bang Du, Zepeng Li, Yixuan Wu, Qiyuan Chen, Jian Wu, Jintai Chen, Honghao Gao, Hongxia Xu
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

[CV-304] 3D Spine Shape Estimation from Single 2D DXA

Link: https://arxiv.org/abs/2412.01504
Authors: Emmanuelle Bourigault, Amir Jamaludin, Andrew Zisserman
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

[CV-305] Embryo 2.0: Merging Synthetic and Real Data for Advanced AI Predictions

Link: https://arxiv.org/abs/2412.01255
Authors: Oriana Presacan, Alexandru Dorobantiu, Vajira Thambawita, Michael A. Riegler, Mette H. Stensen, Mario Iliceto, Alexandru C. Aldea, Akriti Sharma
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-306] DPE-Net: Dual-Parallel Encoder Based Network for Semantic Segmentation of Polyps

Link: https://arxiv.org/abs/2412.00888
Authors: Malik Abdul Manan, Feng Jinchao, Shahzad Ahmed, Abdul Raheem
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-307] DVasMesh: Deep Structured Mesh Reconstruction from Vascular Images for Dynamics Modeling of Vessels MICCAI2024

Link: https://arxiv.org/abs/2412.00840
Authors: Dengqiang Jia, Xinnian Yang, Xiaosong Xiong, Shijie Huang, Feiyu Hou, Li Qin, Kaicong Sun, Kannie Wai Yan Chan, Dinggang Shen
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, MICCAI2024 Workshop, GRAIL

[CV-308] TSUBF-Net: Trans-Spatial UNet-like Network with Bi-direction Fusion for Segmentation of Adenoid Hypertrophy in CT

Link: https://arxiv.org/abs/2412.00787
Authors: Rulin Zhou, Yingjie Feng, Guankun Wang, Xiaopin Zhong, Zongze Wu, Qiang Wu, Xi Zhang
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-309] A Semi-Supervised Approach with Error Reflection for Echocardiography Segmentation

Link: https://arxiv.org/abs/2412.00715
Authors: Xiaoxiang Han, Yiman Liu, Jiang Shang, Qingli Li, Jiangang Chen, Menghan Hu, Qi Zhang, Yuqi Zhang, Yan Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, accepted by 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2024)

[CV-310] Photoacoustic Iterative Optimization Algorithm with Shape Prior Regularization

Link: https://arxiv.org/abs/2412.00705
Authors: Yu Zhang, Shuang Li, Yibing Wang, Yu Sun, Wenyi Xiang
Keywords:
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-311] Deep Learning for Longitudinal Gross Tumor Volume Segmentation in MRI-Guided Adaptive Radiotherapy for Head and Neck Cancer

Link: https://arxiv.org/abs/2412.00663
Authors: Xin Tie, Weijie Chen, Zachary Huemann, Brayden Schott, Nuohao Liu, Tyler J. Bradshaw
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 12 pages, 4 figures, 4 tables

[CV-312] Multi-resolution Guided 3D GANs for Medical Image Translation

Link: https://arxiv.org/abs/2412.00575
Authors: Juhyung Ha, Jong Sung Park, David Crandall, Eleftherios Garyfallidis, Xuhong Zhang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-313] Energy-Based Prior Latent Space Diffusion model for Reconstruction of Lumbar Vertebrae from Thick Slice MRI

Link: https://arxiv.org/abs/2412.00511
Authors: Yanke Wang, Yolanne Y. R. Lee, Aurelio Dolfini, Markus Reischl, Ender Konukoglu, Kyriakos Flouris
Keywords:
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-314] DeepFGS: Fine-Grained Scalable Coding for Learned Image Compression

Link: https://arxiv.org/abs/2412.00437
Authors: Yongqi Zhai, Yi Ma, Luyang Tang, Wei Jiang, Ronggang Wang
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to DCC 2025

[CV-315] Multi-scale Feature Enhancement in Multi-task Learning for Medical Image Analysis

Link: https://arxiv.org/abs/2412.00351
Authors: Phuoc-Nguyen Bui, Duc-Tai Le, Junghyun Bum, Hyunseung Choo
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-316] DYRECT Computed Tomography: DYnamic Reconstruction of Events on a Continuous Timescale

Link: https://arxiv.org/abs/2412.00065
Authors: Wannes Goethals, Tom Bultreys, Steffen Berg, Matthieu N. Boone, Jan Aelterman
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures, article. Submitted to IEEE Transactions on Computational Imaging 23/10/2024

[CV-317] Real-time volumetric free-hand ultrasound imaging for large-sized organs: A study of imaging the whole spine

Link: https://arxiv.org/abs/2412.00058
Authors: Caozhe Li, Enxiang Shen, Haoyang Wang, Yuxin Wang, Jie Yuan, Li Gong, Di Zhao, Weijing Zhang, Zhibin Jin
Keywords:
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Artificial Intelligence

[AI-0] HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems

Link: https://arxiv.org/abs/2412.01799
Authors: Jacky Kwok, Shulu Li, Marten Lohstroh, Edward A. Lee
Keywords: Robot Operating System, extensive sensor data, robust communication middleware, ensure real-time processing, Operating System
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 7 pages

Abstract:The rise of intelligent autonomous systems, especially in robotics and autonomous agents, has created a critical need for robust communication middleware that can ensure real-time processing of extensive sensor data. Current robotics middleware like Robot Operating System (ROS) 2 faces challenges with nondeterminism and high communication latency when dealing with large data across multiple subscribers on a multi-core compute platform. To address these issues, we present High-Performance Robotic Middleware (HPRM), built on top of the deterministic coordination language Lingua Franca (LF). HPRM employs optimizations including an in-memory object store for efficient zero-copy transfer of large payloads, adaptive serialization to minimize serialization overhead, and an eager protocol with real-time sockets to reduce handshake latency. Benchmarks show HPRM achieves up to 173x lower latency than ROS2 when broadcasting large messages to multiple nodes. We then demonstrate the benefits of HPRM by integrating it with the CARLA simulator and running reinforcement learning agents along with object detection workloads. In the CARLA autonomous driving application, HPRM attains 91.1% lower latency than ROS2. The deterministic coordination semantics of HPRM, combined with its optimized IPC mechanisms, enable efficient and predictable real-time communication for intelligent autonomous systems.

[AI-1] From ChebNet to ChebGibbsNet

Link: https://arxiv.org/abs/2412.01789
Authors: Jie Zhang, Min-Te Sun
Keywords: Graph Convolutional Networks, Spectral Graph Convolutional, Convolutional Networks, representation learning tasks, Recent advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, and 7 tables

Abstract:Recent advancements in Spectral Graph Convolutional Networks (SpecGCNs) have led to state-of-the-art performance in various graph representation learning tasks. To exploit the potential of SpecGCNs, we analyze corresponding graph filters via polynomial interpolation, the cornerstone of graph signal processing. Different polynomial bases, such as the Bernstein, Chebyshev, and monomial bases, have various convergence rates that will affect the error in polynomial interpolation. Although adopting the Chebyshev basis for interpolation can minimize the maximum error, the performance of ChebNet is still weaker than GPR-GNN and BernNet. We point out that this is caused by the Gibbs phenomenon, which occurs when the graph frequency response function approximates the target function. It reduces the approximation ability of a truncated polynomial interpolation. In order to mitigate the Gibbs phenomenon, we propose to add the Gibbs damping factor with each term of Chebyshev polynomials on ChebNet. As a result, our lightweight approach leads to a significant performance boost. Afterwards, we reorganize ChebNet via decoupling feature propagation and transformation. We name this variant ChebGibbsNet. Our experiments indicate that ChebGibbsNet is superior to other advanced SpecGCNs, such as GPR-GNN and BernNet, in both homogeneous graphs and heterogeneous graphs.
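
The damping idea in this abstract can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's filter: the abstract does not give the form of its Gibbs damping factor, so the Lanczos sigma factor is used here as an assumed stand-in, and the step-shaped target, the order, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def chebyshev_coefficients(f, order, num_nodes=200):
    """Approximate Chebyshev coefficients of f on [-1, 1] via Chebyshev-Gauss quadrature."""
    theta = np.pi * (np.arange(num_nodes) + 0.5) / num_nodes
    fx = f(np.cos(theta))
    k = np.arange(order + 1)
    coeffs = (2.0 / num_nodes) * (np.cos(np.outer(k, theta)) @ fx)
    coeffs[0] *= 0.5  # the constant term carries a factor of 1/2
    return coeffs

def lanczos_damping(order):
    """Lanczos sigma factors (an assumed damping choice, not taken from the paper)."""
    k = np.arange(order + 1)
    return np.sinc(k / (order + 1))  # np.sinc(x) = sin(pi x) / (pi x)

def step(x):
    """Discontinuous target that triggers the Gibbs phenomenon."""
    return (x > 0).astype(float)

order = 20
coeffs = chebyshev_coefficients(step, order)
damped = coeffs * lanczos_damping(order)  # scale each polynomial term by its damping factor

x = np.linspace(-1.0, 1.0, 2001)
overshoot_plain = np.polynomial.chebyshev.chebval(x, coeffs).max() - 1.0
overshoot_damped = np.polynomial.chebyshev.chebval(x, damped).max() - 1.0
print(f"overshoot without damping: {overshoot_plain:.3f}")
print(f"overshoot with damping:    {overshoot_damped:.3f}")
```

Multiplying each coefficient by a factor that decays with the polynomial degree trades a slightly smoother transition for a smaller overshoot near the discontinuity.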

[AI-2] Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models NEURIPS2024

Link: https://arxiv.org/abs/2412.01784
Authors: Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij
Keywords: sandbagging. We present, play a critical, ensuring the safe, safe deployment, undermined by intentional
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Published at NeurIPS 2024, SATA and SoLaR workshop, 6 pages, 4 figures, 1 table, code available at this https URL

Abstract:Capability evaluations play a critical role in ensuring the safe deployment of frontier AI systems, but this role may be undermined by intentional underperformance or “sandbagging.” We present a novel model-agnostic method for detecting sandbagging behavior using noise injection. Our approach is founded on the observation that introducing Gaussian noise into the weights of models either prompted or fine-tuned to sandbag can considerably improve their performance. We test this technique across a range of model sizes and multiple-choice question benchmarks (MMLU, AI2, WMDP). Our results demonstrate that noise injected sandbagging models show performance improvements compared to standard models. Leveraging this effect, we develop a classifier that consistently identifies sandbagging behavior. Our unsupervised technique can be immediately implemented by frontier labs or regulatory bodies with access to weights to improve the trustworthiness of capability evaluations.
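
A minimal sketch of the weight-noise step described above, assuming model weights are available as a dictionary of NumPy arrays. The function names, the noise scales, and the toy `toy_evaluate` scorer are all hypothetical; the authors' actual implementation and classifier are not described in enough detail here to reproduce.

```python
import numpy as np

def inject_gaussian_noise(weights, sigma, seed=0):
    """Return a copy of the weights with zero-mean Gaussian noise of scale sigma added."""
    rng = np.random.default_rng(seed)
    return {name: w + rng.normal(0.0, sigma, size=w.shape) for name, w in weights.items()}

def best_score_under_noise(weights, evaluate, sigmas, seeds=(0, 1, 2)):
    """For each noise scale, evaluate several noisy copies and keep the best score."""
    best = {}
    for sigma in sigmas:
        scores = [evaluate(inject_gaussian_noise(weights, sigma, seed)) for seed in seeds]
        best[sigma] = max(scores)
    return best

# Hypothetical stand-ins for a real model and benchmark.
toy_weights = {"layer.weight": np.ones((2, 3)), "layer.bias": np.zeros(3)}

def toy_evaluate(w):
    return float(np.mean(w["layer.weight"]))

scores = best_score_under_noise(toy_weights, toy_evaluate, sigmas=(0.0, 0.01))
print(scores)
```

Under the abstract's observation, a score at some nonzero noise scale that clearly exceeds the noise-free score would be the signal worth flagging; the threshold for "clearly" is left open here.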

[AI-3] HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing

Link: https://arxiv.org/abs/2412.01778
Authors: Lajos Muzsai, David Imolai, András Lukács
Keywords: Large Language Model, Large Language, Language Model, based agent capable, penetration testing
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Abstract:We introduce HackSynth, a novel Large Language Model (LLM)-based agent capable of autonomous penetration testing. HackSynth’s dual-module architecture includes a Planner and a Summarizer, which enable it to generate commands and process feedback iteratively. To benchmark HackSynth, we propose two new Capture The Flag (CTF)-based benchmark sets utilizing the popular platforms PicoCTF and OverTheWire. These benchmarks include two hundred challenges across diverse domains and difficulties, providing a standardized framework for evaluating LLM-based penetration testing agents. Based on these benchmarks, extensive experiments are presented, analyzing the core parameters of HackSynth, including creativity (temperature and top-p) and token utilization. Multiple open source and proprietary LLMs were used to measure the agent’s capabilities. The experiments show that the agent performed best with the GPT-4o model, better than what the GPT-4o’s system card suggests. We also discuss the safety and predictability of HackSynth’s actions. Our findings indicate the potential of LLM-based agents in advancing autonomous penetration testing and the importance of robust safeguards. HackSynth and the benchmarks are publicly available to foster research on autonomous cybersecurity solutions.

[AI-4] Robot Learning with Super-Linear Scaling

Link: https://arxiv.org/abs/2412.01770
Authors: Marcel Torne, Arhan Jain, Jiayi Yuan, Vidaaranya Macha, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Abhishek Gupta
Keywords: robot learning requires, human effort, Scaling robot learning, Amortizing Human Effort, human
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Scaling robot learning requires data collection pipelines that scale favorably with human effort. In this work, we propose Crowdsourcing and Amortizing Human Effort for Real-to-Sim-to-Real(CASHER), a pipeline for scaling up data collection and learning in simulation where the performance scales superlinearly with human effort. The key idea is to crowdsource digital twins of real-world scenes using 3D reconstruction and collect large-scale data in simulation, rather than the real-world. Data collection in simulation is initially driven by RL, bootstrapped with human demonstrations. As the training of a generalist policy progresses across environments, its generalization capabilities can be used to replace human effort with model generated demonstrations. This results in a pipeline where behavioral data is collected in simulation with continually reducing human effort. We show that CASHER demonstrates zero-shot and few-shot scaling laws on three real-world tasks across diverse scenarios. We show that CASHER enables fine-tuning of pre-trained policies to a target scenario using a video scan without any additional human effort. See our project website: this https URL

[AI-5] Commit0: Library Generation from Scratch

Link: https://arxiv.org/abs/2412.01769
Authors: Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, Alexander M Rush
Keywords: software development ability, benchmarking generative systems, expert software development, unit tests, development ability
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the goal of benchmarking generative systems beyond expert software development ability, we introduce Commit0, a benchmark that challenges AI agents to write libraries from scratch. Agents are provided with a specification document outlining the library’s API as well as a suite of interactive unit tests, with the goal of producing an implementation of this API accordingly. The implementation is validated through running these unit tests. As a benchmark, Commit0 is designed to move beyond static one-shot code generation towards agents that must process long-form natural language specifications, adapt to multi-stage feedback, and generate code with complex dependencies. Commit0 also offers an interactive environment where models receive static analysis and execution feedback on the code they generate. Our experiments demonstrate that while current agents can pass some unit tests, none can yet fully reproduce full libraries. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating the benchmarks that facilitate its use.

[AI-6] Efficient Compression of Sparse Accelerator Data Using Implicit Neural Representations and Importance Sampling

Link: https://arxiv.org/abs/2412.01754
Authors: Xihaier Luo, Samuel Lurvey, Yi Huang, Yihui Ren, Jin Huang, Byung-Jun Yoon
Keywords: high-energy physics generate, large-scale particle colliders, physics generate data, high-energy physics, colliders in nuclear
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures

Abstract:High-energy, large-scale particle colliders in nuclear and high-energy physics generate data at extraordinary rates, reaching up to 1 terabyte and several petabytes per second, respectively. The development of real-time, high-throughput data compression algorithms capable of reducing this data to manageable sizes for permanent storage is of paramount importance. A unique characteristic of the tracking detector data is the extreme sparsity of particle trajectories in space, with an occupancy rate ranging from approximately 10^-6 to 10%. Furthermore, for downstream tasks, a continuous representation of this data is often more useful than a voxel-based, discrete representation due to the inherently continuous nature of the signals involved. To address these challenges, we propose a novel approach using implicit neural representations for data learning and compression. We also introduce an importance sampling technique to accelerate the network training process. Our method is competitive with traditional compression algorithms, such as MGARD, SZ, and ZFP, while offering significant speed-ups and maintaining negligible accuracy loss through our importance sampling strategy.
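
One way importance sampling can help with very sparse volumes is to draw training coordinates mostly from occupied voxels and reweight the loss accordingly. The sketch below assumes a simple two-level sampling distribution; the abstract does not specify the paper's actual scheme, and `background_weight`, the grid size, and the function name are hypothetical.

```python
import numpy as np

def importance_sample_indices(values, batch_size, background_weight=0.01, seed=0):
    """Sample flat voxel indices, favoring nonzero voxels, with unbiasing weights."""
    rng = np.random.default_rng(seed)
    flat = values.ravel()
    # Occupied voxels get weight 1; empty voxels get a small assumed weight.
    weights = np.where(flat != 0, 1.0, background_weight)
    probs = weights / weights.sum()
    idx = rng.choice(flat.size, size=batch_size, replace=True, p=probs)
    # Ratio of uniform to sampling probability keeps a weighted loss unbiased.
    importance_weights = (1.0 / flat.size) / probs[idx]
    return idx, importance_weights

# Toy sparse volume: 8 occupied voxels out of 8000 (occupancy 0.001).
grid = np.zeros((20, 20, 20))
occupied = np.random.default_rng(1).choice(grid.size, size=8, replace=False)
grid.ravel()[occupied] = 1.0

idx, iw = importance_sample_indices(grid, batch_size=2000)
true_occupancy = np.count_nonzero(grid) / grid.size
sampled_occupancy = np.count_nonzero(grid.ravel()[idx]) / idx.size
print(true_occupancy, sampled_occupancy)
```

A training loop would multiply each sampled point's loss by its importance weight, so the network still sees mostly signal-bearing voxels while the expected loss matches uniform sampling.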

[AI-7] Digital Epidemiology: Leveraging Social Media for Insight into Epilepsy and Mental Health

Link: https://arxiv.org/abs/2412.01692
Authors: Liza Dahiya, Rachit Bagga
Keywords: Social media platforms, Social media, media platforms, offer a unique, unique perspective
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Social media platforms, particularly Reddit’s r/Epilepsy community, offer a unique perspective into the experiences of individuals with epilepsy (PWE) and their caregivers. This study analyzes 57k posts and 533k comments to explore key themes across demographics such as age, gender, and relationships. Our findings highlight significant discussions on epilepsy-related challenges, including depression (with 39.75% of posts indicating severe symptoms), driving restrictions, workplace concerns, and pregnancy-related issues in women with epilepsy. We introduce a novel engagement metric, F(P), which incorporates post length, sentiment scores, and readability to quantify community interaction. This analysis underscores the importance of integrated care addressing both neurological and mental health challenges faced by PWE. The insights from this study inform strategies for targeted support and awareness interventions.

[AI-8] Causal Discovery by Interventions via Integer Programming

Link: https://arxiv.org/abs/2412.01674
Authors: Abdelmonem Elrefaey, Rong Pan
Keywords: uncover causal structures, discovery is essential, scientific fields, fields to uncover, Causal discovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Causal discovery is essential across various scientific fields to uncover causal structures within data. Traditional methods relying on observational data have limitations due to confounding variables. This paper presents an optimization-based approach using integer programming (IP) to design minimal intervention sets that ensure causal structure identifiability. Our method provides exact and modular solutions that can be adjusted to different experimental settings and constraints. We demonstrate its effectiveness, applicability, and robustness through comparative analysis across different settings.

[AI-9] PassionNet: An Innovative Framework for Duplicate and Conflicting Requirements Identification

Link: https://arxiv.org/abs/2412.01657
Authors: Summra Saleem, Muhammad Nabeel Asim, Andreas Dengel
Keywords: Early detection, significantly enhance project, enhance project efficiency, leveraging Artificial Intelligence, predictive pipelines
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Early detection and resolution of duplicate and conflicting requirements can significantly enhance project efficiency and overall software quality. Researchers have developed various computational predictors by leveraging the potential of Artificial Intelligence (AI) to detect duplicate and conflicting requirements. However, these predictors lack performance, and more effective approaches are required to empower software development processes. Following the need for a unique predictor that can accurately identify duplicate and conflicting requirements, this research offers a comprehensive framework that facilitates the development of 3 different types of predictive pipelines: language models based, multi-model similarity knowledge-driven, and large language models (LLMs) context + multi-model similarity knowledge-driven. Within the first type of predictive pipelines, the framework facilitates conflicting/duplicate requirements identification by leveraging 8 distinct types of LLMs. In the second type, the framework supports the development of predictive pipelines that leverage multi-scale and multi-model similarity knowledge, ranging from traditional similarity computation methods to advanced similarity vectors generated by LLMs. In the third type, the framework synthesizes predictive pipelines by integrating contextual insights from LLMs with multi-model similarity knowledge. Across 6 public benchmark datasets, extensive testing of 760 distinct predictive pipelines demonstrates that hybrid predictive pipelines consistently outperform the other two types of predictive pipelines in accurately identifying duplicate and conflicting requirements. This predictive pipeline outperformed existing state-of-the-art predictors with an overall performance margin of 13% in terms of F1-score.

[AI-10] Command-line Risk Classification using Transformer-based Neural Architectures

Link: https://arxiv.org/abs/2412.01655
Authors: Paolo Notaro, Soroush Haeri, Jorge Cardoso, Michael Gerndt
Keywords: Operations and Maintenance, increasing computing demand, protect large-scale computing, large-scale computing environments, meet increasing computing
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:To protect large-scale computing environments necessary to meet increasing computing demand, cloud providers have implemented security measures to monitor Operations and Maintenance (O&M) activities and therefore prevent data loss and service interruption. Command interception systems are used to intercept, assess, and block dangerous Command-line Interface (CLI) commands before they can cause damage. Traditional solutions for command risk assessment include rule-based systems, which require expert knowledge and constant human revision to account for unseen commands. To overcome these limitations, several end-to-end learning systems have been proposed to classify CLI commands. These systems, however, have several other limitations, including the adoption of general-purpose text classifiers, which may not adapt to the language characteristics of scripting languages such as Bash or PowerShell, and may not recognize dangerous commands in the presence of an unbalanced class distribution. In this paper, we propose a transformer-based command risk classification system, which leverages the generalization power of Large Language Models (LLM) to provide accurate classification and the ability to identify rare dangerous commands effectively, by exploiting the power of transfer learning. We verify the effectiveness of our approach on a realistic dataset of production commands and show how to apply our model for other security-related tasks, such as dangerous command interception and auditing of existing rule-based systems.

[AI-11] Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks

Link: https://arxiv.org/abs/2412.01650
Authors: Wenhan Dong,Chao Lin,Xinlei He,Xinyi Huang,Shengmin Xu
Keywords: Privacy-preserving federated learning, Privacy-preserving federated, aims to train, train a global, global model
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Privacy-preserving federated learning (PPFL) aims to train a global model for multiple clients while maintaining their data privacy. However, current PPFL protocols exhibit one or more of the following insufficiencies: considerable degradation in accuracy, the requirement for sharing keys, and cooperation during the key generation or decryption processes. As a mitigation, we develop the first protocol that utilizes neural networks to implement PPFL, as well as incorporating an Aggregatable Hybrid Encryption scheme tailored to the needs of PPFL. We name these networks as Homomorphic Adversarial Networks (HANs) which demonstrate that neural networks are capable of performing tasks similar to multi-key homomorphic encryption (MK-HE) while solving the problems of key distribution and collaborative decryption. Our experiments show that HANs are robust against privacy attacks. Compared with non-private federated learning, experiments conducted on multiple datasets demonstrate that HANs exhibit a negligible accuracy loss (at most 1.35%). Compared to traditional MK-HE schemes, HANs increase encryption aggregation speed by 6,075 times while incurring a 29.2 times increase in communication overhead.

[AI-12] Optimizing LoRa for Edge Computing with TinyML Pipeline for Channel Hopping

Link: https://arxiv.org/abs/2412.01609
Authors: Marla Grunewald,Mounir Bensalem,Admela Jukan
Keywords: integrate long-distance LongRange, edge computing system, open source implementations, long-distance LongRange, edge computing
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Performance (cs.PF)
Comments: This paper is uploaded here for research community, thus it is for non-commercial purposes

Abstract:We propose to integrate the long-distance LongRange (LoRa) communication solution for sending data from IoT to the edge computing system, taking advantage of its unlicensed nature and the potential for open source implementations that are common in edge computing. We propose a channel hopping optimization model and apply a TinyML-based channel hopping model to LoRa transmissions, and we experimentally study a fast predictive algorithm to find free channels between edge and IoT devices. In the open source experimental setup that includes LoRa, TinyML and the IoT-edge-cloud continuum, we integrate a novel application workflow and cloud-friendly protocol solutions in a case study of a plant recommender application that combines concepts of microfarming and urban computing. In a LoRa-optimized edge computing setup, we engineer the application workflow, and apply collaborative filtering and various machine learning algorithms on the collected application data to identify and recommend the planting schedule for a specific microfarm in an urban area. In the LoRa experiments, we measure the occurrence of packet loss, RSSI, and SNR, using a random channel hopping scheme to compare with our proposed TinyML method. The results show that it is feasible to use TinyML in microcontrollers for channel hopping, while proving the effectiveness of TinyML in learning to predict the best channel to select for LoRa transmission, improving the RSSI by up to 63% and the SNR by up to 44% in comparison with a random hopping mechanism.

[AI-13] Agentic-HLS: An agentic reasoning based high-level synthesis system using large language models (AI for EDA workshop 2024) NEURIPS2024

Link: https://arxiv.org/abs/2412.01604
Authors: Ali Emre Oztas,Mahdi Jelodari
Keywords: utilization rate, rate of BRAM, digital signal processors, Contest for Chip, Chip Design
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: AI4EDA co-located with 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Our aim for the ML Contest for Chip Design with HLS 2024 was to predict the validity, running latency in the form of cycle counts, utilization rate of BRAM (util-BRAM), utilization rate of lookup tables (util-LUT), utilization rate of flip flops (util-FF), and the utilization rate of digital signal processors (util-DSP). We used chain-of-thought techniques with large language models to perform classification and regression tasks. Our prediction is that with larger models reasoning was much improved. We release our prompts and propose an HLS benchmarking task for LLMs.

[AI-14] Handwriting-based Automated Assessment and Grading of Degree of Handedness: A Pilot Study

Link: https://arxiv.org/abs/2412.01587
Authors: Smriti Bala,Venugopalan Y. Vishnu,Deepak Joshi
Keywords: Hand preference, aspects of human, human behavior, Convolutional Neural Network, Partially Unidextrous
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Hand preference and degree of handedness (DoH) are two different aspects of human behavior that are often conflated. DoH is an inherent capability of a person’s brain, affected by nature and nurture. In this study, we used dominant and non-dominant handwriting traits to assess DoH for the first time, on 43 subjects in three categories: Unidextrous, Partially Unidextrous, and Ambidextrous. Features extracted from the segmented handwriting signals, called strokes, were used for DoH quantification. The Davies-Bouldin Index, a Multilayer Perceptron, and a Convolutional Neural Network (CNN) were used for automated grading of DoH. The outcomes of these methods were compared with the widely used DoH assessment questionnaires from the Edinburgh Inventory (EI). The CNN-based automated grading outperformed the other computational methods with an average classification accuracy of 95.06% under stratified 10-fold cross-validation. The leave-one-subject-out strategy on this CNN resulted in a test individual’s DoH score, which was converted into a 4-point score. Around 90% of the obtained scores from all the implemented computational methods were found to be in accordance with the EI scores under a 95% confidence interval. Automated grading of degree of handedness using handwriting signals can provide more resolution to the Edinburgh Inventory scores. This could be used in multiple applications concerned with neuroscience, rehabilitation, physiology, psychometry, behavioral sciences, and forensics.

[AI-15] MBA-RAG: a Bandit Approach for Adaptive Retrieval-Augmented Generation through Question Complexity

Link: https://arxiv.org/abs/2412.01572
Authors: Xiaqiang Tang,Qiang Gao,Jian Li,Nan Du,Qi Li,Sihong Xie
Keywords: Retrieval Augmented Generation, Augmented Generation, knowledge-intensive tasks, existing RAG framework, Retrieval Augmented
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval Augmented Generation (RAG) has proven to be highly effective in boosting the generative performance of language models in knowledge-intensive tasks. However, existing RAG frameworks either indiscriminately perform retrieval or rely on rigid single-class classifiers to select retrieval methods, leading to inefficiencies and suboptimal performance across queries of varying complexity. To address these challenges, we propose a reinforcement learning-based framework that dynamically selects the most suitable retrieval strategy based on query complexity. Our approach leverages a multi-armed bandit algorithm, which treats each retrieval method as a distinct “arm” and adapts the selection process by balancing exploration and exploitation. Additionally, we introduce a dynamic reward function that balances accuracy and efficiency, penalizing methods that require more retrieval steps, even if they lead to a correct result. Our method achieves new state-of-the-art results on multiple single-hop and multi-hop datasets while reducing retrieval costs. Our code is available at this https URL.
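The bandit idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain epsilon-greedy bandit that ignores query features (the paper conditions the choice on query complexity), and the strategy names, accuracies, step counts, and `step_penalty` value are all hypothetical numbers chosen for illustration.

```python
import random


class EpsilonGreedyBandit:
    """Pick one retrieval strategy (arm) per query and learn from rewards."""

    def __init__(self, arms, epsilon=0.2, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward_value):
        # Incremental update of the mean reward observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward_value - self.values[arm]) / self.counts[arm]


def reward(correct, retrieval_steps, step_penalty=0.1):
    """Reward correctness, but charge a cost for every retrieval step used."""
    return (1.0 if correct else 0.0) - step_penalty * retrieval_steps


def simulate(rounds=5000, seed=0):
    # Hypothetical (accuracy, retrieval steps) per strategy; not from the paper.
    profiles = {
        "no_retrieval": (0.30, 0),
        "single_step": (0.80, 1),
        "multi_step": (0.85, 3),
    }
    bandit = EpsilonGreedyBandit(profiles, epsilon=0.2, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        accuracy, steps = profiles[arm]
        correct = env.random() < accuracy
        bandit.update(arm, reward(correct, steps))
    return bandit


if __name__ == "__main__":
    for arm, value in simulate().values.items():
        print(f"{arm}: estimated reward {value:.3f}")
```

With these made-up numbers the multi-step strategy is the most accurate, yet the step penalty makes single-step retrieval the better trade-off, which is the behaviour the reward design in the abstract is meant to encourage.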

[AI-16] Improved Large Language Model Jailbreak Detection via Pretrained Embeddings

Link: https://arxiv.org/abs/2412.01547
Authors: Erick Galinkin,Martin Sablotny
Keywords: capable agentic systems, agentic systems necessitates, systems necessitates research, customer service chat, service chat bots
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to AICS 2025: this https URL

Abstract:The adoption of large language models (LLMs) in many applications, from customer service chat bots and software development assistants to more capable agentic systems necessitates research into how to secure these systems. Attacks like prompt injection and jailbreaking attempt to elicit responses and actions from these models that are not compliant with the safety, privacy, or content policies of organizations using the model in their application. In order to counter abuse of LLMs for generating potentially harmful replies or taking undesirable actions, LLM owners must apply safeguards during training and integrate additional tools to block the LLM from generating text that abuses the model. Jailbreaking prompts play a vital role in convincing an LLM to generate potentially harmful content, making it important to identify jailbreaking attempts to block any further steps. In this work, we propose a novel approach to detect jailbreak prompts based on pairing text embeddings well-suited for retrieval with traditional machine learning classification algorithms. Our approach outperforms all publicly available methods from open source LLM security applications.

[AI-17] Towards Type Agnostic Cyber Defense Agents

Link: https://arxiv.org/abs/2412.01542
Authors: Erick Galinkin,Emmanouil Pountrourakis,Spiros Mancoridis
Keywords: ubiquitous across government, critical component, reinforcement learning, computing now ubiquitous, cybersecurity
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Submitted to AICS 2025: this https URL

Abstract:With computing now ubiquitous across government, industry, and education, cybersecurity has become a critical component for every organization on the planet. Due to this ubiquity of computing, cyber threats have continued to grow year over year, leading to labor shortages and a skills gap in cybersecurity. As a result, many cybersecurity product vendors and security organizations have looked to artificial intelligence to shore up their defenses. This work considers how to characterize attackers and defenders in one approach to the automation of cyber defense – the application of reinforcement learning. Specifically, we characterize the types of attackers and defenders in the sense of Bayesian games and, using reinforcement learning, derive empirical findings about how to best train agents that defend against multiple types of attackers.

[AI-18] Effectiveness of L2 Regularization in Privacy-Preserving Machine Learning

Link: https://arxiv.org/abs/2412.01541
Authors: Nikolaos Chandrinos(1),Iliana Loi(2),Panagiotis Zachos(2),Ioannis Symeonidis(1),Aristotelis Spiliotis(1),Maria Panou(1),Konstantinos Moustakas(2) ((1) Human Factors and Vehicle Technology, Hellenic Institute of Transport, Centre for Research and Technology Hellas, Thermi, Greece, (2) Wire Communications and Information Technology Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, Patras, Greece)
Keywords: Artificial intelligence, Membership Inference Attack, Membership Inference, handle sensitive data, status quo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Artificial intelligence, machine learning, and deep learning as a service have become the status quo for many industries, leading to the widespread deployment of models that handle sensitive data. The well-performing models that the industry seeks usually rely on a large volume of training data. However, the use of such data raises serious privacy concerns due to the potential risks of leaks of highly sensitive information. One prominent threat is the Membership Inference Attack, where adversaries attempt to deduce whether a specific data point was used in a model’s training process. An adversary’s ability to determine an individual’s presence represents a significant privacy threat, especially when related to a group of users sharing sensitive information. Hence, well-designed privacy-preserving machine learning solutions are critically needed in the industry. In this work, we compare the effectiveness of L2 regularization and differential privacy in mitigating Membership Inference Attack risks. Even though regularization techniques like L2 regularization are commonly employed to reduce overfitting, a condition that enhances the effectiveness of Membership Inference Attacks, their impact on mitigating these attacks has not been systematically explored.

[AI-19] CopyrightShield: Spatial Similarity Guided Backdoor Defense against Copyright Infringement in Diffusion Models

Link: https://arxiv.org/abs/2412.01528
Authors: Zhixiang Guo,Siyuan Liang,Aishan Liu,Dacheng Tao
Keywords: gained significant attention, significant attention due, remarkable data generation, data generation ability, copyright infringement
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The diffusion model has gained significant attention due to its remarkable data generation ability in fields such as image synthesis. However, its strong memorization and replication abilities with respect to the training data also make it a prime target for copyright infringement attacks. This paper provides an in-depth analysis of the spatial similarity of replication in diffusion models and leverages this key characteristic to design a method for detecting poisoning data. By employing a joint assessment of spatial-level and feature-level information from the detected segments, we effectively identify covertly dispersed poisoned samples. Building upon the detected poisoning data, we propose a novel defense method specifically targeting copyright infringement attacks by introducing a protection constraint term into the loss function to mitigate the impact of poisoning. Extensive experimental results demonstrate that our approach achieves an average F1 score of 0.709 in detecting copyright infringement backdoors, resulting in an average increase of 68.1% in First-Attack Epoch (FAE) and an average decrease of 51.4% in Copyright Infringement Rate (CIR) of the poisoned model, effectively defending against copyright infringement. Additionally, we introduce the concept of copyright feature inversion, which aids in determining copyright responsibility and expands the application scenarios of defense strategies.

[AI-20] Addressing Data Leakage in HumanEval Using Combinatorial Test Design

Link: https://arxiv.org/abs/2412.01526
Authors: Jeremy S. Bradbury,Riddhi More
Keywords: including Software Engineering, Software Engineering, large language models, including Software, tasks
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures

Abstract:The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common approach to assess LLMs with respect to their ability to solve problem-specific tasks as well as assess different versions of an LLM to solve tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from data leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the black-box nature of LLM training data which makes it difficult to even know if data leakage has occurred. To address the data leakage problem, we propose a new benchmark construction method where a benchmark is composed of template tasks that can be instantiated into new concrete tasks using combinatorial test design. Concrete tasks for the same template task must be different enough that data leakage has minimal impact and similar enough that the tasks are interchangeable with respect to performance evaluation. To assess our benchmark construction method, we propose HumanEval_T, an alternative benchmark to HumanEval that was constructed using template tasks and combinatorial test design.
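As a rough illustration of the template-task idea, the sketch below instantiates one template into several concrete tasks. The template text and parameter values are invented for this example and are not taken from HumanEval_T. It uses the full Cartesian product, the simplest combinatorial design; a real combinatorial test design would typically sample a smaller t-way covering set when there are many parameters.

```python
from itertools import product

# Hypothetical template task with two parameters (not from HumanEval_T).
TEMPLATE = "Write a function that returns the {agg} of the {kind} numbers in a list."

PARAMETERS = {
    "agg": ["sum", "maximum", "count"],
    "kind": ["even", "odd", "positive"],
}


def instantiate(template, parameters):
    """Expand a template into one concrete task per parameter combination."""
    names = list(parameters)
    tasks = []
    for values in product(*(parameters[name] for name in names)):
        tasks.append(template.format(**dict(zip(names, values))))
    return tasks


if __name__ == "__main__":
    for task in instantiate(TEMPLATE, PARAMETERS):
        print(task)
```

Each concrete task shares the same structure and difficulty but differs in surface form, so a memorized solution to one variant does not directly carry over to the others.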

[AI-21] Adversarial Attacks on Hyperbolic Networks

Link: https://arxiv.org/abs/2412.01495
Authors: Max van Spengler,Jan Zahálka,Pascal Mettes
Keywords: deep learning grows, grows in popularity, non-Euclidean geometry, PGD adversarial attacks, FGM and PGD
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As hyperbolic deep learning grows in popularity, so does the need for adversarial robustness in the context of such a non-Euclidean geometry. To this end, this paper proposes hyperbolic alternatives to the commonly used FGM and PGD adversarial attacks. Through interpretable synthetic benchmarks and experiments on existing datasets, we show how the existing and newly proposed attacks differ. Moreover, we investigate the differences in adversarial robustness between Euclidean and fully hyperbolic networks. We find that these networks suffer from different types of vulnerabilities and that the newly proposed hyperbolic attacks cannot address these differences. Therefore, we conclude that the shifts in adversarial robustness are due to the models learning distinct patterns resulting from their different geometries.
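For readers unfamiliar with the two baseline attacks named in this abstract, here is a minimal sketch of the standard Euclidean FGM and PGD attacks on a toy linear classifier with logistic loss. It does not show the paper's hyperbolic variants; the model, the loss, and all parameter values are assumptions chosen only to keep the example self-contained.

```python
import math


def loss_gradient(x, w, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w.x)), y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]


def fgm(x, w, y, eps):
    """Fast Gradient Method: one step of L2 length eps along the loss gradient."""
    g = loss_gradient(x, w, y)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    return [xi + eps * gi / norm for xi, gi in zip(x, g)]


def pgd(x, w, y, eps, step, iters):
    """Projected Gradient Descent: repeated small FGM steps, projected back
    onto the L2 ball of radius eps around the original input."""
    adv = list(x)
    for _ in range(iters):
        adv = fgm(adv, w, y, step)
        delta = [a - xi for a, xi in zip(adv, x)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > eps:
            adv = [xi + d * eps / norm for xi, d in zip(x, delta)]
    return adv


if __name__ == "__main__":
    x, w, y = [1.0, 1.0], [1.0, 0.0], 1
    print("FGM:", fgm(x, w, y, eps=0.5))
    print("PGD:", pgd(x, w, y, eps=0.5, step=0.2, iters=5))
```

Both attacks move the input in straight Euclidean lines, which is the assumption the paper revisits for networks whose representations live in hyperbolic space.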

[AI-22] Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows

Link: https://arxiv.org/abs/2412.01490
Authors: Jialin Wang,Zhihua Duan
Keywords: memory-distributed data sets, load data mining, Apache Spark, machine learning, Spark
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Apache Spark is better suited for load data mining and machine learning that require a lot of iteration by using memory-distributed data sets. Due to the complexity of Spark, the high learning threshold of Scala, and the low reusability of its code, this paper designs and implements a Spark-based visual process AI+machine learning method under a big data environment. On the one hand, it designs component models to describe the basic steps of machine learning, including data preprocessing, feature processing, and model training. Practice and validate evaluation. On the other hand, a visual process modeling tool is provided to support analysts to design machine learning processes, which can be translated automatically into Spark platform code for efficient execution. This tool can greatly improve the AI machine learning efficiency of the Spark platform. This paper introduces the method theory, key technologies, and effectiveness of the tool. This paper explores the application of Spark in the field of large model agents. Langchain, as an open-source framework, is committed to simplifying the development of end-to-end applications based on language models. It provides interfaces for interacting with a variety of large language models, optimizing prompt engineering, and endowing large models with the ability to invoke external tools. LangGraph demonstrates its powerful state management and graph construction capabilities by defining node functions and graphs to build complex agent applications. The development of Spark agent applications based on LangGraph has further promoted the development of AI applications in the big data analysis environment.

[AI-23] FastRM: An efficient and automatic explainability framework for multimodal generative models

Link: https://arxiv.org/abs/2412.01487
Authors: Gabriela Ben-Melech Stan,Estelle Aflalo,Man Luo,Shachar Rosenman,Tiep Le,Sayak Paul,Shao-Yen Tseng,Vasudev Lal
Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, Language Models
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Vision Language Models (LVLMs) have become highly capable of reasoning over human prompts and visual inputs, they are still prone to producing responses that contain misinformation. Identifying incorrect responses that are not grounded in evidence has become a crucial task in building trustworthy AI. Explainability methods such as gradient-based relevancy maps on LVLM outputs can provide insight into the decision process of models; however, these methods are often computationally expensive and not suited for on-the-fly validation of outputs. In this work, we propose FastRM, an effective way of predicting the explainable Relevancy Maps of LVLMs. Experimental results show that employing FastRM leads to a 99.8% reduction in compute time for relevancy map generation and a 44.4% reduction in memory footprint for the evaluated LVLM, making explainable AI more efficient and practical, thereby facilitating its deployment in real-world applications.

[AI-24] Misalignments in AI Perception: Quantitative Findings and Visual Mapping of How Experts and the Public Differ in Expectations and Risks, Benefits, and Value Judgments

Link: https://arxiv.org/abs/2412.01459
Authors: Philipp Brauner,Felix Glawe,Gian Luca Liehner,Luisa Vervier,Martina Ziefle
Keywords: Artificial Intelligence, raising critical questions, diverse societal domains, transforming diverse societal, raising critical
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Artificial Intelligence (AI) is transforming diverse societal domains, raising critical questions about its risks and benefits and the misalignments between public expectations and academic visions. This study examines how the general public (N=1110) – people using or being affected by AI – and academic AI experts (N=119) – people shaping AI development – perceive AI’s capabilities and impact across 71 scenarios, including sustainability, healthcare, job performance, societal divides, art, and warfare. Participants evaluated each scenario on four dimensions: expected probability, perceived risk and benefit, and overall sentiment (or value). The findings reveal significant quantitative differences: experts anticipate higher probabilities, perceive lower risks, report greater utility, and express more favorable sentiment toward AI compared to the non-experts. Notably, risk-benefit tradeoffs differ: the public assigns risk half the weight of benefits, while experts assign it only a third. Visual maps of these evaluations highlight areas of convergence and divergence, identifying potential sources of public concern. These insights offer actionable guidance for researchers and policymakers to align AI development with societal values, fostering public trust and informed governance.

[AI-25] LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

Link: https://arxiv.org/abs/2412.01441
Authors: Anian Ruoss,Fabio Pardo,Harris Chan,Bonnie Li,Volodymyr Mnih,Tim Genewein
Keywords: possess good factual, good factual knowledge, increasingly general capabilities, Today largest foundation, largest foundation models
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Today’s largest foundation models have increasingly general capabilities, yet when used as agents, they often struggle with simple reasoning and decision-making tasks, even though they possess good factual knowledge of the task and how to solve it. In this paper, we present a benchmark to pressure-test these models’ multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether they can learn from a large number of expert demonstrations in their context. We evaluate a wide range of state-of-the-art frontier models as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We measure the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, o1-mini, and o1-preview under increasing amounts of expert demonstrations in the context, from no demonstrations up to 512 full episodes, pushing these models’ multimodal long-context reasoning capabilities to their limits. Across our tasks, today’s frontier models rarely manage to fully reach expert performance, showcasing the difficulty of our benchmark. Presenting more demonstrations often has little effect, but some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. Overall, our results suggest that even today’s most capable models often struggle to imitate desired behavior by generalizing purely from in-context demonstrations. To help quantify the impact of other approaches and future innovations aiming to tackle this problem, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.

[AI-26] Reject Threshold Adaptation for Open-Set Model Attribution of Deepfake Audio

Link: https://arxiv.org/abs/2412.01425
Authors: Xinrui Yan,Jiangyan Yi,Jianhua Tao,Yujie Chen,Hao Gu,Guanjun Li,Junzuo Zhou,Yong Ren,Tao Xu
Keywords: emerging research topic, Open environment oriented, environment oriented open, open set model, oriented open set
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by ISCSLP 2024

Abstract:Open environment oriented open set model attribution of deepfake audio is an emerging research topic, aiming to identify the generation models of deepfake audio. Most previous work requires manually setting a rejection threshold for unknown classes to compare with predicted probabilities. However, models often overfit training instances and generate overly confident predictions. Moreover, thresholds that effectively distinguish unknown categories in the current dataset may not be suitable for identifying known and unknown categories in another data distribution. To address these issues, we propose a novel framework for open set model attribution of deepfake audio with rejection threshold adaptation (ReTA). Specifically, the reconstruction error learning module trains by combining the representation of system fingerprints with labels corresponding to either the target class or a randomly chosen other class label. This process generates matching and non-matching reconstructed samples, establishing the reconstruction error distributions for each class and laying the foundation for the reject threshold calculation module. The reject threshold calculation module utilizes Gaussian probability estimation to fit the distributions of matching and non-matching reconstruction errors. It then computes adaptive reject thresholds for all classes through probability minimization criteria. The experimental results demonstrate the effectiveness of ReTA in improving the open set model attribution of deepfake audio.
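The threshold calculation described in this abstract can be sketched for a single class as follows. This is an interpretation, not the authors' code: it assumes that "probability minimization" means minimizing the sum of the false-rejection and false-acceptance probabilities under the two fitted Gaussians, it finds the minimum by a simple grid search, and the error values in the example are made up.

```python
from statistics import NormalDist


def adaptive_threshold(match_errors, nonmatch_errors, steps=1000):
    """Fit one Gaussian to matching and one to non-matching reconstruction
    errors, then pick the threshold with the lowest total error probability.

    Assumes matching errors are smaller on average than non-matching ones.
    """
    match = NormalDist.from_samples(match_errors)
    nonmatch = NormalDist.from_samples(nonmatch_errors)
    lo, hi = match.mean, nonmatch.mean
    best_t, best_risk = lo, float("inf")
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        # P(known sample rejected) + P(unknown sample accepted)
        risk = (1.0 - match.cdf(t)) + nonmatch.cdf(t)
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t


if __name__ == "__main__":
    # Hypothetical reconstruction errors for one class.
    match_errors = [0.10, 0.12, 0.08, 0.11, 0.09]
    nonmatch_errors = [0.50, 0.55, 0.45, 0.52, 0.48]
    t = adaptive_threshold(match_errors, nonmatch_errors)
    print(f"reject samples whose reconstruction error exceeds {t:.3f}")
```

Running this once per class yields a separate threshold for each class, so no global rejection threshold has to be set by hand.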

[AI-27] CSP-AIT-Net: A contrastive learning-enhanced spatiotemporal graph attention framework for short-term metro OD flow prediction with asynchronous inflow tracking

Link: https://arxiv.org/abs/2412.01419
Authors: Yichen Wang,Chengcheng Yu
Keywords: Accurate origin-destination, improving passenger experiences, flow prediction, flow, Accurate
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Accurate origin-destination (OD) passenger flow prediction is crucial for enhancing metro system efficiency, optimizing scheduling, and improving passenger experiences. However, current models often fail to effectively capture the asynchronous departure characteristics of OD flows and underutilize the inflow and outflow data, which limits their prediction accuracy. To address these issues, we propose CSP-AIT-Net, a novel spatiotemporal graph attention framework designed to enhance OD flow prediction by incorporating asynchronous inflow tracking and advanced station semantics representation. Our framework restructures the OD flow prediction paradigm by first predicting outflows and then decomposing OD flows using a spatiotemporal graph attention mechanism. To enhance computational efficiency, we introduce a masking mechanism and propose asynchronous passenger flow graphs that integrate inflow and OD flow with conservation constraints. Furthermore, we employ contrastive learning to extract high-dimensional land use semantics of metro stations, enriching the contextual understanding of passenger mobility patterns. Validation on the Shanghai metro system demonstrates improvement in short-term OD flow prediction accuracy over state-of-the-art methods. This work contributes to enhancing metro operational efficiency, scheduling precision, and overall system safety.

[AI-28] Learning Elementary Cellular Automata with Transformers

Link: https://arxiv.org/abs/2412.01417
Authors: Mikhail Burtsev
Keywords: Large Language Models, Large Language, Language Models demonstrate, demonstrate remarkable mathematical, remarkable mathematical capabilities
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Large Language Models demonstrate remarkable mathematical capabilities but at the same time struggle with abstract reasoning and planning. In this study, we explore whether Transformers can learn to abstract and generalize the rules governing Elementary Cellular Automata. By training Transformers on state sequences generated with random initial conditions and local rules, we show that they can generalize across different Boolean functions of fixed arity, effectively abstracting the underlying rules. While the models achieve high accuracy in next-state prediction, their performance declines sharply in multi-step planning tasks without intermediate context. Our analysis reveals that including future states or rule prediction in the training loss enhances the models’ ability to form internal representations of the rules, leading to improved performance in longer planning horizons and autoregressive generation. Furthermore, we confirm that increasing the model’s depth plays a crucial role in the extended sequential computations required for complex reasoning tasks. This highlights the potential to improve LLMs by including longer horizons in the loss function, as well as by incorporating recurrence and adaptive computation time for dynamic control of model depth.
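To make the training data concrete, the sketch below generates state sequences for an Elementary Cellular Automaton using the standard Wolfram rule numbering. The periodic boundary condition and the function names are assumptions made for this example; the abstract does not specify the paper's exact setup.

```python
def eca_step(state, rule):
    """Apply one step of an elementary cellular automaton.

    `state` is a list of 0/1 cells, `rule` is the Wolfram rule number (0-255).
    Cells wrap around at the edges (periodic boundary, an assumption here).
    """
    n = len(state)
    nxt = []
    for i in range(n):
        left, center, right = state[i - 1], state[i], state[(i + 1) % n]
        # The 3-cell neighborhood indexes one bit of the 8-bit rule number.
        neighborhood = (left << 2) | (center << 1) | right
        nxt.append((rule >> neighborhood) & 1)
    return nxt


def eca_sequence(initial, rule, steps):
    """Return the initial state followed by `steps` successive states."""
    states = [list(initial)]
    for _ in range(steps):
        states.append(eca_step(states[-1], rule))
    return states


if __name__ == "__main__":
    for row in eca_sequence([0, 0, 0, 1, 0, 0, 0], rule=90, steps=3):
        print("".join("#" if cell else "." for cell in row))
```

Consecutive rows of such a sequence form the (state, next state) pairs used in a next-state prediction task, while predicting a row several steps ahead without the rows in between corresponds to the multi-step planning setting in which the abstract reports a sharp drop.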

[AI-29] A Survey on Deep Neural Networks in Collaborative Filtering Recommendation Systems

Link: https://arxiv.org/abs/2412.01378
Authors: Pang Li,Shahrul Azman Mohd Noah,Hafiz Mohd Sarim
Keywords: Neural Networks, Deep Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks, Graph Neural Networks
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages, 12 figures

Abstract:This survey provides an examination of the use of Deep Neural Networks (DNN) in Collaborative Filtering (CF) recommendation systems. As the digital world increasingly relies on data-driven approaches, traditional CF techniques face limitations in scalability and flexibility. DNNs can address these challenges by effectively modeling complex, non-linear relationships within the data. We begin by exploring the fundamental principles of both collaborative filtering and deep neural networks, laying the groundwork for understanding their integration. Subsequently, we review key advancements in the field, categorizing various deep learning models that enhance CF systems, including Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Graph Neural Networks (GNN), autoencoders, Generative Adversarial Networks (GAN), and Restricted Boltzmann Machines (RBM). The paper also discusses evaluation protocols, various publicly available auxiliary information, and data features. Furthermore, the survey concludes with a discussion of the challenges and future research opportunities in enhancing collaborative filtering systems with deep learning.

[AI-30] Convolutional Transformer Neural Collaborative Filtering

Link: https://arxiv.org/abs/2412.01376
Authors: Pang Li, Shahrul Azman Mohd Noah, Hafiz Mohd Sarim
Keywords: Neural Collaborative Filtering, Transformer Neural Collaborative, Convolutional Transformer Neural, Convolutional Neural Networks, introduce Convolutional Transformer
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures

Abstract:In this study, we introduce Convolutional Transformer Neural Collaborative Filtering (CTNCF), a novel approach aimed at enhancing recommendation systems by effectively capturing high-order structural information in user-item interactions. CTNCF represents a significant advancement over the traditional Neural Collaborative Filtering (NCF) model by seamlessly integrating Convolutional Neural Networks (CNNs) and Transformer layers. This sophisticated integration enables the model to adeptly capture and understand complex interaction patterns inherent in recommendation systems. Specifically, CNNs are employed to extract local features from user and item embeddings, allowing the model to capture intricate spatial dependencies within the data. Furthermore, the utilization of Transformer layers enables the model to capture long-range dependencies and interactions among user and item features, thereby enhancing its ability to understand the underlying relationships in the data. To validate the effectiveness of our proposed CTNCF framework, we conduct extensive experiments on two real-world datasets. The results demonstrate that CTNCF significantly outperforms state-of-the-art approaches, highlighting its efficacy in improving recommendation system performance.

[AI-31] Research on Cervical Cancer p16/Ki-67 Immunohistochemical Dual-Staining Image Recognition Algorithm Based on YOLO

Link: https://arxiv.org/abs/2412.01372
Authors: Xiao-Jun Wu, Cai-Jun Zhao, Chun Meng, Hang Wang
Keywords: dual staining method, cervical cancer screening, dual staining, sensitivity and specificity, staining method
Categories: Artificial Intelligence (cs.AI)

Abstract:The p16/Ki-67 dual staining method is a new approach for cervical cancer screening with high sensitivity and specificity. However, there are issues of mis-detection and inaccurate recognition when the YOLOv5s algorithm is directly applied to dual-stained cell images. This paper proposes a novel cervical cancer dual-stained image recognition (DSIR-YOLO) model based on YOLOv5. By fusing the Swin-Transformer module, GAM attention mechanism, multi-scale feature fusion, and EIoU loss function, the detection performance is significantly improved, with mAP@0.5 and mAP@0.5:0.95 reaching 92.6% and 70.5%, respectively. Compared with YOLOv5s in five-fold cross-validation, the accuracy, recall, mAP@0.5, and mAP@0.5:0.95 of the improved algorithm are increased by 2.3%, 4.1%, 4.3%, and 8.0%, respectively, with smaller variances and higher stability. Compared with other detection algorithms, DSIR-YOLO in this paper sacrifices some performance requirements to improve the network recognition effect. In addition, the influence of dataset quality on the detection results is studied. By controlling the sealing property of pixels, scale difference, unlabelled cells, and diagonal annotation, the model detection accuracy, recall, mAP@0.5, and mAP@0.5:0.95 are improved by 13.3%, 15.3%, 18.3%, and 30.5%, respectively.

[AI-32] An overview of diffusion models for generative artificial intelligence

Link: https://arxiv.org/abs/2412.01371
Authors: Davide Gallon, Arnulf Jentzen, Philippe von Wurstemberger
Keywords: generative artificial intelligence, mathematically rigorous introduction, diffusion probabilistic models, denoising diffusion probabilistic, diffusion probabilistic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 56 pages, 5 figures

Abstract:This article provides a mathematically rigorous introduction to denoising diffusion probabilistic models (DDPMs), sometimes also referred to as diffusion probabilistic models or diffusion models, for generative artificial intelligence. We provide a detailed basic mathematical framework for DDPMs and explain the main ideas behind training and generation procedures. In this overview article we also review selected extensions and improvements of the basic framework from the literature such as improved DDPMs, denoising diffusion implicit models, classifier-free diffusion guidance models, and latent diffusion models.

[AI-33] Behavior Backdoor for Deep Learning Models

Link: https://arxiv.org/abs/2412.01369
Authors: Jiakai Wang, Pengfei Zhang, Renshuai Tao, Jian Yang, Hao Liu, Xianglong Liu, Yunchao Wei, Yao Zhao
Keywords: artificial intelligence technology, increasingly important role, main development directions, pre-train large models, backdoor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The various post-processing methods for deep-learning-based models, such as quantification, pruning, and fine-tuning, play an increasingly important role in artificial intelligence technology, with pre-train large models as one of the main development directions. However, this popular series of post-processing behaviors targeting pre-training deep models has become a breeding ground for new adversarial security issues. In this study, we take the first step towards “behavioral backdoor” attack, which is defined as a behavior-triggered backdoor model training procedure, to reveal a new paradigm of backdoor attacks. In practice, we propose the first pipeline of implementing behavior backdoor, i.e., the Quantification Backdoor (QB) attack, upon exploiting model quantification method as the set trigger. Specifically, to adapt the optimization goal of behavior backdoor, we introduce the behavior-driven backdoor object optimizing method by a bi-target behavior backdoor training loss, thus we could guide the poisoned model optimization direction. To update the parameters across multiple models, we adopt the address-shared backdoor model training, thereby the gradient information could be utilized for multimodel collaborative optimization. Extensive experiments have been conducted on different models, datasets, and tasks, demonstrating the effectiveness of this novel backdoor attack and its potential application threats.

[AI-34] Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability

Link: https://arxiv.org/abs/2412.01365
Authors: Wen-Dong Jiang, Chih-Yung Chang, Show-Jane Yen, Diptendu Sinha Roy
Keywords: achieved remarkable success, achieved remarkable, remarkable success, success in processing, processing and managing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Deep learning has achieved remarkable success in processing and managing unstructured data. However, its “black box” nature imposes significant limitations, particularly in sensitive application domains. While existing interpretable machine learning methods address some of these issues, they often fail to adequately consider feature correlations and provide insufficient evaluation of model decision paths. To overcome these challenges, this paper introduces Real Explainer (RealExp), an interpretability computation method that decouples the Shapley Value into individual feature importance and feature correlation importance. By incorporating feature similarity computations, RealExp enhances interpretability by precisely quantifying both individual feature contributions and their interactions, leading to more reliable and nuanced explanations. Additionally, this paper proposes a novel interpretability evaluation criterion focused on elucidating the decision paths of deep learning models, going beyond traditional accuracy-based metrics. Experimental validations on two unstructured data tasks – image classification and text sentiment analysis – demonstrate that RealExp significantly outperforms existing methods in interpretability. Case studies further illustrate its practical value: in image classification, RealExp aids in selecting suitable pre-trained models for specific tasks from an interpretability perspective; in text classification, it enables the optimization of models and approximates the performance of a fine-tuned GPT-Ada model using traditional bag-of-words approaches.

[AI-35] Su-RoBERTa: A Semi-supervised Approach to Predicting Suicide Risk through Social Media using Base Language Models

Link: https://arxiv.org/abs/2412.01353
Authors: Chayan Tank, Shaina Mehta, Sarthak Pol, Vinayak Katoch, Avinash Anand, Raj Jaiswal, Rajiv Ratn Shah
Keywords: social media platforms, recent times, people are posting, Base language models, media platforms
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 8 pages, 7 figures, Accepted at IEEE International Conference on Big Data (IEEE BigData 2024)

Abstract:In recent times, more and more people are posting about their mental states across various social media platforms. Leveraging this data, AI-based systems can be developed that help in assessing the mental health of individuals, such as suicide risk. This paper is a study done on suicidal risk assessments using Reddit data leveraging Base language models to identify patterns from social media posts. We have demonstrated that using smaller language models, i.e., less than 500M parameters, can also be effective in contrast to LLMs with greater than 500M parameters. We propose Su-RoBERTa, a fine-tuned RoBERTa on suicide risk prediction task that utilized both the labeled and unlabeled Reddit data and tackled class imbalance by data augmentation using GPT-2 model. Our Su-RoBERTa model attained a 69.84% weighted F1 score during the final evaluation. This paper demonstrates the effectiveness of Base language models for the analysis of the risk factors related to mental health with an efficient computation pipeline.

[AI-36] A multi-criteria decision support system to evaluate the effectiveness of training courses on citizens employability

Link: https://arxiv.org/abs/2412.01351
Authors: Maria C. Bas, Vicente J. Bolos, Alvaro E. Prieto, Roberto Rodriguez-Echeverria, Fernando Sanchez-Figueroa
Keywords: lifelong learning, lives of employed, employed and unemployed, unweighted TOPSIS method, TOPSIS method
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 24 pages, 12 figures

Abstract:This study examines the impact of lifelong learning on the professional lives of employed and unemployed individuals. Lifelong learning is a crucial factor in securing employment or enhancing one’s existing career prospects. To achieve this objective, this study proposes the implementation of a multi-criteria decision support system for the evaluation of training courses in accordance with their capacity to enhance the employability of the students. The methodology is delineated in four stages. Firstly, a “working life curve” was defined to provide a quantitative description of an individual’s working life. Secondly, an analysis based on K-medoids clustering defined a control group for each individual for comparison. Thirdly, the performance of a course according to each of the four predefined criteria was calculated using a t-test to determine the mean performance value of those who took the course. Ultimately, the unweighted TOPSIS method was used to evaluate the efficacy of the various training courses in relation to the four criteria. This approach effectively addresses the challenge of using extensive datasets within a system while facilitating the application of a multi-criteria unweighted TOPSIS method. The results of the multi-criteria TOPSIS method indicated that training courses related to the professional fields of administration and management, hostel and tourism and community and sociocultural services have a positive impact on employability and improving the working conditions of citizens. However, courses that demonstrate the greatest effectiveness in ranking are the least demanded by citizens. The results will help policymakers evaluate the effectiveness of each training course offered by the regional government.

[AI-37] Hierarchical Object-Oriented POMDP Planning for Object Rearrangement ICLR2025

Link: https://arxiv.org/abs/2412.01348
Authors: Rajesh Mangannavar, Alan Fern, Prasad Tadepalli
Keywords: solving multi-object rearrangement, online planning framework, Markov Decision Process, Observed Markov Decision, Partially Observed Markov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 17 pages, 2 Figures. Preprint. Under review at ICLR 2025

Abstract:We present an online planning framework for solving multi-object rearrangement problems in partially observable, multi-room environments. Current object rearrangement solutions, primarily based on Reinforcement Learning or hand-coded planning methods, often lack adaptability to diverse challenges. To address this limitation, we introduce a novel Hierarchical Object-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning approach. This approach comprises (a) an object-oriented POMDP planner generating sub-goals, (b) a set of low-level policies for sub-goal achievement, and (c) an abstraction system converting the continuous low-level world into a representation suitable for abstract planning. We evaluate our system on varying numbers of objects, rooms, and problem types in AI2-THOR simulated environments with promising results.

[AI-38] Explainable fault and severity classification for rolling element bearings using Kolmogorov-Arnold networks

Link: https://arxiv.org/abs/2412.01322
Authors: Spyros Rigas, Michalis Papachristou, Ioannis Sotiropoulos, Georgios Alexandridis
Keywords: Rolling element bearings, Rolling element, performance directly influencing, industrial systems, critical components
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Rolling element bearings are critical components of rotating machinery, with their performance directly influencing the efficiency and reliability of industrial systems. At the same time, bearing faults are a leading cause of machinery failures, often resulting in costly downtime, reduced productivity, and, in extreme cases, catastrophic damage. This study presents a methodology that utilizes Kolmogorov-Arnold Networks to address these challenges through automatic feature selection, hyperparameter tuning and interpretable fault analysis within a unified framework. By training shallow network architectures and minimizing the number of selected features, the framework produces lightweight models that deliver explainable results through feature attribution and symbolic representations of their activation functions. Validated on two widely recognized datasets for bearing fault diagnosis, the framework achieved perfect F1-Scores for fault detection and high performance in fault and severity classification tasks, including 100% F1-Scores in most cases. Notably, it demonstrated adaptability by handling diverse fault types, such as imbalance and misalignment, within the same dataset. The symbolic representations enhanced model interpretability, while feature attribution offered insights into the optimal feature types or signals for each studied task. These results highlight the framework’s potential for practical applications, such as real-time machinery monitoring, and for scientific research requiring efficient and explainable models.

[AI-39] RL2: Reinforce Large Language Model to Assist Safe Reinforcement Learning for Energy Management of Active Distribution Networks

Link: https://arxiv.org/abs/2412.01303
Authors: Xu Yang, Chenhui Lin, Haotian Liu, Wenchuan Wu
Keywords: active distribution networks, traditional distribution networks, increasingly prominent compared, distribution networks, large-scale distributed energy
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Abstract:As large-scale distributed energy resources are integrated into the active distribution networks (ADNs), effective energy management in ADNs becomes increasingly prominent compared to traditional distribution networks. Although advanced reinforcement learning (RL) methods, which alleviate the burden of complicated modelling and optimization, have greatly improved the efficiency of energy management in ADNs, safety becomes a critical concern for RL applications in real-world problems. Since the design and adjustment of penalty functions, which correspond to operational safety constraints, requires extensive domain knowledge in RL and power system operation, the emerging ADN operators call for a more flexible and customized approach to address the penalty functions so that the operational safety and efficiency can be further enhanced. Empowered with strong comprehension, reasoning, and in-context learning capabilities, large language models (LLMs) provide a promising way to assist safe RL for energy management in ADNs. In this paper, we introduce the LLM to comprehend operational safety requirements in ADNs and generate corresponding penalty functions. In addition, we propose an RL2 mechanism to refine the generated functions iteratively and adaptively through multi-round dialogues, in which the LLM agent adjusts the functions’ pattern and parameters based on training and test performance of the downstream RL agent. The proposed method significantly reduces the intervention of the ADN operators. Comprehensive test results demonstrate the effectiveness of the proposed method.

[AI-40] FedAH: Aggregated Head for Personalized Federated Learning

Link: https://arxiv.org/abs/2412.01295
Authors: Pengzhan Zhou, Yuepeng He, Yijun Zhai, Kaixin Gao, Chao Chen, Zhida Qin, Chong Zhang, Songtao Guo
Keywords: collaborative learning capabilities, Federated Learning, Personalized Federated Learning, called Federated Learning, head
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 4 figures

Abstract:Recently, Federated Learning (FL) has gained popularity for its privacy-preserving and collaborative learning capabilities. Personalized Federated Learning (PFL), building upon FL, aims to address the issue of statistical heterogeneity and achieve personalization. Personalized-head-based PFL is a common and effective PFL method that splits the model into a feature extractor and a head, where the feature extractor is collaboratively trained and shared, while the head is locally trained and not shared. However, retaining the head locally, although achieving personalization, prevents the model from learning global knowledge in the head, thus affecting the performance of the personalized model. To solve this problem, we propose a novel PFL method called Federated Learning with Aggregated Head (FedAH), which initializes the head with an Aggregated Head at each iteration. The key feature of FedAH is to perform element-level aggregation between the local model head and the global model head to introduce global information from the global model head. To evaluate the effectiveness of FedAH, we conduct extensive experiments on five benchmark datasets in the fields of computer vision and natural language processing. FedAH outperforms ten state-of-the-art FL methods in terms of test accuracy by 2.87%. Additionally, FedAH maintains its advantage even in scenarios where some clients drop out unexpectedly. Our code is open-accessed at this https URL.
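
The element-level aggregation between a local head and a global head described in the abstract could look roughly like the sketch below. This is a guess at the general shape only, under stated assumptions: the convex-combination form, the per-element weights `mix`, and the name `aggregate_head` are hypothetical and are not taken from the paper, which may compute or learn its weights differently.

```python
def aggregate_head(local_head, global_head, mix):
    """Element-wise blend of two flattened head parameter vectors.

    mix[i] in [0, 1] controls how much of the global value enters element i
    (0 keeps the local value, 1 takes the global value). The convex
    combination is an assumption of this sketch.
    """
    if not (len(local_head) == len(global_head) == len(mix)):
        raise ValueError("all three vectors must have the same length")
    blended = []
    for local_w, global_w, m in zip(local_head, global_head, mix):
        if not 0.0 <= m <= 1.0:
            raise ValueError("mixing weights must lie in [0, 1]")
        blended.append((1.0 - m) * local_w + m * global_w)
    return blended
```

In this reading, a client would initialize its head from `aggregate_head(...)` at the start of each round instead of keeping a purely local head.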

[AI-41] Learning Smooth Distance Functions via Queries

Link: https://arxiv.org/abs/2412.01290
Authors: Akash Kumar, Sanjoy Dasgupta
Keywords: query-based learning framework, pose triplet queries, establish formal guarantees, query complexity required, learning distance functions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
Comments: 40 pages, 1 figure

Abstract:In this work, we investigate the problem of learning distance functions within the query-based learning framework, where a learner is able to pose triplet queries of the form: “Is x_i closer to x_j or x_k?” We establish formal guarantees on the query complexity required to learn smooth, but otherwise general, distance functions under two notions of approximation: ω-additive approximation and (1 + ω)-multiplicative approximation. For the additive approximation, we propose a global method whose query complexity is quadratic in the size of a finite cover of the sample space. For the (stronger) multiplicative approximation, we introduce a method that combines global and local approaches, utilizing multiple Mahalanobis distance functions to capture local geometry. This method has a query complexity that scales quadratically with both the size of the cover and the ambient space dimension of the sample space.
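
To make the triplet-query setting concrete, the sketch below answers a query of the form described in the abstract under a single Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y)). The function names, the tie-breaking rule, and the assumption that M is a positive semidefinite matrix given as nested lists are illustrative choices, not the paper's algorithm.

```python
import math


def mahalanobis(x, y, M):
    """Mahalanobis distance between x and y for a PSD matrix M (list of rows)."""
    diff = [a - b for a, b in zip(x, y)]
    dim = len(diff)
    quad = sum(diff[i] * M[i][j] * diff[j] for i in range(dim) for j in range(dim))
    # Clamp tiny negative values caused by rounding before taking the root.
    return math.sqrt(max(quad, 0.0))


def triplet_query(x_i, x_j, x_k, M):
    """Answer the query: is x_i closer to x_j ('j') or to x_k ('k')?

    Ties are resolved in favor of 'k' (an arbitrary choice in this sketch).
    """
    return "j" if mahalanobis(x_i, x_j, M) < mahalanobis(x_i, x_k, M) else "k"
```

Changing M changes the answers: with the identity matrix the oracle reduces to Euclidean comparisons, while a matrix that stretches one axis can flip which point counts as closer.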

[AI-42] FedPAW: Federated Learning with Personalized Aggregation Weights for Urban Vehicle Speed Prediction

Link: https://arxiv.org/abs/2412.01281
Authors: Yuepeng He, Pengzhan Zhou, Yijun Zhai, Fang Qu, Zhida Qin, Mingyan Li, Songtao Guo
Keywords: intelligent transportation systems, Vehicle speed prediction, accurately predicting future, transportation systems, promoting more reliable
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 12 pages, 10 figures

Abstract:Vehicle speed prediction is crucial for intelligent transportation systems, promoting more reliable autonomous driving by accurately predicting future vehicle conditions. Due to variations in drivers’ driving styles and vehicle types, speed predictions for different target vehicles may significantly differ. Existing methods may not realize personalized vehicle speed prediction while protecting drivers’ data privacy. We propose a Federated learning framework with Personalized Aggregation Weights (FedPAW) to overcome these challenges. This method captures client-specific information by measuring the weighted mean squared error between the parameters of local models and global models. The server sends tailored aggregated models to clients instead of a single global model, without incurring additional computational and communication overhead for clients. To evaluate the effectiveness of FedPAW, we collected driving data in urban scenarios using the autonomous driving simulator CARLA, employing an LSTM-based Seq2Seq model with a multi-head attention mechanism to predict the future speed of target vehicles. The results demonstrate that our proposed FedPAW ranks lowest in prediction error within the time horizon of 10 seconds, with a 0.8% reduction in test MAE, compared to eleven representative benchmark baselines. The source code of FedPAW and dataset CarlaVSP are open-accessed at: this https URL and this https URL.

[AI-43] Uncertainty-Aware Artificial Intelligence for Gear Fault Diagnosis in Motor Drives

Link: https://arxiv.org/abs/2412.01272
Authors: Subham Sahoo, Huai Wang, Frede Blaabjerg
Keywords: Bayesian neural networks, drives using Bayesian, Bayesian neural, paper introduces, motor drives
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: The manuscript has been accepted for publication in 2025 IEEE Applied Power Electronics Conference and Exposition (APEC)

Abstract:This paper introduces a novel approach to quantify the uncertainties in fault diagnosis of motor drives using Bayesian neural networks (BNN). Conventional data-driven approaches used for fault diagnosis often rely on point-estimate neural networks, which merely provide deterministic outputs and fail to capture the uncertainty associated with the inference process. In contrast, BNNs offer a principled framework to model uncertainty by treating network weights as probability distributions rather than fixed values. It offers several advantages: (a) improved robustness to noisy data, (b) enhanced interpretability of model predictions, and (c) the ability to quantify uncertainty in the decision-making processes. To test the robustness of the proposed BNN, it has been tested under a conservative dataset of gear fault data from an experimental prototype of three fault types at first, and is then incrementally trained on new fault classes and datasets to explore its uncertainty quantification features and model interpretability under noisy data and unseen fault scenarios.

[AI-44] Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

Link: https://arxiv.org/abs/2412.01250
Authors: Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, Yiming Wang
Keywords: Existing embodied instance, Existing embodied, embodied instance goal, Vision Language Models, Large Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Existing embodied instance goal navigation tasks, driven by natural language, assume human users to provide complete and nuanced instance descriptions prior to the navigation, which can be impractical in the real world as human instructions might be brief and ambiguous. To bridge this gap, we propose a new task, Collaborative Instance Navigation (CoIN), with dynamic agent-human interaction during navigation to actively resolve uncertainties about the target instance in natural, template-free, open-ended dialogues. To address CoIN, we propose a novel method, Agent-user Interaction with UncerTainty Awareness (AIUTA), leveraging the perception capability of Vision Language Models (VLMs) and the capability of Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue to obtain a complete and accurate observation description, while a novel uncertainty estimation technique mitigates inaccurate VLM perception. Then, an Interaction Trigger module determines whether to ask a question to the user, continue or halt navigation, minimizing user input. For evaluation, we introduce CoIN-Bench, a benchmark supporting both real and simulated humans. AIUTA achieves competitive performance in instance navigation against state-of-the-art methods, demonstrating great flexibility in handling user inputs.

[AI-45] Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

Link: https://arxiv.org/abs/2412.01245
Authors: Jinouwen Zhang, Rongkun Xue, Yazhe Niu, Yun Chen, Jing Yang, Hongsheng Li, Yu Liu
Keywords: continuous action spaces, achieved remarkable success, drawing significant interest, Generative Model Policy, diffusion models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.

[AI-46] Best Practices for Large Language Models in Radiology

Link: https://arxiv.org/abs/2412.01233
Authors: Christian Bluethgen, Dave Van Veen, Cyril Zakka, Katherine Link, Aaron Fanous, Roxana Daneshjou, Thomas Frauenfelder, Curtis Langlotz, Sergios Gatidis, Akshay Chaudhari
Keywords: integrating complex imaging, complex imaging data, produce actionable insights, heart of radiological, challenge of integrating
Categories: Artificial Intelligence (cs.AI)
Comments: A redacted version of this preprint has been accepted for publication in Radiology

Abstract:At the heart of radiological practice is the challenge of integrating complex imaging data with clinical information to produce actionable insights. Nuanced application of language is key for various activities, including managing requests, describing and interpreting imaging findings in the context of clinical data, and concisely documenting and communicating the outcomes. The emergence of large language models (LLMs) offers an opportunity to improve the management and interpretation of the vast data in radiology. Despite being primarily general-purpose, these advanced computational models demonstrate impressive capabilities in specialized language-related tasks, even without specific training. Unlocking the potential of LLMs for radiology requires basic understanding of their foundations and a strategic approach to navigate their idiosyncrasies. This review, drawing from practical radiology and machine learning expertise and recent literature, provides readers insight into the potential of LLMs in radiology. It examines best practices that have so far stood the test of time in the rapidly evolving landscape of LLMs. This includes practical advice for optimizing LLM characteristics for radiology practices along with limitations, effective prompting, and fine-tuning strategies.

[AI-47] FD-LLM: Large Language Model for Fault Diagnosis of Machines

Link: https://arxiv.org/abs/2412.01218
Authors: Hamzah A.A.M. Qaid, Bo Zhang, Dan Li, See-Kiong Ng, Wei Li
Keywords: Large language models, Large language, valuable conceptual representations, language models, capturing complex
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 2 figures, 16 tables, including the tables in the appendix

Abstract:Large language models (LLMs) are effective at capturing complex, valuable conceptual representations from textual data for a wide range of real-world applications. However, in fields like Intelligent Fault Diagnosis (IFD), incorporating additional sensor data (such as vibration signals, temperature readings, and operational metrics) is essential, but it is challenging to capture such sensor data information within traditional text corpora. This study introduces a novel IFD approach by effectively adapting LLMs to numerical data inputs for identifying various machine faults from time-series sensor data. We propose FD-LLM, an LLM framework specifically designed for fault diagnosis by formulating the training of the LLM as a multi-class classification problem. We explore two methods for encoding vibration signals: the first method uses a string-based tokenization technique to encode vibration signals into text representations, while the second extracts statistical features from both the time and frequency domains as statistical summaries of each signal. We assess the fault diagnosis capabilities of four open-sourced LLMs based on the FD-LLM framework, and evaluate the models’ adaptability and generalizability under various operational conditions and machine components, namely for traditional fault diagnosis, cross-operational conditions, and cross-machine component settings. Our results show that LLMs such as Llama3 and Llama3-instruct demonstrate strong fault detection capabilities and significant adaptability across different operational conditions, outperforming state-of-the-art deep learning (DL) approaches in many cases.
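
The second encoding mentioned in the abstract, summarizing a vibration signal by time-domain and frequency-domain statistics before handing it to a language model, might be sketched as follows. The particular features (mean, RMS, peak, crest factor, dominant frequency), the naive DFT, and the names `signal_features` and `to_prompt` are assumptions chosen for illustration; the paper's actual feature set and prompt format are not given in the abstract.

```python
import cmath
import math


def signal_features(x, fs):
    """Summarize a non-empty signal x sampled at fs Hz with a few statistics."""
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    crest = peak / rms if rms > 0 else 0.0
    # Naive O(n^2) DFT magnitudes for bins 1..n//2 (DC bin skipped).
    mags = []
    for k in range(1, n // 2 + 1):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    k_dom = 1 + mags.index(max(mags)) if mags else 0
    return {
        "mean": mean,
        "rms": rms,
        "peak": peak,
        "crest_factor": crest,
        "dominant_hz": k_dom * fs / n,
    }


def to_prompt(features):
    """Render the features as a short text line (hypothetical prompt format)."""
    return ", ".join(f"{name}={value:.3f}" for name, value in sorted(features.items()))
```

For long signals an FFT routine would replace the quadratic loop; the naive transform is used here only to keep the sketch dependency-free.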

[AI-48] A Semantic Communication System for Real-time 3D Reconstruction Tasks

Link: https://arxiv.org/abs/2412.01191
Authors: Jiaxing Zhang,Luosong Guo,Kun Zhu,Houming Qiu
Keywords: increasingly important role, real-time semantic mapping, semantic mapping tasks, semantic mapping, high-precision robot localization
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 11 figures, accepted by 2024 8th International Conference on Communication and Information Systems (ICCIS 2024)

Abstract:3D semantic maps have played an increasingly important role in high-precision robot localization and scene understanding. However, real-time construction of semantic maps requires mobile edge devices with extremely high computing power, which are expensive and limit the widespread application of semantic mapping. In order to address this limitation, inspired by cloud-edge collaborative computing and the high transmission efficiency of semantic communication, this paper proposes a method to achieve real-time semantic mapping tasks with limited-resource mobile devices. Specifically, we design an encoding-decoding semantic communication framework for real-time semantic mapping tasks under limited-resource situations. In addition, considering the impact of different channel conditions on communication, this paper designs a module based on the attention mechanism to achieve stable data transmission under various channel conditions. In terms of simulation experiments, based on the TUM dataset, it was verified that the system has an error of less than 0.1% compared to the groundtruth in mapping and localization accuracy and is superior to some novel semantic communication algorithms in real-time performance and channel adaptation. Besides, we implement a prototype system to verify the effectiveness of the proposed framework and designed module in real indoor scenarios. The results show that our system can complete real-time semantic mapping tasks for common indoor objects (chairs, computers, people, etc.) with a limited-resource device, and the mapping update time is less than 1 second.

[AI-49] Training Stiff Neural Ordinary Differential Equations with Explicit Exponential Integration Methods

Link: https://arxiv.org/abs/2412.01181
Authors: Colby Fronk,Linda Petzold
Keywords: ordinary differential equations, ODE approaches struggle, Stiff ordinary differential, standard neural ODE, neural ODE approaches
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments:

Abstract:Stiff ordinary differential equations (ODEs) are common in many science and engineering fields, but standard neural ODE approaches struggle to accurately learn these stiff systems, posing a significant barrier to widespread adoption of neural ODEs. In our earlier work, we addressed this challenge by utilizing single-step implicit methods for solving stiff neural ODEs. While effective, these implicit methods are computationally costly and can be complex to implement. This paper expands on our earlier work by exploring explicit exponential integration methods as a more efficient alternative. We evaluate the potential of these explicit methods to handle stiff dynamics in neural ODEs, aiming to enhance their applicability to a broader range of scientific and engineering problems. We found the integrating factor Euler (IF Euler) method to excel in stability and efficiency. While implicit schemes failed to train the stiff Van der Pol oscillator, the IF Euler method succeeded, even with large step sizes. However, IF Euler’s first-order accuracy limits its use, leaving the development of higher-order methods for stiff neural ODEs an open research problem.
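
To make the integrating factor Euler idea concrete, here is a minimal sketch on a scalar semilinear test problem y' = L*y + N(t, y). It uses the classical IF (Lawson) Euler update y_{n+1} = exp(h*L) * (y_n + h*N(t_n, y_n)); the test problem, step size, and function names are assumptions chosen for illustration, and the abstract does not say how the paper splits a neural ODE into linear and nonlinear parts.

```python
import math

def explicit_euler_step(y, t, h, lin, nonlin):
    # Standard forward Euler on the full right-hand side.
    return y + h * (lin * y + nonlin(t, y))

def if_euler_step(y, t, h, lin, nonlin):
    # Integrating factor Euler: the stiff linear part is handled exactly
    # through exp(h * lin); only the nonlinear part is stepped explicitly.
    return math.exp(h * lin) * (y + h * nonlin(t, y))

def integrate(step, y0, h, n_steps, lin, nonlin):
    y, t = y0, 0.0
    for _ in range(n_steps):
        y = step(y, t, h, lin, nonlin)
        t += h
    return y

lin = -1000.0                       # stiff linear coefficient (assumed)
forcing = lambda t, y: math.sin(t)  # mild non-stiff term (assumed)
h = 0.1                             # far above forward Euler's stability limit of 2/1000

y_explicit = integrate(explicit_euler_step, 1.0, h, 20, lin, forcing)
y_if = integrate(if_euler_step, 1.0, h, 20, lin, forcing)
```

With this step size forward Euler multiplies the state by roughly -99 per step and diverges, while the IF Euler iterate stays bounded. The method is still only first-order accurate, which matches the limitation the abstract points out.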

[AI-50] Superhypergraph Neural Networks and Plithogenic Graph Neural Networks: Theoretical Foundations

Link: https://arxiv.org/abs/2412.01176
Authors: Takaaki Fujita
Keywords: Graph Neural Networks, Neural networks, Hypergraphs extend traditional, connect multiple nodes, Hypergraph Neural Networks
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Combinatorics (math.CO); Logic (math.LO)
Comments: 77 pages; 3 figures

Abstract:Hypergraphs extend traditional graphs by allowing edges to connect multiple nodes, while superhypergraphs further generalize this concept to represent even more complex relationships. Neural networks, inspired by biological systems, are widely used for tasks such as pattern recognition, data classification, and prediction. Graph Neural Networks (GNNs), a well-established framework, have recently been extended to Hypergraph Neural Networks (HGNNs), with their properties and applications being actively studied. The Plithogenic Graph framework enhances graph representations by integrating multi-valued attributes, as well as membership and contradiction functions, enabling the detailed modeling of complex relationships. In the context of handling uncertainty, concepts such as Fuzzy Graphs and Neutrosophic Graphs have gained prominence. It is well established that Plithogenic Graphs serve as a generalization of both Fuzzy Graphs and Neutrosophic Graphs. Furthermore, the Fuzzy Graph Neural Network has been proposed and is an active area of research. This paper establishes the theoretical foundation for the development of SuperHyperGraph Neural Networks (SHGNNs) and Plithogenic Graph Neural Networks, expanding the applicability of neural networks to these advanced graph structures. While mathematical generalizations and proofs are presented, future computational experiments are anticipated.

[AI-51] R.I.P.: A Simple Black-box Attack on Continual Test-time Adaptation

Link: https://arxiv.org/abs/2412.01154
Authors: Trung-Hieu Hoang,Duc Minh Vo,Minh N. Do
Keywords: Test-time adaptation, continual TTA model, continual domain shift, continual TTA, unlabeled testing data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Test-time adaptation (TTA) has emerged as a promising solution to tackle continual domain shift in machine learning by allowing model parameters to change at test time via self-supervised learning on unlabeled testing data. At the same time, it unfortunately opens the door to unforeseen vulnerabilities that degrade performance over time. Through a simple theoretical continual TTA model, we identify a risk in the sampling process of testing data that could easily degrade the performance of a continual TTA model. We name this risk Reusing of Incorrect Prediction (RIP); it can be exploited by TTA attackers or arise from unintended queries by general TTA users. The risk posed by RIP is also highly realistic, as it requires neither prior knowledge of model parameters nor modification of testing samples. This simple requirement makes RIP the first black-box TTA attack algorithm, setting it apart from existing white-box attempts. We extensively benchmark the performance of the most recent continual TTA approaches when facing the RIP attack, providing insights on its success, and laying out potential roadmaps that could enhance the resilience of future continual TTA systems.

[AI-52] RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

Link: https://arxiv.org/abs/2412.01129
Authors: Geonho Lee,Janghwan Lee,Sukjin Hong,Minsoo Kim,Euijai Ahn,Du-Seong Chang,Jungwook Choi
Keywords: quantization error compensation, LoRA-based quantization error, quantization error, error compensation, parameter-efficient LLM fine-tuning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, with no prior investigation into understanding this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on a rank analysis revealing the rank-insensitive nature of the model-wise activation discrepancy loss, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ’s consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.
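
As background for LoRA-based quantization error compensation in general, the sketch below quantizes a small weight matrix to 2 bits and fits a rank-1 adapter to the per-matrix weight error by power iteration. This is the simple weight-level baseline, not RILQ's model-wise activation discrepancy loss; the quantizer, the matrix values, and all function names are illustrative assumptions.

```python
import math

def quantize_2bit(W):
    """Uniform 4-level quantization over the whole matrix (toy quantizer)."""
    flat = [w for row in W for w in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / 3 or 1.0  # 4 levels span 3 intervals; guard constant input
    return [[lo + scale * min(3, max(0, round((w - lo) / scale))) for w in row]
            for row in W]

def rank1_adapter(E, iters=50):
    """Approximate E by outer(a, b) using power iteration (best rank-1 fit)."""
    rows, cols = len(E), len(E[0])
    u = [0.0] * rows
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(E[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        v = [sum(E[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
    sigma = sum(u[i] * E[i][j] * v[j] for i in range(rows) for j in range(cols))
    return [sigma * x for x in u], v

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

W = [[0.9, -0.4, 0.1], [0.3, 0.7, -0.8], [-0.2, 0.5, 0.6]]
Q = quantize_2bit(W)
E = [[w - q for w, q in zip(rw, rq)] for rw, rq in zip(W, Q)]  # quantization error
a, b = rank1_adapter(E)
residual = [[E[i][j] - a[i] * b[j] for j in range(3)] for i in range(3)]
```

Here `Q + outer(a, b)` plays the role of the quantized weight plus its low-rank adapter, and `frobenius(residual)` is smaller than `frobenius(E)`. The abstract argues that this kind of per-layer compensation underperforms below 4 bits, which is the motivation for a model-wise objective.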

[AI-53] TAS-TsC: A Data-Driven Framework for Estimating Time of Arrival Using Temporal-Attribute-Spatial Tri-space Coordination of Truck Trajectories

Link: https://arxiv.org/abs/2412.01122
Authors: Mengran Li,Junzhou Chen,Guanying Jiang,Fuliang Li,Ronghui Zhang,Siyuan Gong,Zhihan Lv
Keywords: Accurately estimating time, optimizing transportation efficiency, Accurately estimating, efficiency in logistics, crucial for optimizing
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Accurately estimating time of arrival (ETA) for trucks is crucial for optimizing transportation efficiency in logistics. GPS trajectory data offers valuable information for ETA, but challenges arise due to temporal sparsity, variable sequence lengths, and the interdependencies among multiple trucks. To address these issues, we propose the Temporal-Attribute-Spatial Tri-space Coordination (TAS-TsC) framework, which leverages three feature spaces (temporal, attribute, and spatial) to enhance ETA. Our framework consists of a Temporal Learning Module (TLM) using state space models to capture temporal dependencies, an Attribute Extraction Module (AEM) that transforms sequential features into structured attribute embeddings, and a Spatial Fusion Module (SFM) that models the interactions among multiple trajectories using graph representation learning. These modules collaboratively learn trajectory embeddings, which are then used by a Downstream Prediction Module (DPM) to estimate arrival times. We validate TAS-TsC on real truck trajectory datasets collected from Shenzhen, China, demonstrating its superior performance compared to existing methods.

[AI-54] How the use of feature selection methods influences the efficiency and accuracy of complex network simulations

Link: https://arxiv.org/abs/2412.01096
Authors: Katarzyna Musial,Jiaqi Wen,Andreas Gwyther-Gouriotis
Keywords: perfectly emulate real-world, network systems’ models, Complex network, Complex network systems’, link prediction
Categories: Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments:

Abstract: Models of complex network systems are designed to perfectly emulate real-world networks through the use of simulation and link prediction. Complex network systems are defined by nodes and their connections, both of which have real-world features, resulting in a heterogeneous network in which each node has distinct characteristics. Incorporating real-world features is therefore important for achieving a simulation that best represents the real world. Currently, very few complex network systems implement real-world features, so this study proposes feature selection methods that use unsupervised filtering techniques to rank real-world node features, alongside a wrapper function that tests combinations of the ranked features. The chosen method, named FS-SNS, improved 8 out of 10 simulations of real-world networks. A consistent threshold was also discovered: including 4 features achieved the most accurate simulation for all networks. Based on these findings, the study also proposes future work and discusses how the findings can be used to further the Digital Twin and complex network system fields.

[AI-55] A Hierarchical Heuristic for Clustered Steiner Trees in the Plane with Obstacles

Link: https://arxiv.org/abs/2412.01094
Authors: Victor Parque
Keywords: Euclidean Steiner trees, real-world applications ubiquitously, model minimal networks, disjoint Euclidean Steiner, Euclidean Steiner
Categories: Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: Article accepted/presented as Long Paper at The Twelfth International Symposium on Computing and Networking (CANDAR2024)

Abstract: Euclidean Steiner trees are widely used to model minimal networks in real-world applications. In this paper, we study the feasibility of a hierarchical approach embedded with bundling operations to compute multiple, mutually disjoint Euclidean Steiner trees that avoid clutter and overlapping with obstacles in the plane, which is significant for modeling the decentralized, multipoint coordination of agents in constrained 2D domains. Our computational experiments using arbitrary obstacle configurations with convex and non-convex geometries show the feasibility and attractive performance of computing multiple obstacle-avoiding Steiner trees in the plane. Our results offer the mechanisms to elucidate new operators for obstacle-avoiding Steiner trees.

[AI-56] A Hybrid Evolutionary Approach for Multi Robot Coordinated Planning at Intersections

Link: https://arxiv.org/abs/2412.01082
Authors: Victor Parque
Keywords: Coordinated multi-robot motion, multi-robot motion planning, factories and warehouses, Coordinated multi-robot, mobility in roads
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Computation (stat.CO)
Comments: Paper accepted/presented as a regular paper at The Twelfth International Symposium on Computing and Networking (CANDAR 2024)

Abstract: Coordinated multi-robot motion planning at intersections is key for safe mobility on roads and in factories and warehouses. Rapidly exploring random tree (RRT) algorithms are popular in multi-robot motion planning. However, generating the graph configuration space and searching in the composite tensor configuration space is computationally expensive for a large number of sample points. In this paper, we propose a new evolutionary-based algorithm using a parametric lattice-based configuration and the discrete-based RRT for collision-free multi-robot planning at intersections. Our computational experiments using complex planning intersection scenarios have shown the feasibility and the superiority of the proposed algorithm compared to seven other related approaches. Our results offer new sampling and representation mechanisms to render optimization-based approaches for multi-robot navigation.

[AI-57] Multi-Agent Deep Reinforcement Learning for Distributed and Autonomous Platoon Coordination via Speed-regulation over Large-scale Transportation Networks

Link: https://arxiv.org/abs/2412.01075
Authors: Dixiao Wei(1),Peng Yi(1 and 2),Jinlong Lei(1 and 2),Xingyi Zhu(3) ((1) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, China, (2) Department of Control Science and Engineering, Tongji University, China, (3) Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, China)
Keywords: improve traffic flow, platooning technology enables, traffic flow efficiency, Truck platooning technology, improve safety
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Truck platooning technology enables a group of trucks to travel closely together, allowing the platoon to save fuel, improve traffic flow efficiency, and improve safety. In this paper, we consider the platoon coordination problem in a large-scale transportation network, aiming to promote cooperation among trucks and optimize overall efficiency. Because coordination involves regulating both speed and departure times at hubs, we formulate the problem as a complicated dynamic stochastic integer program under network and information constraints. To obtain an autonomous, distributed, and robust platoon coordination policy, we cast the problem as a Decentralized Partially Observable Markov Decision Process. Then, we propose a Multi-Agent Deep Reinforcement Learning framework named Truck Attention-QMIX (TA-QMIX) to train an efficient online decision policy. TA-QMIX utilizes the attention mechanism to enhance the representation of truck fuel gains and delay times, and provides explicit truck cooperation information during the training process, promoting trucks’ willingness to cooperate. The training framework adopts centralized training and distributed execution, thus training a policy with which trucks make decisions online using only nearby information. Hence, the policy can be autonomously executed on a large-scale network. Finally, we perform comparison and ablation experiments in the transportation network of the Yangtze River Delta region in China to verify the effectiveness of the proposed framework. In a repeated comparative experiment with 5,000 trucks, our method saves an average of 19.17% of fuel, with an average delay of only 9.57 minutes per truck and a decision time of 0.001 seconds.

[AI-58] Lookahead Counterfactual Fairness

Link: https://arxiv.org/abs/2412.01065
Authors: Zhiqun Zuo,Tian Xie,Xuwei Tan,Xueru Zhang,Mohammad Mahdi Khalili
Keywords: machine learning, concerns have arisen, applications that involve, social groups, Counterfactual fairness
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract: As machine learning (ML) algorithms are used in applications that involve humans, concerns have arisen that these algorithms may be biased against certain social groups. Counterfactual fairness (CF) is a fairness notion proposed in Kusner et al. (2017) that measures the unfairness of ML predictions; it requires that the prediction perceived by an individual in the real world has the same marginal distribution as it would be in a counterfactual world, in which the individual belongs to a different group. Although CF ensures fair ML predictions, it fails to consider the downstream effects of ML predictions on individuals. Since humans are strategic and often adapt their behaviors in response to the ML system, predictions that satisfy CF may not lead to a fair future outcome for the individuals. In this paper, we introduce lookahead counterfactual fairness (LCF), a fairness notion accounting for the downstream effects of ML models which requires the individual future status to be counterfactually fair. We theoretically identify conditions under which LCF can be satisfied and propose an algorithm based on the theorems. We also extend the concept to path-dependent fairness. Experiments on both synthetic and real data validate the proposed method.

[AI-59] Reducing Inference Energy Consumption Using Dual Complementary CNNs

Link: https://arxiv.org/abs/2412.01039
Authors: Michail Kinnas,John Violos,Ioannis Kompatsiaris,Symeon Papadopoulos
Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, area of research, important area
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Energy efficiency of Convolutional Neural Networks (CNNs) has become an important area of research, with various strategies being developed to minimize the power consumption of these models. Previous efforts, including techniques like model pruning, quantization, and hardware optimization, have made significant strides in this direction. However, there remains a need for more effective on device AI solutions that balance energy efficiency with model performance. In this paper, we propose a novel approach to reduce the energy requirements of inference of CNNs. Our methodology employs two small Complementary CNNs that collaborate with each other by covering each other’s “weaknesses” in predictions. If the confidence for a prediction of the first CNN is considered low, the second CNN is invoked with the aim of producing a higher confidence prediction. This dual-CNN setup significantly reduces energy consumption compared to using a single large deep CNN. Additionally, we propose a memory component that retains previous classifications for identical inputs, bypassing the need to re-invoke the CNNs for the same input, further saving energy. Our experiments on a Jetson Nano computer demonstrate an energy reduction of up to 85.8% achieved on modified datasets where each sample was duplicated once. These findings indicate that leveraging a complementary CNN pair along with a memory component effectively reduces inference energy while maintaining high accuracy.
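
The control flow described in the abstract (a first model, a confidence check, an optional second model, and a memory of earlier results) can be sketched as follows. The two models are stand-in callables returning `(label, confidence)`, and the threshold value, the rule of keeping the more confident answer, the SHA-256 cache key, and the class name `DualClassifier` are all assumptions made for illustration.

```python
import hashlib

class DualClassifier:
    """Confidence-gated pair of classifiers with an exact-match result cache."""

    def __init__(self, first, second, threshold=0.8):
        self.first = first          # small model, always tried first
        self.second = second        # complementary model, used on low confidence
        self.threshold = threshold  # assumed confidence cut-off
        self.memory = {}            # input digest -> previously returned label
        self.calls = {"first": 0, "second": 0, "memory": 0}

    def predict(self, raw_input: bytes):
        key = hashlib.sha256(raw_input).hexdigest()
        if key in self.memory:
            # Identical input seen before: skip both models entirely.
            self.calls["memory"] += 1
            return self.memory[key]

        self.calls["first"] += 1
        label, confidence = self.first(raw_input)
        if confidence < self.threshold:
            self.calls["second"] += 1
            other_label, other_confidence = self.second(raw_input)
            # Assumed combination rule: keep whichever answer is more confident.
            if other_confidence > confidence:
                label = other_label

        self.memory[key] = label
        return label

# Stand-in models for demonstration only.
def small_model_a(x):
    return ("cat", 0.9) if x.startswith(b"easy") else ("cat", 0.4)

def small_model_b(x):
    return ("dog", 0.7)

clf = DualClassifier(small_model_a, small_model_b)
results = [clf.predict(b"easy-1"), clf.predict(b"hard-1"), clf.predict(b"hard-1")]
```

The cache only helps when byte-identical inputs recur, which is consistent with the abstract reporting its largest savings on datasets where each sample was duplicated.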

[AI-60] AI Benchmarks and Datasets for LLM Evaluation

Link: https://arxiv.org/abs/2412.01020
Authors: Todor Ivanov,Valeri Penchev
Keywords: large model sizes, LLMs demand significant, requiring distributed computing, demand significant computational, significant computational resources
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: November 2024 v1.0

Abstract: LLMs demand significant computational resources for both pre-training and fine-tuning, requiring distributed computing capabilities due to their large model sizes [sastry2024computing]. Their complex architecture poses challenges throughout the entire AI lifecycle, from data collection to deployment and monitoring [OECD_AIlifecycle]. Addressing critical AI system challenges, such as explainability, corrigibility, interpretability, and hallucination, necessitates a systematic methodology and rigorous benchmarking [guldimann2024complai]. To effectively improve AI systems, we must precisely identify systemic vulnerabilities through quantitative evaluation, bolstering system trustworthiness. The enactment of the EU AI Act [EUAIAct] by the European Parliament on March 13, 2024, establishing the first comprehensive EU-wide requirements for the development, deployment, and use of AI systems, further underscores the importance of tools and methodologies such as Z-Inspection. It highlights the need to enrich this methodology with practical benchmarks to effectively address the technical challenges posed by AI systems. To this end, we have launched a project that is part of the AI Safety Bulgaria initiatives [AI_Safety_Bulgaria], aimed at collecting and categorizing AI benchmarks. This will enable practitioners to identify and utilize these benchmarks throughout the AI system lifecycle.

[AI-61] DSSRNN: Decomposition-Enhanced State-Space Recurrent Neural Network for Time-Series Analysis

Link: https://arxiv.org/abs/2412.00994
Authors: Ahmad Mohammadshirazi,Ali Nosratifiroozsalari,Rajiv Ramnath
Keywords: requiring domain-specific knowledge, domain-specific knowledge due, requiring domain-specific, wide-ranging applications, DSSRNN
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series forecasting is a crucial yet challenging task in machine learning, requiring domain-specific knowledge due to its wide-ranging applications. While recent Transformer models have improved forecasting capabilities, they come with high computational costs. Linear-based models have shown better accuracy than Transformers but still fall short of ideal performance. To address these challenges, we introduce the Decomposition State-Space Recurrent Neural Network (DSSRNN), a novel framework designed for both long-term and short-term time series forecasting. DSSRNN uniquely combines decomposition analysis to capture seasonal and trend components with state-space models and physics-based equations. We evaluate DSSRNN’s performance on indoor air quality datasets, focusing on CO2 concentration prediction across various forecasting horizons. Results demonstrate that DSSRNN consistently outperforms state-of-the-art models, including transformer-based architectures, in terms of both Mean Squared Error (MSE) and Mean Absolute Error (MAE). For example, at the shortest horizon (T=96) in Office 1, DSSRNN achieved an MSE of 0.378 and an MAE of 0.401, significantly lower than competing models. Additionally, DSSRNN exhibits superior computational efficiency compared to more complex models. While not as lightweight as the DLinear model, DSSRNN achieves a balance between performance and efficiency, with only 0.11G MACs and 437MiB memory usage, and an inference time of 0.58ms for long-term forecasting. This work not only showcases DSSRNN’s success but also establishes a new benchmark for physics-informed machine learning in environmental forecasting and potentially other domains.
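
The trend and seasonal split mentioned in the abstract is commonly done with a moving average, and the sketch below shows that generic decomposition. The window size, the edge-padding choice, and the function names are assumptions; the abstract does not specify which decomposition DSSRNN uses or how the parts feed into its state-space and physics-based components.

```python
def moving_average(series, window):
    """Centered moving average; edges are padded by repeating the end values."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the average stays centered")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]

def decompose(series, window=5):
    """Split a series into a smooth trend and the remaining seasonal part."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

# Toy signal: a rising trend plus an alternating +1/-1 pattern.
readings = [float(t) + (1.0 if t % 2 == 0 else -1.0) for t in range(10)]
trend, seasonal = decompose(readings, window=5)
```

By construction the two parts add back up to the input, so a forecaster can model each component separately and sum the predictions.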

[AI-62] Linear Probe Penalties Reduce LLM Sycophancy NEURIPS2024

Link: https://arxiv.org/abs/2412.00967
Authors: Henry Papadatos,Rachel Freedman
Keywords: Large language models, Large language, prioritizing agreement, objective statements, users over accurate
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 15 figures, NeurIPS 2024 Workshop Socially Responsible Language Modelling Research (SoLaR)

Abstract:Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements. This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF), an LLM fine-tuning stage intended to align model outputs with human values. Instead of increasing accuracy and reliability, the reward model learned from RLHF often rewards sycophancy. We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior. Our experiments show that constructing and optimizing against this surrogate reward function reduces sycophantic behavior in multiple open-source LLMs. Our results suggest a generalizable methodology for reducing unwanted LLM behaviors that are not sufficiently disincentivized by RLHF fine-tuning.
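
A minimal sketch of a probe-penalized reward, under stated assumptions: a linear probe `w . h + b` reads a reward-model activation vector `h`, and the surrogate reward subtracts a scaled, non-negative probe score from the original reward. The clamping at zero, the penalty weight, the toy activations, and the function names are illustrative and are not taken from the paper.

```python
def probe_score(activation, weights, bias):
    """Linear probe: higher values are read as stronger sycophancy markers."""
    return sum(w * a for w, a in zip(weights, activation)) + bias

def surrogate_reward(base_reward, activation, weights, bias, penalty=1.0):
    """Original reward minus a penalty that grows with the probe score."""
    score = probe_score(activation, weights, bias)
    return base_reward - penalty * max(0.0, score)  # only positive scores are penalized

# Toy 2-dimensional activations with hand-picked probe weights (assumed).
weights, bias = [1.0, -1.0], 0.0
sycophantic = surrogate_reward(3.0, [2.0, 0.5], weights, bias)
neutral = surrogate_reward(3.0, [0.2, 0.9], weights, bias)
```

Two responses with the same base reward are then ranked differently: the one whose activations trigger the probe ends up with the lower surrogate reward, which is the signal a policy would be optimized against.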

[AI-63] Generative Language Models Potential for Requirement Engineering Applications: Insights into Current Strengths and Limitations

Link: https://arxiv.org/abs/2412.00959
Authors: Summra Saleem,Muhammad Nabeel Asim,Ludger Van Elst,Andreas Dengel
Keywords: language models, Traditional language models, software engineering domain, Traditional language, language
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract: Traditional language models have been extensively evaluated in the software engineering domain; however, the potential of ChatGPT and Gemini has not been fully explored. To fill this gap, this paper presents a comprehensive case study investigating the potential of both language models for developing diverse types of requirement engineering applications. It explores in depth how prompts with varying levels of expert knowledge affect the prediction accuracy of both language models. Across 4 public benchmark datasets of requirement engineering tasks, it compares the performance of both language models with existing task-specific machine/deep learning predictors and traditional language models. Specifically, the paper uses 4 benchmark datasets: Pure (7,445 samples, requirements extraction), PROMISE (622 samples, requirements classification), REQuestA (300 question-answer (QA) pairs), and Aerospace (6,347 words, requirements NER tagging). Our experiments reveal that, in comparison to ChatGPT, Gemini requires more careful prompt engineering to provide accurate predictions. On the requirements extraction benchmark, the state-of-the-art F1-score is 0.86, while ChatGPT and Gemini achieve 0.76 and 0.77, respectively. On the requirements classification dataset, the state-of-the-art F1-score is 0.96, and both language models achieve 0.78. On the named entity recognition (NER) task, the state-of-the-art F1-score is 0.92, while ChatGPT achieves 0.36 and Gemini 0.25. Similarly, on the question answering dataset, the state-of-the-art F1-score is 0.90, and ChatGPT and Gemini achieve 0.91 and 0.88, respectively. Except for question answering, both models underperform current state-of-the-art predictors.

[AI-64] BIGCity: A Universal Spatiotemporal Model for Unified Trajectory and Traffic State Data Analysis

Link: https://arxiv.org/abs/2412.00953
Authors: Xie Yu,Jingyuan Wang,Yifan Yang,Qian Huang,Ke Qu
Keywords: representing individual-level mobility, representing population-level mobility, traffic state data, Typical dynamic, traffic state
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Typical dynamic spatiotemporal (ST) data includes trajectory data (representing individual-level mobility) and traffic state data (representing population-level mobility). Traditional studies often treat trajectory and traffic state data as distinct, independent modalities, each tailored to specific tasks within a single modality. However, real-world applications, such as navigation apps, require joint analysis of trajectory and traffic state data. Treating these data types as two separate domains can lead to suboptimal model performance. Although recent advances in ST data pre-training and ST foundation models aim to develop universal models for ST data analysis, most existing models are “multi-task, solo-data modality” (MTSM), meaning they can handle multiple tasks within either trajectory data or traffic state data, but not both simultaneously. To address this gap, this paper introduces BIGCity, the first multi-task, multi-data modality (MTMD) model for ST data analysis. The model targets two key challenges in designing an MTMD ST model: (1) unifying the representations of different ST data modalities, and (2) unifying heterogeneous ST analysis tasks. To overcome the first challenge, BIGCity introduces a novel ST-unit that represents both trajectories and traffic states in a unified format. For the second challenge, BIGCity adopts a tunable large model with ST task-oriented prompts, enabling it to perform a range of heterogeneous tasks without the need for fine-tuning. Extensive experiments on real-world datasets demonstrate that BIGCity achieves state-of-the-art performance across 8 tasks, outperforming 18 baselines. To the best of our knowledge, BIGCity is the first model capable of handling both trajectories and traffic states for diverse heterogeneous tasks. Our code is available at this https URL

[AI-65] STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Link: https://arxiv.org/abs/2412.00949
Authors: Nicholas Lenzen,Amogh Raut,Andrew Melnik
Keywords: latent CLIP embeddings, latent goal space, follow instructions, CLIP foundation model, training generative agents
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at CoRL 2024: Workshop on Lifelong Learning for Home Robots

Abstract:Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.

[AI-66] Bilinear Convolution Decomposition for Causal RL Interpretability

Link: https://arxiv.org/abs/2412.00944
Authors: Narmeen Oozeer, Sinem Erisken, Alice Rigg
Keywords: coarse causal control, Efforts to interpret, interpret reinforcement learning, attribution or probing, causal control
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 10 figures

Abstract:Efforts to interpret reinforcement learning (RL) models often rely on high-level techniques such as attribution or probing, which provide only correlational insights and coarse causal control. This work proposes replacing nonlinearities in convolutional neural networks (ConvNets) with bilinear variants, to produce a class of models for which these limitations can be addressed. We show bilinear model variants perform comparably in model-free reinforcement learning settings, and give a side by side comparison on ProcGen environments. Bilinear layers’ analytic structure enables weight-based decomposition. Previous work has shown bilinearity enables quantifying functional importance through eigendecomposition, to identify interpretable low rank structure. We show how to adapt the decomposition to convolution layers by applying singular value decomposition to vectors of interest, to separate the channel and spatial dimensions. Finally, we propose a methodology for causally validating concept-based probes, and illustrate its utility by studying a maze-solving agent’s ability to track a cheese object.
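
The weight-based decomposition mentioned in the abstract can be illustrated with a minimal sketch. It assumes a bilinear layer of the form y = (W x) * (V x) (elementwise product), which is one common formulation and not necessarily the exact variant used in the paper. Reading the output along a direction u then reduces to a quadratic form x^T Q x, where Q is built from the weights alone. All function and variable names here are hypothetical.

```python
def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def bilinear_layer(W, V, x):
    """y_k = (W x)_k * (V x)_k: an elementwise product replaces the nonlinearity."""
    return [a * b for a, b in zip(matvec(W, x), matvec(V, x))]


def interaction_matrix(W, V, u):
    """Q = sum_k u_k * 0.5 * (w_k v_k^T + v_k w_k^T); symmetric, input-independent."""
    n = len(W[0])
    Q = [[0.0] * n for _ in range(n)]
    for uk, w, v in zip(u, W, V):
        for i in range(n):
            for j in range(n):
                Q[i][j] += uk * 0.5 * (w[i] * v[j] + v[i] * w[j])
    return Q


def quadratic_form(Q, x):
    """Evaluate x^T Q x."""
    return sum(xi * qi for xi, qi in zip(x, matvec(Q, x)))


# Toy weights and input (assumed values, for illustration only).
W = [[1.0, 2.0], [0.0, -1.0]]
V = [[0.5, 0.0], [1.0, 1.0]]
x = [3.0, -2.0]
u = [1.0, 0.0]  # output direction of interest

y = bilinear_layer(W, V, x)
direct = sum(uk * yk for uk, yk in zip(u, y))   # u . y computed through the layer
Q = interaction_matrix(W, V, u)
via_Q = quadratic_form(Q, x)                    # same number from weights + input
```

Because Q is symmetric, it could then be eigendecomposed (for instance with `numpy.linalg.eigh`) to look for low-rank structure; that step is left out here to keep the sketch dependency-free.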

[AI-67] A Deep Generative Model for the Design of Synthesizable Ionizable Lipids NEURIPS2024

Link: https://arxiv.org/abs/2412.00928
Authors: Yuxuan Ou, Jingyi Zhao, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Keywords: modern biomedicine, enabling the effective, rapid degradation, vital in modern, ionizable lipids
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2024 Workshop on AI for New Drug Modalities

Abstract:Lipid nanoparticles (LNPs) are vital in modern biomedicine, enabling the effective delivery of mRNA for vaccines and therapies by protecting it from rapid degradation. Among the components of LNPs, ionizable lipids play a key role in RNA protection and facilitate its delivery into the cytoplasm. However, designing ionizable lipids is complex. Deep generative models can accelerate this process and explore a larger candidate space compared to traditional methods. Due to the structural differences between lipids and small molecules, existing generative models used for small molecule generation are unsuitable for lipid generation. To address this, we developed a deep generative model specifically tailored for the discovery of ionizable lipids. Our model generates novel ionizable lipid structures and provides synthesis paths using synthetically accessible building blocks, addressing synthesizability. This advancement holds promise for streamlining the development of lipid-based delivery systems, potentially accelerating the deployment of new therapeutic agents, including mRNA vaccines and gene therapies.

[AI-68] Playable Game Generation

Link: https://arxiv.org/abs/2412.00887
Authors: Mingyu Yang, Junyou Li, Zhongbin Fang, Sheng Chen, Yangbin Yu, Qiang Fu, Wei Yang, Deheng Ye
Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, multimodal video synthesis, Generated Content
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, Artificial Intelligence Generated Content (AIGC) has advanced from text-to-image generation to text-to-video and multimodal video synthesis. However, generating playable games presents significant challenges due to the stringent requirements for real-time interaction, high visual quality, and accurate simulation of game mechanics. Existing approaches often fall short, either lacking real-time capabilities or failing to accurately simulate interactive mechanics. To tackle the playability issue, we propose a novel method called PlayGen, which encompasses game data generation, an autoregressive DiT-based diffusion model, and a comprehensive playability-based evaluation framework. Validated on well-known 2D and 3D games, PlayGen achieves real-time interaction, ensures sufficient visual quality, and provides accurate interactive mechanics simulation. Notably, these results are sustained even after over 1000 frames of gameplay on an NVIDIA RTX 2060 GPU. Our code is publicly available: this https URL. Our playable demo generated by AI is: this http URL.

[AI-69] Learn to Unlearn: Meta-Learning-Based Knowledge Graph Embedding Unlearning

Link: https://arxiv.org/abs/2412.00881
Authors: Naixing Xu, Qian Li, Xu Wang, Bingchen Liu, Xin Li
Keywords: continuous vector spaces, methods map entities, Knowledge graph, embedding methods map, vector spaces
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge graph (KG) embedding methods map entities and relations into continuous vector spaces, improving performance in tasks like link prediction and question answering. With rising privacy concerns, machine unlearning (MU) has emerged as a critical AI technology, enabling models to eliminate the influence of specific data. Existing MU approaches often rely on data obfuscation and adjustments to training loss but lack generalization across unlearning tasks. This paper introduces MetaEU, a Meta-Learning-Based Knowledge Graph Embedding Unlearning framework. MetaEU leverages meta-learning to unlearn specific embeddings, mitigating their impact while preserving model performance on remaining data. Experiments on benchmark datasets demonstrate its effectiveness in KG embedding unlearning.

[AI-70] Deep evolving semi-supervised anomaly detection

Link: https://arxiv.org/abs/2412.00860
Authors: Jack Belham, Aryan Bhosale, Samrat Mukherjee, Biplab Banerjee, Fabio Cuzzolin
Keywords: continual semi-supervised anomaly, semi-supervised anomaly detection, aim of highlighting, continual semi-supervised learning, anomaly detection
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:The aim of this paper is to formalise the task of continual semi-supervised anomaly detection (CSAD), highlighting the importance of a problem formulation that assumes conditions as close to the real world as possible. After an overview of the relevant definitions of continual semi-supervised learning, its components, its anomaly detection extension, and the training protocols, the paper introduces a baseline model of a variational autoencoder (VAE) to work with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection. The results show that such a use of extreme value theory (EVT) applied to anomaly detection can provide promising results even in comparison to an upper baseline of joint training. The results explore the effects of how much labelled and unlabelled data is present, of which class, and where it is located in the data stream. Outlier rejection shows promising initial results, often surpassing a baseline method of Elastic Weight Consolidation (EWC). A baseline for CSAD is put forward along with the specific dataset setups used, for reproducibility and testability by other practitioners. Future research directions include other CSAD settings and further research into efficient continual hyperparameter tuning.

[AI-71] Improving Multimodal LLMs Ability In Geometry Problem Solving Reasoning And Multistep Scoring

Link: https://arxiv.org/abs/2412.00846
Authors: Avinash Anand, Raj Jaiswal, Abhishek Dharmadhikari, Atharva Marathe, Harsh Parimal Popat, Harshil Mital, Kritarth Prasad, Rajiv Ratn Shah, Roger Zimmermann
Keywords: Large Vision Language, Large Vision, Vision Language Models, Vision Language, capabilities of Large
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:This paper presents GPSM4K, a comprehensive geometry multimodal dataset tailored to augment the problem-solving capabilities of Large Vision Language Models (LVLMs). GPSM4K encompasses 2157 multimodal question-answer pairs manually extracted from mathematics textbooks spanning grades 7-12 and is further augmented to 5340 problems, consisting of both numerical and theorem-proving questions. In contrast to PGPS9k, Geometry3K, and Geo170K, which feature only objective-type questions, GPSM4K offers detailed step-by-step solutions in a consistent format, facilitating a comprehensive evaluation of problem-solving approaches. This dataset serves as an excellent benchmark for assessing the geometric reasoning capabilities of LVLMs. Evaluation on our test set shows that open-source language models have room for improvement in geometry problem-solving. Finetuning on our training set increases the geometry problem-solving capabilities of models. Further, we evaluate the effectiveness of techniques such as image captioning and Retrieval-Augmented Generation (RAG) on model performance. We leverage an LLM to automate final answer evaluation by providing it with ground truth and predicted solutions. This research will help to assess and improve the geometric reasoning capabilities of LVLMs.

[AI-72] SPILDL: A Scalable and Parallel Inductive Learner in Description Logic

Link: https://arxiv.org/abs/2412.00830
Authors: Eyad Algahtani
Keywords: Description Logic, Parallel Inductive Learner, Inductive Learner, SPILDL, Parallel Inductive
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:We present SPILDL, a Scalable and Parallel Inductive Learner in Description Logic (DL). SPILDL is based on the DL-Learner (the state of the art in DL-based ILP learning). As a DL-based ILP learner, SPILDL targets the $\mathcal{ALCQI}^{\mathcal{(D)}}$ DL language, and can learn DL hypotheses expressed as disjunctions of conjunctions (using the $\sqcup$ operator). Moreover, SPILDL’s hypothesis language also incorporates string concrete roles (also known as string data properties in the Web Ontology Language, OWL). Incorporating these powerful DL constructs enables SPILDL to learn powerful DL-based hypotheses for describing many complex real-world concepts. SPILDL employs a hybrid parallel approach, combining shared-memory and distributed-memory approaches, to accelerate ILP learning (for both hypothesis search and evaluation). According to experimental results, SPILDL’s parallel search improved performance by up to $\sim$27.3-fold (best case). For hypothesis evaluation, SPILDL improved evaluation performance through HT-HEDL (our multi-core CPU + multi-GPU hypothesis evaluation engine) by up to 38-fold (best case). By combining both parallel search and evaluation, SPILDL improved performance by up to $\sim$560-fold (best case). In the worst case, SPILDL’s parallel search doesn’t provide consistent speedups on all datasets, and is highly dependent on the nature of the ILP dataset’s search space. For some datasets, increasing the number of parallel search threads results in reduced performance, similar to or worse than baseline. Some ILP datasets benefit from parallel search, while others don’t (or the performance gains are negligible). As for parallel evaluation, on small datasets it provides similar or worse performance than baseline.

[AI-73] Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents

Link: https://arxiv.org/abs/2412.00821
Authors: Raj Jaiswal, Dhruv Jain, Harsh Parimal Popat, Avinash Anand, Abhishek Dharmadhikari, Atharva Marathe, Rajiv Ratn Shah
Keywords: Large Language Models, Large Language, Language Models, demonstrate remarkable capabilities, demonstrate remarkable
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in various reasoning tasks. However, they encounter significant challenges when it comes to scientific reasoning, particularly in physics, which requires not only mathematical reasoning but also factual and conceptual understanding. When addressing complex physics problems, LLMs typically face three key issues: problem miscomprehension, incorrect concept application, and computational errors. While each of these problems can be addressed individually, there is a need for a generalized approach that can tackle all three issues simultaneously. To address this, we introduce Mixture of Refinement Agents (MoRA), a novel agentic refinement framework that iteratively refines the LLM-generated base solution by correcting the aforementioned errors, resulting in a significant performance improvement for open-source LLMs. Our approach aims to bridge the gap between open-source LLMs and GPT-4o by utilizing the latter as an error identifier to guide these refinement agents. We evaluate our approach on the SciEval and MMLU subsets along with our own physics dataset (PhysicsQA). MoRA significantly improves the performance of Llama-3-70B and Gemma-2-27B on these datasets, achieving up to a 16% increase in final answer accuracy.

[AI-74] Long text outline generation: Chinese text outline based on unsupervised framework and large language mode

Link: https://arxiv.org/abs/2412.00810
Authors: Yan Yan, Yuanchi Ma
Keywords: identifying underlying chapter, Outline generation aims, aims to reveal, reveal the internal, internal structure
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Outline generation aims to reveal the internal structure of a document by identifying underlying chapter relationships and generating corresponding chapter summaries. Although existing deep learning methods and large models perform well on small- and medium-sized texts, they struggle to produce readable outlines for very long texts (such as fictional works), often failing to segment chapters coherently. In this paper, we propose a novel outline generation method for Chinese, combining an unsupervised framework with large models. Specifically, the method first generates chapter feature graph data based on entity and syntactic dependency relationships. Then, a representation module based on graph attention layers learns deep embeddings of the chapter graph data. Using these chapter embeddings, we design an operator based on Markov chain principles to segment plot boundaries. Finally, we employ a large model to generate summaries of each plot segment and produce the overall outline. We evaluate our model based on segmentation accuracy and outline readability, and our performance outperforms several deep learning models and large models in comparative evaluations.

[AI-75] Generative Model for Synthesizing Ionizable Lipids: A Monte Carlo Tree Search Approach

Link: https://arxiv.org/abs/2412.00807
Authors: Jingyi Zhao, Yuxuan Ou, Austin Tripp, Morteza Rasoulianboroujeni, José Miguel Hernández-Lobato
Keywords: effective messenger RNA, messenger RNA, developing lipid nanoparticles, Ionizable lipids, essential in developing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Ionizable lipids are essential in developing lipid nanoparticles (LNPs) for effective messenger RNA (mRNA) delivery. While traditional methods for designing new ionizable lipids are typically time-consuming, deep generative models have emerged as a powerful solution, significantly accelerating the molecular discovery process. However, a practical challenge arises as the molecular structures generated can often be difficult or infeasible to synthesize. This project explores Monte Carlo tree search (MCTS)-based generative models for synthesizable ionizable lipids. Leveraging a synthetically accessible lipid building block dataset and two specialized predictors to guide the search through chemical space, we introduce a policy network guided MCTS generative model capable of producing new ionizable lipids with available synthesis pathways.
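
To make "policy network guided MCTS" concrete, here is a minimal sketch of the selection step only. It assumes the PUCT rule popularized by AlphaZero-style search; the abstract does not say which selection rule the authors use. The action names, values, and priors are hypothetical placeholders and do not come from the paper.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Mean value plus an exploration bonus weighted by the policy prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_child(children, c_puct=1.5):
    """children maps action -> {'q', 'prior', 'visits'}; returns the best action."""
    parent_visits = sum(c["visits"] for c in children.values())
    return max(
        children,
        key=lambda a: puct_score(
            children[a]["q"],
            children[a]["prior"],
            parent_visits,
            children[a]["visits"],
            c_puct,
        ),
    )


# Hypothetical expansion options at one node of the search tree.
children = {
    "add_tail_block_A": {"q": 0.40, "prior": 0.2, "visits": 10},
    "add_tail_block_B": {"q": 0.10, "prior": 0.7, "visits": 0},
    "add_head_block_C": {"q": 0.35, "prior": 0.1, "visits": 6},
}

guided = select_child(children)               # prior-weighted exploration
greedy = select_child(children, c_puct=0.0)   # value estimate only
```

With the exploration term switched on, the unvisited action with a high policy prior is tried first; with `c_puct=0.0` the search falls back to the highest current value estimate.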

[AI-76] HT-HEDL: High-Throughput Hypothesis Evaluation in Description Logic

Link: https://arxiv.org/abs/2412.00802
Authors: Eyad Algahtani
Keywords: Description Logic, present High-Throughput Hypothesis, Evaluation, single hypothesis, Hypothesis
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:We present High-Throughput Hypothesis Evaluation in Description Logic (HT-HEDL). HT-HEDL is a high-performance hypothesis evaluation engine that accelerates hypothesis evaluation computations for inductive logic programming (ILP) learners using description logic (DL) for their knowledge representation; in particular, HT-HEDL targets accelerating computations for the $\mathcal{ALCQI}^{\mathcal{(D)}}$ DL language. HT-HEDL aggregates the computing power of multi-core CPUs with multi-GPUs to improve hypothesis computations at two levels: 1) the evaluation of a single hypothesis and 2) the evaluation of multiple hypotheses (i.e., a batch of hypotheses). At the first level, HT-HEDL uses a single GPU or a vectorized multi-threaded CPU to evaluate a single hypothesis. In vectorized multi-threaded CPU evaluation, classical (scalar) CPU multi-threading is combined with the CPU’s extended vector instruction set to extract more CPU-based performance. The experimental results revealed that HT-HEDL increased performance using CPU-based evaluation (on a single hypothesis): from 20.4-fold using classical multi-threading to $\sim$85-fold using vectorized multi-threading. In the GPU-based evaluation, HT-HEDL achieved speedups of up to $\sim$38-fold for single hypothesis evaluation using a single GPU. To accelerate the evaluation of multiple hypotheses, HT-HEDL combines, in parallel, GPUs with multi-core CPUs to increase evaluation throughput (number of evaluated hypotheses per second). The experimental results revealed that HT-HEDL increased evaluation throughput by up to 29.3-fold using two GPUs and up to $\sim$44-fold using two GPUs combined with a CPU’s vectorized multi-threaded evaluation.

[AI-77] A Comprehensive Guide to Explainable AI: From Classical Models to LLMs

Link: https://arxiv.org/abs/2412.00800
Authors: Weiche Hsieh, Ziqian Bi, Chuanqi Jiang, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Xinyuan Song, Tianyang Wang, Junjie Yang, Ming Li, Bowen Jing, Jintao Ren, Junhao Song, Han Xu, Hong-Ming Tseng, Yichao Zhang, Lawrence K.Q. Yan, Qian Niu, Silin Chen, Yunze Wang, Chia Xin Liang, Ming Liu
Keywords: Explainable Artificial Intelligence, Explainable Artificial, addresses the growing, enabling trust, decision-making processes
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explainable Artificial Intelligence (XAI) addresses the growing need for transparency and interpretability in AI systems, enabling trust and accountability in decision-making processes. This book offers a comprehensive guide to XAI, bridging foundational concepts with advanced methodologies. It explores interpretability in traditional models such as Decision Trees, Linear Regression, and Support Vector Machines, alongside the challenges of explaining deep learning architectures like CNNs, RNNs, and Large Language Models (LLMs), including BERT, GPT, and T5. The book presents practical techniques such as SHAP, LIME, Grad-CAM, counterfactual explanations, and causal inference, supported by Python code examples for real-world applications. Case studies illustrate XAI’s role in healthcare, finance, and policymaking, demonstrating its impact on fairness and decision support. The book also covers evaluation metrics for explanation quality, an overview of cutting-edge XAI tools and frameworks, and emerging research directions, such as interpretability in federated learning and ethical AI considerations. Designed for a broad audience, this resource equips readers with the theoretical insights and practical skills needed to master XAI. Hands-on examples and additional resources are available at the companion GitHub repository: this https URL.

[AI-78] A Cognac shot to forget bad memories: Corrective Unlearning in GNNs

Link: https://arxiv.org/abs/2412.00789
Authors: Varshita Kolipaka, Akshit Sinha, Debangan Mishra, Sumit Kumar, Arvindh Arun, Shashwat Goel, Ponnurangam Kumaraguru
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, graph unlearning, Graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. As graph data does not follow the independently and identically distributed (i.i.d) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, deteriorating the model’s performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of Corrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, Cognac, which can unlearn the effect of the manipulation set even when only 5% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set while being 8x more efficient. We hope our work guides GNN developers in fixing harmful effects due to issues in real-world data post-training.
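
The claim that manipulated data propagates through message passing can be seen in a toy sketch. It assumes a simple mean aggregator over scalar node features on a four-node path graph, which is an illustrative stand-in and not the GNN architecture or the Cognac method from the paper. All names and values are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One message-passing step: each node averages itself with its neighbors."""
    out = {}
    for node, x in features.items():
        group = [x] + [features[n] for n in neighbors[node]]
        out[node] = sum(group) / len(group)
    return out


def run(features, neighbors, steps):
    """Apply the aggregation step repeatedly, like stacking GNN layers."""
    for _ in range(steps):
        features = mean_aggregate(features, neighbors)
    return features


# Path graph 0 - 1 - 2 - 3 with identical clean features.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clean = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}

poisoned = dict(clean)
poisoned[0] = 9.0  # manipulate node 0 only

after_one = run(poisoned, neighbors, steps=1)
after_two = run(poisoned, neighbors, steps=2)
```

After one step only the direct neighbor of node 0 is distorted; after two steps the distortion reaches node 2 as well. This is why removing the manipulated node from the training set alone does not undo its effect on the representations of other nodes.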

[AI-79] A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series

Link: https://arxiv.org/abs/2412.00772
Authors: Xiangkai Ma, Xiaobin Hong, Wenzhong Li, Sanglu Lu
Keywords: empirical risk minimization, fundamental data mining, supervised training methods, Time series analysis, Time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Time series analysis is a fundamental data mining task, and supervised training methods based on empirical risk minimization have proven effective on specific tasks and datasets. However, the acquisition of well-annotated data is costly, and a large amount of unlabeled series data is under-utilized. Due to distributional shifts across domains and different patterns of interest across tasks, cross-domain multi-task migration of time series remains a significant challenge. To address these problems, this paper proposes a novel cross-domain pretraining method based on Wave Quantization (termed WQ4TS), which can be combined with any advanced time series model and applied to multiple downstream tasks. Specifically, we transfer the time series data from different domains into a common spectral latent space, and enable the model to learn the temporal pattern knowledge of different domains directly from the common space and utilize it for the inference of downstream tasks, thereby mitigating the challenge of heterogeneous cross-domain migration. Establishing the spectral latent space brings at least three benefits: cross-domain migration capability, which adapts to zero- and few-shot scenarios without relying on prior knowledge of the dataset; a generally compatible cross-domain migration framework that leaves the existing model structure unchanged; and robust modeling capability, which achieves SOTA results in multiple downstream tasks. To demonstrate the effectiveness of the proposed approach, we conduct extensive experiments on three important tasks: forecasting, imputation, and classification. Three common real-world data scenarios are also simulated: full-data, few-shot, and zero-shot. The proposed WQ4TS achieves the best performance on 87.5% of all tasks, and the average improvement of the metrics on all the tasks is up to 34.7%.

[AI-80] Learning to Forget using Hypernetworks NEURIPS’24

Link: https://arxiv.org/abs/2412.00761
Authors: Jose Miguel Lara Rangel, Stefan Schoepf, Jack Foster, David Krueger, Usman Anwar
Keywords: gaining increasing attention, remove adversarial data, adversarial data poisoning, data poisoning attacks, gaining increasing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: AdvML-Frontiers’24: The 3rd Workshop on New Frontiers in Adversarial Machine Learning@NeurIPS’24, Vancouver, CA

Abstract:Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a trained model while maintaining performance on the remaining data. This paper introduces HyperForget, a novel machine unlearning framework that leverages hypernetworks - neural networks that generate parameters for other networks - to dynamically sample models that lack knowledge of targeted data while preserving essential capabilities. Leveraging diffusion models, we implement two Diffusion HyperForget Networks and use them to sample unlearned models in proof-of-concept experiments. The unlearned models obtained zero accuracy on the forget set while preserving good accuracy on the retain sets, highlighting the potential of HyperForget for dynamic targeted data removal and a promising direction for developing adaptive machine unlearning algorithms.

[AI-81] Rethinking Cognition: Morphological Info-Computation and the Embodied Paradigm in Life and Artificial Intelligence

Link: https://arxiv.org/abs/2412.00751
Authors: Gordana Dodig-Crnkovic
Keywords: place Lorenzo Magnanis, Lorenzo Magnanis Eco-Cognitive, place Lorenzo, Magnanis Eco-Cognitive Computationalism, Lorenzo Magnanis
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This study aims to place Lorenzo Magnani's Eco-Cognitive Computationalism within the broader context of current work on information, computation, and cognition. Traditionally, cognition was believed to be exclusive to humans and a result of brain activity. However, recent studies reveal it as a fundamental characteristic of all life forms, ranging from single cells to complex multicellular organisms and their networks. Yet, the literature and general understanding of cognition still largely remain human-brain-focused, leading to conceptual gaps and incoherency. This paper presents a variety of computational (information processing) approaches, including an info-computational approach to cognition, where natural structures represent information and dynamical processes on natural structures are regarded as computation, relative to an observing cognizing agent. We model cognition as a web of concurrent morphological computations, driven by processes of self-assembly, self-organisation, and autopoiesis across physical, chemical, and biological domains. We examine recent findings linking morphological computation, morphogenesis, agency, basal cognition, extended evolutionary synthesis, and active inference. We establish a connection to Magnani's Eco-Cognitive Computationalism and the idea of computational domestication of ignorant entities. Novel theoretical and applied insights question the boundaries of conventional computational models of cognition. The traditional models prioritize symbolic processing and often neglect the inherent constraints and potentialities in the physical embodiment of agents on different levels of organization. Gaining a better info-computational grasp of cognitive embodiment is crucial for the advancement of fields such as biology, evolutionary studies, artificial intelligence, robotics, medicine, and more.

[AI-82] MERLIN: Multi-stagE query performance prediction for dynamic paRallel oLap pIpeliNe

Link: https://arxiv.org/abs/2412.00749
Authors: Kaixin Zhang, Hongzhi Wang, Kunkai Gu, Ziqi Li, Chunyu Zhao, Yingze Li, Yu Yan
Keywords: massive data analysis, OLAP database technology, High-performance OLAP database, Performance Prediction, Query Performance Prediction
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:High-performance OLAP database technology has emerged with the growing demand for massive data analysis. To achieve much higher performance, many DBMSs adopt sophisticated designs including SIMD operators, parallel execution, and dynamic pipeline modification. However, such advanced OLAP query execution mechanisms still lack targeted Query Performance Prediction (QPP) methods because most existing methods target conventional tree-shaped query plans and static serial executors. To address this problem, we propose MERLIN, a multi-stage query performance prediction method for high-performance OLAP DBMSs. MERLIN first establishes resource cost models for each physical operator. Then, it constructs a DAG that consists of a data-flow tree backbone and resource competition relationships among concurrent operators. After using a GAT with an extra attention mechanism to calibrate the cost, the cost vector tree is extracted and summarized by a TCN, ultimately enabling effective query performance prediction. Experimental results demonstrate that MERLIN yields higher performance prediction precision than existing methods.

[AI-83] Exploring Cognition through Morphological Info-Computational Framework

Link: https://arxiv.org/abs/2412.00748
Authors: Gordana Dodig-Crnkovic
Keywords: capability involving perception, uniquely human capability, human capability involving, involving perception, considered a uniquely
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditionally, cognition has been considered a uniquely human capability involving perception, memory, learning, reasoning, and problem-solving. However, recent research shows that cognition is a fundamental ability shared by all living beings, from single cells to complex organisms. This chapter takes an info-computational approach (ICON), viewing natural structures as information and the processes of change in these structures as computations. It is a relational framework dependent on the perspective of a cognizing observer/cognizer. Informational structures are properties of the material substrate, and when focusing on the behavior of the substrate, we discuss morphological computing (MC). ICON and MC are complementary perspectives for a cognizer. Information and computation are inseparably connected with cognition. This chapter explores research connecting nature as a computational structure for a cognizer, with morphological computation, morphogenesis, agency, extended cognition, and extended evolutionary synthesis, using examples of the free energy principle and active inference. It introduces theoretical and practical approaches challenging traditional computational models of cognition limited to abstract symbol processing, highlighting the computational capacities inherent in the material substrate (embodiment). Understanding the embodiment of cognition through its morphological computational basis is crucial for biology, evolution, intelligence theory, AI, robotics, and other fields.

[AI-84] A Cross-Scene Benchmark for Open-World Drone Active Tracking

Link: https://arxiv.org/abs/2412.00744
Authors: Haowei Sun, Jinwu Hu, Zhirui Zhang, Haoyuan Tian, Xinze Xie, Yufeng Wang, Zhuliang Yu, Xiaohua Xie, Mingkui Tan
Keywords: Visual Active Tracking, Drone Visual Active, Visual Active, motion system based, drone active tracking
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations, providing a more practical solution for effective tracking in dynamic environments. However, accurate Drone Visual Active Tracking using reinforcement learning remains challenging due to the absence of a unified benchmark, the complexity of open-world environments with frequent interference, and the diverse motion behavior of dynamic targets. To address these issues, we propose a unified cross-scene cross-domain benchmark for open-world drone active tracking called DAT. The DAT benchmark provides 24 visually complex environments to assess the algorithms’ cross-scene and cross-domain generalization abilities, and high-fidelity modeling of realistic robot dynamics. Additionally, we propose a reinforcement learning-based drone tracking method called R-VAT, which aims to improve drone tracking performance in complex scenarios. Specifically, inspired by curriculum learning, we introduce a Curriculum-Based Training strategy that progressively enhances the agent’s tracking performance in vast environments with complex interference. We design a goal-centered reward function to provide precise feedback to the drone agent, preventing targets farther from the center of view from receiving higher rewards than closer ones. This allows the drone to adapt to the diverse motion behavior of open-world targets. Experiments demonstrate that R-VAT achieves about a 400% improvement over the SOTA method in terms of the cumulative reward metric.
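
The goal-centered reward idea can be sketched as a function that decreases monotonically with the target's distance from the image center. The linear form, the normalization by the half-diagonal, and every name below are assumptions made for illustration; the abstract does not give the actual reward formula used by R-VAT.

```python
import math


def goal_centered_reward(target_xy, image_wh, visible=True):
    """Reward in [0, 1]: 1 at the image center, 0 at a corner or when the target is lost."""
    if not visible:
        return 0.0
    w, h = image_wh
    cx, cy = w / 2.0, h / 2.0
    dist = math.hypot(target_xy[0] - cx, target_xy[1] - cy)
    max_dist = math.hypot(cx, cy)  # distance from the center to a corner
    return max(0.0, 1.0 - dist / max_dist)


# Hypothetical 640x480 camera frame with the target at different pixel positions.
frame = (640, 480)
center = goal_centered_reward((320, 240), frame)
near = goal_centered_reward((360, 240), frame)
far = goal_centered_reward((600, 240), frame)
lost = goal_centered_reward((0, 0), frame, visible=False)
```

Because the reward depends only on the radial distance from the center, a target farther from the center can never score higher than a closer one, which is the property the abstract describes.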

[AI-85] Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective

Link: https://arxiv.org/abs/2412.00742
Authors: Yujie Mo, Zhihe Lu, Runpeng Yu, Xiaofeng Zhu, Xinchao Wang
Keywords: Self-supervised heterogeneous graph, shown promising potential, Self-supervised heterogeneous, diverse scenarios, heterogeneous graph learning
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios. However, while existing SHGL methods share a similar essence with clustering approaches, they encounter two significant limitations: (i) noise in graph structures is often introduced during the message-passing process, weakening node representations, and (ii) cluster-level information may be inadequately captured and leveraged, diminishing performance in downstream tasks. In this paper, we address these limitations by theoretically revisiting SHGL from the spectral clustering perspective and introducing a novel framework enhanced by rank and dual consistency constraints. Specifically, our framework incorporates a rank-constrained spectral clustering method that refines the affinity matrix to exclude noise effectively. Additionally, we integrate node-level and cluster-level consistency constraints that concurrently capture invariant and clustering information to facilitate learning in downstream tasks. We theoretically demonstrate that the learned representations are divided into distinct partitions based on the number of classes and exhibit enhanced generalization ability across tasks. Experimental results affirm the superiority of our method, showcasing remarkable improvements in several downstream tasks compared to existing methods.

[AI-86] Free and Customizable Code Documentation with LLMs: A Fine-Tuning Approach

Link: https://arxiv.org/abs/2412.00726
Authors: Sayak Chakrabarty, Souradip Pal
Keywords: Automated documentation, programming source code, task with significant, significant practical, practical and scientific
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Automated documentation of programming source code is a challenging task with significant practical and scientific implications for the developer community. We present a large language model (LLM)-based application that developers can use as a support tool to generate basic documentation for any publicly available repository. Over the last decade, several papers have been written on generating documentation for source code using neural network architectures. With the recent advancements in LLM technology, some open-source applications have been developed to address this problem. However, these applications typically rely on the OpenAI APIs, which incur substantial financial costs, particularly for large repositories. Moreover, none of these open-source applications offers a fine-tuned model or features that let users fine-tune one. Additionally, finding suitable data for fine-tuning is often challenging. Our application addresses these issues and is available at this https URL.

[AI-87] Decision Transformer vs. Decision Mamba: Analysing the Complexity of Sequential Decision Making in Atari Games

Link: https://arxiv.org/abs/2412.00725
Authors: Ke Yan
Keywords: Decision Transformer, Decision Mamba, Atari games, visual complexity, Decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work analyses the disparity in performance between Decision Transformer (DT) and Decision Mamba (DM) in sequence modelling reinforcement learning tasks for different Atari games. The study first observed that DM generally outperformed DT in the games Breakout and Qbert, while DT performed better in more complicated games, such as Hero and Kung Fu Master. To understand these differences, we expanded the number of games to 12 and performed a comprehensive analysis of game characteristics, including action space complexity, visual complexity, average trajectory length, and average steps to the first non-zero reward. In order to further analyse the key factors that impact the disparity in performance between DT and DM, we employ various approaches, including quantifying visual complexity, random forest regression, correlation analysis, and action space simplification strategies. The results indicate that the performance gap between DT and DM is affected by the complex interaction of multiple factors, with the complexity of the action space and visual complexity (particularly evaluated by compression ratio) being the primary determining factors. DM performs well in environments with simple action and visual elements, while DT shows an advantage in games with higher action and visual complexity. Our findings contribute to a deeper understanding of how the game characteristics affect the performance difference in sequential modelling reinforcement learning, potentially guiding the development of future model design and applications for diverse and complex environments.
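The abstract says visual complexity is evaluated in particular by compression ratio. A minimal sketch of that kind of measurement follows, assuming zlib as the compressor and raw 84x84 grayscale bytes as the frame format; neither choice is stated in the abstract, and the two frames are synthetic stand-ins rather than Atari data.

```python
import random
import zlib


def compression_ratio(frame_bytes: bytes) -> float:
    """Compressed size divided by raw size; lower means a more redundant frame."""
    if not frame_bytes:
        raise ValueError("frame must be non-empty")
    return len(zlib.compress(frame_bytes, 9)) / len(frame_bytes)


# Hypothetical frames: a uniform image and a pseudo-random one.
flat_frame = bytes(84 * 84)  # all zeros
rng = random.Random(0)
noisy_frame = bytes(rng.randrange(256) for _ in range(84 * 84))

print(compression_ratio(flat_frame) < compression_ratio(noisy_frame))
```

Averaging this ratio over sampled frames of a game would give one scalar per game that can then be fed into a correlation or regression analysis.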

[AI-88] AdaScale: Dynamic Context-aware DNN Scaling via Automated Adaptation Loop on Mobile Devices

Link: https://arxiv.org/abs/2412.00724
Authors: Yuzhan Wang, Sicong Liu, Bin Guo, Boqi Zhang, Ke Ma, Yasan Ding, Hao Luo, Yao Li, Zhiwen Yu
Keywords: reshaping mobile applications, deploying deep neural, learning is reshaping, growing trend, trend of deploying
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning is reshaping mobile applications, with a growing trend of deploying deep neural networks (DNNs) directly to mobile and embedded devices to address real-time performance and privacy. To accommodate local resource limitations, techniques like weight compression, convolution decomposition, and specialized layer architectures have been developed. However, the dynamic and diverse deployment contexts of mobile devices pose significant challenges. Adapting deep models to meet varied device-specific requirements for latency, accuracy, memory, and energy is labor-intensive. Additionally, changing processor states, fluctuating memory availability, and competing processes frequently necessitate model re-compression to preserve user experience. To address these issues, we introduce AdaScale, an elastic inference framework that automates the adaptation of deep models to dynamic contexts. AdaScale leverages a self-evolutionary model to streamline network creation, employs diverse compression operator combinations to reduce the search space and improve outcomes, and integrates a resource availability awareness block and performance profilers to establish an automated adaptation loop. Our experiments demonstrate that AdaScale significantly enhances accuracy by 5.09%, reduces training overhead by 66.89%, speeds up inference latency by 1.51 to 6.2 times, and lowers energy costs by 4.69 times.

[AI-89] Protect Your Secrets: Understanding and Measuring Data Exposure in VSCode Extensions

Link: https://arxiv.org/abs/2412.00707
Authors: Yue Liu, Chakkrit Tantithamthavorn, Li Li
Keywords: Integrated Development Environments, Visual Studio Code, modern Integrated Development, Development Environments, Integrated Development
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Recent years have witnessed the emerging trend of extensions in modern Integrated Development Environments (IDEs) like Visual Studio Code (VSCode) that significantly enhance developer productivity. In particular, popular AI coding assistants like GitHub Copilot and Tabnine provide conveniences like automated code completion and debugging. While these extensions offer numerous benefits, they may introduce privacy and security concerns to software developers. However, there is no existing work that systematically analyzes the security and privacy concerns, including the risks of data exposure in VSCode extensions. In this paper, we investigate the security issues of cross-extension interactions in VSCode and shed light on the vulnerabilities caused by data exposure among different extensions. Our study uncovers high-impact security flaws that could allow adversaries to stealthily acquire or manipulate credential-related data (e.g., passwords, API keys, access tokens) from other extensions if not properly handled by extension vendors. To measure their prevalence, we design a novel automated risk detection framework that leverages program analysis and natural language processing techniques to automatically identify potential risks in VSCode extensions. By applying our tool to 27,261 real-world VSCode extensions, we discover that 8.5% of them (i.e., 2,325 extensions) are exposed to credential-related data leakage through various vectors, such as commands, user input, and configurations. Our study sheds light on the security challenges and flaws of the extension-in-IDE paradigm and provides suggestions and recommendations for improving the security of VSCode extensions and mitigating the risks of data exposure.

[AI-90] The Advancement of Personalized Learning Potentially Accelerated by Generative AI

Link: https://arxiv.org/abs/2412.00691
Authors: Yuang Wei, Yuan-Hao Jiang, Jiayi Liu, Changyong Qi, Rui Jia
Keywords: development of Generative, GAI, Personalized learning, aspects of education, learning
Categories: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:The rapid development of Generative AI (GAI) has sparked revolutionary changes across various aspects of education. Personalized learning, a focal point and challenge in educational research, has also been influenced by the development of GAI. To explore GAI’s extensive impact on personalized learning, this study investigates its potential to enhance various facets of personalized learning through a thorough analysis of existing research. The research comprehensively examines GAI’s influence on personalized learning by analyzing its application across different methodologies and contexts, including learning strategies, paths, materials, environments, and specific analyses within the teaching and learning processes. Through this in-depth investigation, we find that GAI demonstrates exceptional capabilities in providing adaptive learning experiences tailored to individual preferences and needs. Utilizing different forms of GAI across various subjects yields superior learning outcomes. The article concludes by summarizing scenarios where GAI is applicable in educational processes and discussing strategies for leveraging GAI to enhance personalized learning, aiming to guide educators and learners in effectively utilizing GAI to achieve superior learning objectives.

[AI-91] Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2412.00661
Authors: Emile Anand, Ishani Karmarkar, Guannan Qu
Keywords: Designing efficient algorithms, fundamentally challenging due, multi-agent reinforcement learning, Designing efficient, multi-agent reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 48 pages, 7 figures

Abstract:Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces is exponentially large in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm SUBSAMPLE-MFQ (Subsample-Mean-Field-Q-learning) and a decentralized randomized policy for a system with $n$ agents. For $k \leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We show that this learned policy converges to the optimal policy in the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We validate our method empirically on Gaussian squeeze and global exploration settings.

[AI-92] Improving Vietnamese Legal Document Retrieval using Synthetic Data

Link: https://arxiv.org/abs/2412.00657
Authors: Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet
Keywords: accurate question-answering systems, effective embedding-based models, Vietnamese legal, effective embedding-based, question-answering systems
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the field of legal information retrieval, effective embedding-based models are essential for accurate question-answering systems. However, the scarcity of large annotated datasets poses a significant challenge, particularly for Vietnamese legal texts. To address this issue, we propose a novel approach that leverages large language models to generate high-quality, diverse synthetic queries for Vietnamese legal passages. This synthetic data is then used to pre-train retrieval models, specifically bi-encoder and ColBERT, which are further fine-tuned using contrastive loss with mined hard negatives. Our experiments demonstrate that these enhancements lead to strong improvement in retrieval accuracy, validating the effectiveness of synthetic data and pre-training techniques in overcoming the limitations posed by the lack of large labeled datasets in the Vietnamese legal domain.
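To make "contrastive loss with mined hard negatives" concrete, here is a minimal sketch of an InfoNCE-style loss for one query. The InfoNCE form, the temperature of 0.05, and the toy two-dimensional embeddings are all assumptions; the abstract does not specify which contrastive objective the authors use.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # unrelated passage (hypothetical)
hard_negative = [0.8, 0.2]   # near-miss passage, e.g. mined by a first-stage retriever

loss_easy = contrastive_loss(query, positive, [easy_negative])
loss_hard = contrastive_loss(query, positive, [hard_negative])
print(loss_easy < loss_hard)
```

The hard negative yields the larger loss, which is why mined hard negatives give a stronger training signal than random ones.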

[AI-93] Predictive Inference With Fast Feature Conformal Prediction

Link: https://arxiv.org/abs/2412.00653
Authors: Zihao Tang, Boyuan Wang, Chuan Wen, Jiaye Teng
Keywords: Feature Conformal Prediction, Conformal prediction, deploys conformal prediction, Feature Conformal, Fast Feature Conformal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Conformal prediction is widely adopted in uncertainty quantification, due to its post-hoc, distribution-free, and model-agnostic properties. In the realm of modern deep learning, researchers have proposed Feature Conformal Prediction (FCP), which deploys conformal prediction in a feature space, yielding reduced band lengths. However, the practical utility of FCP is limited due to the time-consuming non-linear operations required to transform confidence bands from feature space to output space. In this paper, we introduce Fast Feature Conformal Prediction (FFCP), which features a novel non-conformity score and is convenient for practical applications. FFCP serves as a fast version of FCP, in that it equivalently employs a Taylor expansion to approximate the aforementioned non-linear operations in FCP. Empirical validations showcase that FFCP performs comparably with FCP (both outperforming the vanilla version) while achieving a significant reduction in computational time by approximately 50x. The code is available at this https URL
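For readers unfamiliar with the baseline this abstract builds on, the sketch below shows ordinary split conformal prediction for regression, the "vanilla version" the abstract compares against. It is not FCP or FFCP. The absolute-residual score and the calibration values are illustrative assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(point_prediction, q):
    """Symmetric band around the model's point prediction."""
    return (point_prediction - q, point_prediction + q)


# Hypothetical calibration residuals |y - f(x)| from a held-out set.
cal_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7]
q = conformal_quantile(cal_scores, alpha=0.25)
low, high = predict_interval(2.0, q)
print(q, low, high)
```

Under exchangeability of calibration and test points, the resulting interval covers the true value with probability at least 1 - alpha, regardless of the underlying model.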

[AI-94] ARChef: An iOS-Based Augmented Reality Cooking Assistant Powered by Multimodal Gemini LLM

Link: https://arxiv.org/abs/2412.00627
Authors: Rithik Vir, Parsa Madinei
Keywords: Augmented Reality, cookbooks and online, results in missing, Large Language Model, Google Gemini Large
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cooking meals can be difficult, causing many to use cookbooks and online recipes, which results in missing ingredients, nutritional hazards, and unsatisfactory meals. Using Augmented Reality (AR) can address this issue; however, current AR cooking applications have poor user interfaces and limited accessibility. This paper proposes a prototype of an iOS application that integrates AR and Computer Vision (CV) into the cooking process. We leverage Google’s Gemini Large Language Model (LLM) to identify ingredients based on the camera’s field of vision, and generate recipe choices with their nutritional information. Additionally, this application uses Apple’s ARKit to create an AR user interface compatible with iOS devices. Users can personalize their meal suggestions by inputting their dietary preferences and rating each meal. The application’s effectiveness is evaluated through user experience surveys. This application contributes to the field of accessible cooking assistance technologies, aiming to reduce food wastage and improve the meal planning experience.

[AI-95] Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance

Link: https://arxiv.org/abs/2412.00621
Authors: Chen-Wei Chang, Shailik Sarkar, Shutonu Mitra, Qi Zhang, Hossein Salemi, Hemant Purohit, Fengxiu Zhang, Michin Hong, Jin-Hee Cho, Chang-Tien Lu
Keywords: Large Language Models, trust Large Language, Language Models, Large Language, trust Large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 4 pages, 2024 IEEE International Conference on Big Data workshop BigEACPS 2024

Abstract:Can we trust Large Language Models (LLMs) to accurately detect scams? This paper investigates the vulnerabilities of LLMs when facing adversarial scam messages for the task of scam detection. We addressed this issue by creating a comprehensive dataset with fine-grained labels of scam messages, including both original and adversarial scam messages. The dataset extended traditional binary classes for the scam detection task into more nuanced scam types. Our analysis showed how adversarial examples took advantage of vulnerabilities of an LLM, leading to a high misclassification rate. We evaluated the performance of LLMs on these adversarial scam messages and proposed strategies to improve their robustness.

[AI-96] Leveraging LLM for Automated Ontology Extraction and Knowledge Graph Generation

Link: https://arxiv.org/abs/2412.00608
Authors: Mohammad Sadeq Abolhasani, Rong Pan
Keywords: Reliability and Maintainability, complex technical documents, Extracting relevant, Large Language Models, complex technical
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Extracting relevant and structured knowledge from large, complex technical documents within the Reliability and Maintainability (RAM) domain is labor-intensive and prone to errors. Our work addresses this challenge by presenting OntoKGen, a genuine pipeline for ontology extraction and Knowledge Graph (KG) generation. OntoKGen leverages Large Language Models (LLMs) through an interactive user interface guided by our adaptive iterative Chain of Thought (CoT) algorithm to ensure that the ontology extraction process and, thus, KG generation align with user-specific requirements. Although KG generation follows a clear, structured path based on the confirmed ontology, there is no universally correct ontology, as it is inherently based on the user’s preferences. OntoKGen recommends an ontology grounded in best practices, minimizing user effort and providing valuable insights that may have been overlooked, all while giving the user complete control over the final ontology. Having generated the KG based on the confirmed ontology, OntoKGen enables seamless integration into schemaless, non-relational databases like Neo4j. This integration allows for flexible storage and retrieval of knowledge from diverse, unstructured sources, facilitating advanced querying, analysis, and decision-making. Moreover, the generated KG serves as a robust foundation for future integration into Retrieval Augmented Generation (RAG) systems, offering enhanced capabilities for developing domain-specific intelligent applications.

[AI-97] Fairness at Every Intersection: Uncovering and Mitigating Intersectional Biases in Multimodal Clinical Predictions

Link: https://arxiv.org/abs/2412.00606
Authors: Resmi Ramachandranpillai, Kishore Sampath, Ayaazuddin Mohammad, Malihe Alikhani
Keywords: Electronic Healthcare Records, Healthcare Records, Electronic Healthcare, impose significant disparities, decision-making using Electronic
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Biases in automated clinical decision-making using Electronic Healthcare Records (EHR) impose significant disparities in patient care and treatment outcomes. Conventional approaches have primarily focused on bias mitigation strategies stemming from single attributes, overlooking intersectional subgroups – groups formed across various demographic intersections (such as race, gender, ethnicity, etc.). Applying single-attribute mitigation strategies to intersectional subgroups becomes statistically irrelevant due to the varying distribution and bias patterns across these subgroups. The multimodal nature of EHR – data from various sources such as combinations of text, time series, tabular, events, and images – adds another layer of complexity as the influence on minority groups may fluctuate across modalities. In this paper, we take the initial steps to uncover potential intersectional biases in predictions by sourcing extensive multimodal datasets, MIMIC-Eye and MIMIC-IV ED, and propose mitigation at the intersectional subgroup level. We perform and benchmark downstream tasks and bias evaluation on the datasets by learning a unified text representation from multimodal sources, harnessing the enormous capabilities of the pre-trained clinical Language Models (LM), MedBERT, Clinical BERT, and Clinical BioBERT. Our findings indicate that the proposed sub-group-specific bias mitigation is robust across different datasets, subgroups, and embeddings, demonstrating effectiveness in addressing intersectional biases in multimodal settings.

[AI-98] Audio Atlas: Visualizing and Exploring Audio Datasets

Link: https://arxiv.org/abs/2412.00591
Authors: Luca A. Lanzendörfer, Florian Grötschla, Uzeyir Valizada, Roger Wattenhofer
Keywords: interactive web application, introduce Audio Atlas, Audio Atlas, visualizing audio data, Audio
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Extended Abstract at ISMIR 2024

Abstract:We introduce Audio Atlas, an interactive web application for visualizing audio data using text-audio embeddings. Audio Atlas is designed to facilitate the exploration and analysis of audio datasets using a contrastive embedding model and a vector database for efficient data management and semantic search. The system maps audio embeddings into a two-dimensional space and leverages DeepScatter for dynamic visualization. Designed for extensibility, Audio Atlas allows easy integration of new datasets, enabling users to better understand their audio data and identify both patterns and outliers. We open-source the codebase of Audio Atlas, and provide an initial implementation containing various audio and music datasets.

[AI-99] Turing Representational Similarity Analysis (RSA): A Flexible Method for Measuring Alignment Between Human and Artificial Intelligence

Link: https://arxiv.org/abs/2412.00577
Authors: Mattson Ogg, Ritwik Bose, Jamie Scharf, Christopher Ratto, Michael Wolmetz
Keywords: entrusting Large Language, Large Language Models, Large Language, Vision Language Model, decision-making roles
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As we consider entrusting Large Language Models (LLMs) with key societal and decision-making roles, measuring their alignment with human cognition becomes critical. This requires methods that can assess how these systems represent information and facilitate comparisons to human understanding across diverse tasks. To meet this need, we developed Turing Representational Similarity Analysis (RSA), a method that uses pairwise similarity ratings to quantify alignment between AIs and humans. We tested this approach on semantic alignment across text and image modalities, measuring how different Large Language and Vision Language Model (LLM and VLM) similarity judgments aligned with human responses at both group and individual levels. GPT-4o showed the strongest alignment with human performance among the models we tested, particularly when leveraging its text processing capabilities rather than image processing, regardless of the input modality. However, no model we studied adequately captured the inter-individual variability observed among human participants. This method helped uncover certain hyperparameters and prompts that could steer model behavior to have more or less human-like qualities at an inter-individual or group level. Turing RSA enables the efficient and flexible quantification of human-AI alignment and complements existing accuracy-based benchmark tasks. We demonstrate its utility across multiple modalities (words, sentences, images) for understanding how LLMs encode knowledge and for examining representational alignment with human cognition.
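One common way to turn pairwise similarity ratings into an alignment score is to rank-correlate the human ratings with a model's ratings over the same item pairs. The sketch below does this with a hand-written Spearman correlation. The use of Spearman's rho, the item pairs, and the rating values are assumptions for illustration; the abstract does not state which statistic Turing RSA uses.

```python
import math


def ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


# Hypothetical similarity ratings for (cat, dog), (cat, car), (dog, car).
human = [0.9, 0.2, 0.1]
model_a = [0.8, 0.3, 0.2]  # same ordering as the human ratings
model_b = [0.1, 0.5, 0.9]  # reversed ordering

print(spearman(human, model_a), spearman(human, model_b))
```

Computing the score against group-averaged ratings versus a single participant's ratings would correspond to the group-level and individual-level comparisons the abstract mentions.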

[AI-100] Opus: A Large Work Model for Complex Workflow Generation

Link: https://arxiv.org/abs/2412.00573
Authors: Théo Fagnoni, Bellinda Mesbah, Mahsun Altin, Phillip Kingston
Keywords: Business Process Outsourcing, complex Business Process, Work Knowledge Graph, established industry processes, optimizing Workflows tailored
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 9 figures

Abstract:This paper introduces Opus, a novel framework for generating and optimizing Workflows tailored to complex Business Process Outsourcing (BPO) use cases, focusing on cost reduction and quality enhancement while adhering to established industry processes and operational constraints. Our approach generates executable Workflows from Intention, defined as the alignment of Client Input, Client Output, and Process Context. These Workflows are represented as Directed Acyclic Graphs (DAGs), with nodes as Tasks consisting of sequences of executable Instructions, including tools and human expert reviews. We adopt a two-phase methodology: Workflow Generation and Workflow Optimization. In the Generation phase, Workflows are generated using a Large Work Model (LWM) informed by a Work Knowledge Graph (WKG) that encodes domain-specific procedural and operational knowledge. In the Optimization phase, Workflows are transformed into Workflow Graphs (WFGs), where optimal Workflows are determined through path optimization. Our experiments demonstrate that state-of-the-art Large Language Models (LLMs) face challenges in reliably retrieving detailed process data as well as generating industry-compliant workflows. The key contributions of this paper include:
- The integration of a Work Knowledge Graph (WKG) into a Large Work Model (LWM), enabling the generation of context-aware, semantically aligned, structured and auditable Workflows.
- A two-phase approach that combines Workflow Generation from Intention with graph-based Workflow Optimization.
- Opus Alpha 1 Large and Opus Alpha 1 Small, models that outperform state-of-the-art LLMs by 38% and 29% respectively in Workflow Generation for a Medical Coding use case.
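The "path optimization" over a workflow DAG described in this abstract can be pictured as a cheapest-path search. The sketch below is a generic minimum-cost path over a DAG using the standard library's `graphlib`. The task names, edge costs, and the choice of additive cost as the objective are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter


def cheapest_path(edges, source, target):
    """edges maps node -> {successor: cost}. Returns (cost, path) or None."""
    # TopologicalSorter expects node -> set of predecessors.
    predecessors = {}
    for node, successors in edges.items():
        predecessors.setdefault(node, set())
        for succ in successors:
            predecessors.setdefault(succ, set()).add(node)

    best = {source: (0.0, [source])}
    # Visit nodes so that every predecessor is finalized before its successors.
    for node in TopologicalSorter(predecessors).static_order():
        if node not in best:
            continue  # not reachable from the source
        cost, path = best[node]
        for succ, weight in edges.get(node, {}).items():
            if succ not in best or cost + weight < best[succ][0]:
                best[succ] = (cost + weight, path + [succ])
    return best.get(target)


# Hypothetical workflow graph: two alternative routes from intake to QA.
workflow = {
    "intake": {"auto_code": 2.0, "human_review": 5.0},
    "auto_code": {"qa": 1.0},
    "human_review": {"qa": 0.5},
    "qa": {},
}
print(cheapest_path(workflow, "intake", "qa"))
```

Because the graph is acyclic, a single pass in topological order suffices, with no need for a priority queue as in Dijkstra's algorithm.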

[AI-101] Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection

Link: https://arxiv.org/abs/2412.00560
Authors: Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Keywords: anomaly detection, Aberrance Retention Quotient, Overfitting, long been stigmatized, stigmatized as detrimental
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Overfitting has long been stigmatized as detrimental to model performance, especially in the context of anomaly detection. Our work challenges this conventional view by introducing a paradigm shift, recasting overfitting as a controllable and strategic mechanism for enhancing model discrimination capabilities. In this paper, we present Controllable Overfitting-based Anomaly Detection (COAD), a novel framework designed to leverage overfitting for optimized anomaly detection. We propose the Aberrance Retention Quotient (ARQ), a novel metric that systematically quantifies the extent of overfitting, enabling the identification of an optimal “golden overfitting interval.” Within this interval, overfitting is leveraged to significantly amplify the model’s sensitivity to anomalous patterns, while preserving generalization to normal samples. Additionally, we present the Relative Anomaly Distribution Index (RADI), an innovative metric designed to complement AUROC pixel by providing a more versatile and theoretically robust framework for assessing model performance. RADI leverages ARQ to track and evaluate how overfitting impacts anomaly detection, offering an integrated approach to understanding the relationship between overfitting dynamics and model efficacy. Our theoretical work also rigorously validates the use of Gaussian noise in pseudo anomaly synthesis, providing the foundation for its broader applicability across diverse domains. Empirical evaluations demonstrate that our controllable overfitting method not only achieves State of the Art (SOTA) performance in both one-class and multi-class anomaly detection tasks but also redefines overfitting from a modeling challenge into a powerful tool for optimizing anomaly detection.

[AI-102] FullStack Bench: Evaluating LLMs as Full Stack Coder

Link: https://arxiv.org/abs/2412.00535
Authors: Siyao Liu, He Zhu, Jerry Liu, Shulin Xin, Aoyan Li, Rui Long, Li Chen, Jack Yang, Jinxiang Xia, Z.Y. Peng, Shukai Liu, Zhaoxiang Zhang, Ge Zhang, Wenhao Huang, Kai Shen, Liang Xiang
Keywords: FullStack Bench, large language models, continue to expand, diverse code intelligence, rapidly increasing
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.

[AI-103] Towards Fault Tolerance in Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2412.00534
Authors: Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao
Keywords:
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 13 figures

[AI-104] LAMBDA: Covering the Multimodal Critical Scenarios for Automated Driving Systems by Search Space Quantization

Link: https://arxiv.org/abs/2412.00517
Authors: Xinzheng Wu, Junyi Chen, Xingyu Xing, Jian Sun, Ye Tian, Lihao Liu, Yong Shen
Keywords:
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO)
Comments: 17 pages, 21 figures

[AI-105] Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Link: https://arxiv.org/abs/2412.00508
Authors: Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann
Keywords: P&ID development, P&ID development time, reduce P&ID development, accelerate P&ID development, important but tedious
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes, but their effectiveness on industry-relevant case studies still needs to be investigated.

[AI-106] Homeostazis and Sparsity in Transformer

链接: https://arxiv.org/abs/2412.00503
作者: Leonid Kotyuzanskiy,Artem Klimov
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-107] Improved Cleanup and Decoding of Fractional Power Encodings

链接: https://arxiv.org/abs/2412.00488
作者: Alicia Bremer,Jeff Orchard
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

[AI-108] On the Conditions for Domain Stability for Machine Learning: a Mathematical Approach

链接: https://arxiv.org/abs/2412.00464
作者: Gabriel Pedroza
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 8 pages including references, no figures

点击查看摘要

[AI-109] Benchmark Real-time Adaptation and Communication Capabilities of Embodied Agent in Collaborative Scenarios

链接: https://arxiv.org/abs/2412.00435
作者: Shipeng Liu,Boshen Zhang,Zhehui Huang
关键词-EN:
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
*备注: 16 pages, 8 figures

点击查看摘要

[AI-110] Predictive Models in Sequential Recommendations: Bridging Performance Laws with Data Quality Insights

链接: https://arxiv.org/abs/2412.00430
作者: Tingjia Shen,Hao Wang,Chuhan Wu,Jin Yao Chin,Wei Guo,Yong Liu,Huifeng Guo,Defu Lian,Ruiming Tang,Enhong Chen
关键词-EN:
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 12 pages, 5 figures

点击查看摘要

[AI-111] Mixture of Experts for Node Classification

链接: https://arxiv.org/abs/2412.00418
作者: Yu Shi,Yiqi Wang,WeiXuan Lang,Jiaxin Zhang,Pan Dong,Aiping Li
关键词-EN:
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-112] Federated Progressive Self-Distillation with Logits Calibration for Personalized IIoT Edge Intelligence

链接: https://arxiv.org/abs/2412.00410
作者: Yingchao Wang,Wenqi Niu
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 11 pages,5 figures

点击查看摘要

[AI-113] Fine-Tuning Pre-trained Large Time Series Models for Prediction of Wind Turbine SCADA Data

链接: https://arxiv.org/abs/2412.00403
作者: Yuwei Fan,Tao Song,Chenlong Feng,Keyu Song,Chao Liu,Dongxiang Jiang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

[AI-114] DroidCall: A Dataset for LLM-powered Android Intent Invocation

链接: https://arxiv.org/abs/2412.00402
作者: Weikai Xie,Li Zhang,Shihe Wang,Rongjie Yi,Mengwei Xu
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-115] Strategic Application of AIGC for UAV Trajectory Design: A Channel Knowledge Map Approach

链接: https://arxiv.org/abs/2412.00386
作者: Chiya Zhang,Ting Wang,Rubing Han,Yuanxiang Gong
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-116] Unified Parameter-Efficient Unlearning for LLMs

链接: https://arxiv.org/abs/2412.00383
作者: Chenlu Ding,Jiancan Wu,Yancheng Yuan,Jinda Lu,Kai Zhang,Alex Su,Xiang Wang,Xiangnan He
关键词-EN:
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-117] Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment

链接: https://arxiv.org/abs/2412.00373
作者: Dongfang Zhao
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

[AI-118] 2-Factor Retrieval for Improved Human-AI Decision Making in Radiology

链接: https://arxiv.org/abs/2412.00372
作者: Jim Solomon,Laleh Jalilian,Alexander Vilesov,Meryl Mathew,Tristan Grogan,Arash Bedayat,Achuta Kadambi
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-119] On the Role of Noise in Factorizers for Disentangling Distributed Representations NEURIPS2024

链接: https://arxiv.org/abs/2412.00354
作者: Geethan Karunaratne,Michael Hersche,Abu Sebastian,Abbas Rahimi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at Second Workshop on Machine Learning with New Compute Paradigms at 38th NeurIPS 2024 (MLNCP 2024)

点击查看摘要

[AI-120] CaDA: Cross-Problem Routing Solver with Constraint-Aware Dual-Attention

链接: https://arxiv.org/abs/2412.00346
作者: Han Li,Fei Liu,Zhi Zheng,Yu Zhang,Zhenkun Wang
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-121] Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

链接: https://arxiv.org/abs/2412.00342
作者: Nadeen Fathallah,Monika Bhole,Steffen Staab
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-122] MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI

链接: https://arxiv.org/abs/2412.00325
作者: Jongmin Jung,Andreas Jansson,Dasaem Jeong
关键词-EN:
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Late-breaking/demo (LBD) at ISMIR 2024. this https URL

点击查看摘要

[AI-123] Improving speaker verification robustness with synthetic emotional utterances

链接: https://arxiv.org/abs/2412.00319
作者: Nikhil Kumar Koditala,Chelsea Jui-Ting Ju,Ruirui Li,Minho Jin,Aman Chadha,Andreas Stolcke
关键词-EN:
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[AI-124] HiMoE: Heterogeneity-Informed Mixture-of-Experts for Fair Spatial-Temporal Forecasting

链接: https://arxiv.org/abs/2412.00316
作者: Shaohan Yu,Pan Deng,Yu Zhao,Junting Liu,Zi’ang Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

[AI-125] One Model for One Graph: A New Perspective for Pretraining with Cross-domain Graphs

链接: https://arxiv.org/abs/2412.00315
作者: Jingzhe Liu,Haitao Mao,Zhikai Chen,Wenqi Fan,Mingxuan Ju,Tong Zhao,Neil Shah,Jiliang Tang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

[AI-126] Raw Audio Classification with Cosine Convolutional Neural Network (CosCovNN)

链接: https://arxiv.org/abs/2412.00312
作者: Kazi Nazmul Haque,Rajib Rana,Tasnim Jarin,Bjorn W. Schuller Jr
关键词-EN:
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

[AI-127] BOTS: Batch Bayesian Optimization of Extended Thompson Sampling for Severely Episode-Limited RL Settings NEURIPS2024

链接: https://arxiv.org/abs/2412.00308
作者: Karine Karine,Susan A. Murphy,Benjamin M. Marlin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty

点击查看摘要

[AI-128] PlanCritic: Formal Planning with Human Feedback

链接: https://arxiv.org/abs/2412.00300
作者: Owen Burns,Dana Hughes,Katia Sycara
关键词-EN:
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 5 pages, 3 figures

点击查看摘要

[AI-129] Adaptformer: Sequence models as adaptive iterative planners

链接: https://arxiv.org/abs/2412.00293
作者: Akash Karthikeyan,Yash Vardhan Pant
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

[AI-130] Streamlining the review process: AI-generated annotations in research manuscripts

链接: https://arxiv.org/abs/2412.00281
作者: Oscar Díaz,Xabier Garmendia,Juanan Pereira
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-131] Average-Over-Time Spiking Neural Networks for Uncertainty Estimation in Regression

链接: https://arxiv.org/abs/2412.00278
作者: Tao Sun,Sander Bohté
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

[AI-132] Attribute-Enhanced Similarity Ranking for Sparse Link Prediction KDD

链接: https://arxiv.org/abs/2412.00261
作者: João Mattos,Zexi Huang,Mert Kosan,Ambuj Singh,Arlei Silva
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: To appear at the 31st SIGKDD Conference on Knowledge Discovery and Data Mining - Research Track (August 2024 Deadline)

点击查看摘要

[AI-133] Fine Tuning Large Language Models to Deliver CBT for Depression

链接: https://arxiv.org/abs/2412.00251
作者: Talha Tahir
关键词-EN:
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

[AI-134] Integrating Social Determinants of Health into Knowledge Graphs: Evaluating Prediction Bias and Fairness in Healthcare

链接: https://arxiv.org/abs/2412.00245
作者: Tianqi Shang,Weiqing He,Tianlong Chen,Ying Ding,Huanmei Wu,Kaixiong Zhou,Li Shen
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-135] Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model

链接: https://arxiv.org/abs/2412.00243
作者: Qiujing Lu,Meng Ma,Ximiao Dai,Xuanhan Wang,Shuo Feng
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-136] Generating a Low-code Complete Workflow via Task Decomposition and RAG

链接: https://arxiv.org/abs/2412.00239
作者: Orlando Marquez Ayala,Patrice Béchard
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Under review; 12 pages, 8 figures

点击查看摘要

[AI-137] An AI-Driven Data Mesh Architecture Enhancing Decision-Making in Infrastructure Construction and Public Procurement

链接: https://arxiv.org/abs/2412.00224
作者: Saurabh Mishra,Mahendra Shinde,Aniket Yadav,Bilal Ayyub,Anand Rao
关键词-EN:
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

[AI-138] Digital Twin in Industries: A Comprehensive Survey

链接: https://arxiv.org/abs/2412.00209
作者: Md Bokhtiar Al Zami,Shaba Shaon,Vu Khanh Quy,Dinh C. Nguyen
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-139] Towards the Ultimate Programming Language: Trust and Benevolence in the Age of Artificial Intelligence

链接: https://arxiv.org/abs/2412.00206
作者: Bartosz Sawicki,Michał Śmiałek,Bartłomiej Skowron
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: submitted to proceedings of “Ethics and AI” conference

点击查看摘要

[AI-140] Origin-Destination Demand Prediction: An Urban Radiation and Attraction Perspective

链接: https://arxiv.org/abs/2412.00167
作者: Xuan Ma,Zepeng Bao,Ming Zhong,Yuanyuan Zhu,Chenliang Li,Jiawei Jiang,Qing Li,Tieyun Qian
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-141] o1-Coder: an o1 Replication for Coding

链接: https://arxiv.org/abs/2412.00154
作者: Yuxiang Zhang,Shangxi Wu,Yuqi Yang,Jiangming Shu,Jinlin Xiao,Chao Kong,Jitao Sang
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-142] Dynamic Neural Curiosity Enhances Learning Flexibility for Autonomous Goal Discovery

链接: https://arxiv.org/abs/2412.00152
作者: Quentin Houbre,Roel Pieters
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-143] Knowledge-Augmented Explainable and Interpretable Learning for Anomaly Detection and Diagnosis

链接: https://arxiv.org/abs/2412.00146
作者: Martin Atzmueller,Tim Bohne,Patricia Windler
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 8 figures

点击查看摘要

[AI-144] Road User Classification from High-Frequency GNSS Data Using Distributed Edge Intelligence

链接: https://arxiv.org/abs/2412.00132
作者: Lennart Köpper,Thomas Wieland
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

[AI-145] Proceedings of the 2024 XCSP3 Competition

链接: https://arxiv.org/abs/2412.00117
作者: Gilles Audemard,Christophe Lecoutre,Emmanuel Lonca
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 104 pages

点击查看摘要

[AI-146] Boundary-Decoder network for inverse prediction of capacitor electrostatic analysis

链接: https://arxiv.org/abs/2412.00113
作者: Kart-Leong Lim,Rahul Dutta,Mihai Rotaru
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

[AI-147] Virtual Sensing to Enable Real-Time Monitoring of Inaccessible Locations & Unmeasurable Parameters

链接: https://arxiv.org/abs/2412.00107
作者: Kazuma Kobayashi,Farid Ahmed,Syed Bahauddin Alam
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 17 pages, 7 figures

点击查看摘要

[AI-148] Differential learning kinetics govern the transition from memorization to generalization during in-context learning

链接: https://arxiv.org/abs/2412.00104
作者: Alex Nguyen,Gautam Reddy
关键词-EN:
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

[AI-149] MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models

链接: https://arxiv.org/abs/2412.00103
作者: Angus Fung,Aaron Hao Tan,Haitong Wang,Beno Benhabib,Goldie Nejat
关键词-EN:
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-150] Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

链接: https://arxiv.org/abs/2412.00099
作者: Andrii Skliar,Ties van Rozendaal,Romain Lepert,Todor Boinovski,Mart van Baalen,Markus Nagel,Paul Whatmough,Babak Ehteshami Bejnordi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

[AI-151] Dual Prototyping with Domain and Class Prototypes for Affective Brain-Computer Interface in Unseen Target Conditions

链接: https://arxiv.org/abs/2412.00082
作者: Guangli Li,Zhehao Zhou,Tuo Sun,Ping Tan,Li Zhang,Zhen Liang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
*备注:

点击查看摘要

[AI-152] Recurrent Stochastic Configuration Networks with Hybrid Regularization for Nonlinear Dynamics Modelling

链接: https://arxiv.org/abs/2412.00070
作者: Gang Dang,Dianhui Wang
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
*备注:

点击查看摘要

[AI-153] Adaptive Coordinate-Wise Step Sizes for Quasi-Newton Methods: A Learning-to-Optimize Approach

链接: https://arxiv.org/abs/2412.00059
作者: Wei Lin,Qingyu Song,Hong Xu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

[AI-154] Creating Hierarchical Dispositions of Needs in an Agent

链接: https://arxiv.org/abs/2412.00044
作者: Tofara Moyo
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages

点击查看摘要

[AI-155] Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies NEURIPS2024

链接: https://arxiv.org/abs/2412.00033
作者: Frédéric Berdoz,Roger Wattenhofer
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted at NeurIPS 2024

点击查看摘要

[AI-156] Planning vs Reasoning: Ablations to Test Capabilities of LoRA layers

链接: https://arxiv.org/abs/2412.00029
作者: Neel Redkar
关键词-EN:
类目: Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures, preprint

点击查看摘要

[AI-157] Partitioning Message Passing for Graph Fraud Detection

链接: https://arxiv.org/abs/2412.00020
作者: Wei Zhuo,Zemin Liu,Bryan Hooi,Bingsheng He,Guang Tan,Rizal Fathony,Jia Chen
关键词-EN:
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

[AI-158] The use of knowledge in open-ended systems

链接: https://arxiv.org/abs/2412.00011
作者: Abigail Devereaux,Roger Koppl
关键词-EN:
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Theoretical Economics (econ.TH)
*备注: 44 pages, 0 figures

点击查看摘要

[AI-159] Randomized-Grid Search for Hyperparameter Tuning in Decision Tree Model to Improve Performance of Cardiovascular Disease Classification

链接: https://arxiv.org/abs/2411.18234
作者: Abhay Kumar Pathak,Mrityunjay Chaubey,Manjari Gupta
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Computation (stat.CO)
*备注:

点击查看摘要

[AI-160] Understanding complex crowd dynamics with generative neural simulators

链接: https://arxiv.org/abs/2412.01491
作者: Koen Minartz,Fleur Hendriks,Simon Martinus Koop,Alessandro Corbetta,Vlado Menkovski
关键词-EN:
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 26 pages, 6 figures

点击查看摘要

[AI-161] Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification

链接: https://arxiv.org/abs/2412.01195
作者: Bei Liu,Yanmin Qian
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

点击查看摘要

[AI-162] Representation Learning for Time-Domain High-Energy Astrophysics: Discovery of Extragalactic Fast X-ray Transient XRT 200515

链接: https://arxiv.org/abs/2412.01150
作者: Steven Dillmann,Rafael Martínez-Galarza,Roberto Soria,Rosanne Di Stefano,Vinay L. Kashyap
关键词-EN:
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 25 pages, submitted to Monthly Notices of the Royal Astronomical Society, presented at the 2023 Conference on Machine Learning in Astronomical Surveys (ML-IAP/CCA-2023)

点击查看摘要

[AI-163] Explicit and data-Efficient Encoding via Gradient Flow NEURIPS2024

链接: https://arxiv.org/abs/2412.00864
作者: Kyriakos Flouris,Anna Volokitin,Gustav Bredell,Ender Konukoglu
关键词-EN:
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
*备注: Machine Learning and the Physical Sciences Workshop, NeurIPS 2024

点击查看摘要

[AI-164] Well log data generation and imputation using sequence-based generative adversarial networks

链接: https://arxiv.org/abs/2412.00718
作者: Abdulrahman Al-Fakih,A. Koeshidayatullah,Tapan Mukerji,Sadam Al-Azani,SanLinn I. Kaka
关键词-EN:
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-165] Beyond Monte Carlo: Harnessing Diffusion Models to Simulate Financial Market Dynamics

链接: https://arxiv.org/abs/2412.00036
作者: Andrew Lesniewski,Giulio Trigila
关键词-EN:
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Portfolio Management (q-fin.PM)
*备注: 27 pages

点击查看摘要

[AI-166] Spatial-variant causal Bayesian inference for rapid seismic ground failures and impacts estimation

链接: https://arxiv.org/abs/2412.00026
作者: Xuechun Li,Susu Xu
关键词-EN:
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
*备注: This paper was accepted for 2024 WCEE conference

点击查看摘要

机器学习

[LG-0] Hard Constraint Guided Flow Matching for Gradient-Free Generation of PDE Solutions

链接: https://arxiv.org/abs/2412.01786
作者: Chaoran Cheng,Boran Han,Danielle C. Maddix,Abdul Fatir Ansari,Andrew Stuart,Michael W. Mahoney,Yuyang Wang
关键词-EN: Generative models, scientific and engineering, engineering applications, constrained generative models, satisfy hard constraints
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative models that satisfy hard constraints are crucial in many scientific and engineering applications where physical laws or system requirements must be strictly respected. However, many existing constrained generative models, especially those developed for computer vision, rely heavily on gradient information, often sparse or computationally expensive in fields like partial differential equations (PDEs). In this work, we introduce a novel framework for adapting pre-trained, unconstrained flow-matching models to satisfy constraints exactly in a zero-shot manner without requiring expensive gradient computations or fine-tuning. Our framework, ECI sampling, alternates between extrapolation (E), correction (C), and interpolation (I) stages during each iterative sampling step of flow matching sampling to ensure accurate integration of constraint information while preserving the validity of the generation. We demonstrate the effectiveness of our approach across various PDE systems, showing that ECI-guided generation strictly adheres to physical constraints and accurately captures complex distribution shifts induced by these constraints. Empirical results demonstrate that our framework consistently outperforms baseline approaches in various zero-shot constrained generation tasks and also achieves competitive results in the regression tasks without additional fine-tuning.

[LG-1] Transfer Learning for Control Systems via Neural Simulation Relations

链接: https://arxiv.org/abs/2412.01783
作者: Alireza Nadali,Bingzhuo Zhong,Ashutosh Trivedi,Majid Zamani
关键词-EN: leverage knowledge gained, target domain, improve speed, leverage knowledge, knowledge gained
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning is an umbrella term for machine learning approaches that leverage knowledge gained from solving one problem (the source domain) to improve speed, efficiency, and data requirements in solving a different but related problem (the target domain). The performance of the transferred model in the target domain is typically measured via some notion of loss function in the target domain. This paper focuses on effectively transferring control logic from a source control system to a target control system while providing approximately similar behavioral guarantees in both domains. However, in the absence of a complete characterization of behavioral specifications, this problem cannot be captured in terms of loss functions. To overcome this challenge, we use (approximate) simulation relations to characterize observational equivalence between the behaviors of two systems. Simulation relations ensure that the outputs of both systems, equipped with their corresponding controllers, remain close to each other over time, and their closeness can be quantified a priori. By parameterizing simulation relations with neural networks, we introduce the notion of neural simulation relations, which provides a data-driven approach to transfer any synthesized controller, regardless of the specification of interest, along with its proof of correctness. Compared with prior approaches, our method eliminates the need for a closed-loop mathematical model and specific requirements for both the source and target systems. We also introduce validity conditions that, when satisfied, guarantee the closeness of the outputs of two systems equipped with their corresponding controllers, thus eliminating the need for post-facto verification. We demonstrate the effectiveness of our approach through case studies involving a vehicle and a double inverted pendulum.

[LG-2] FERERO: A Flexible Framework for Preference-Guided Multi-Objective Learning

链接: https://arxiv.org/abs/2412.01773
作者: Lisha Chen,AFM Saif,Yanning Shen,Tianyi Chen
关键词-EN: specific preference-guided Pareto, preference-guided Pareto solutions, preference-guided Pareto, represent different trade-offs, critical yet challenging
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finding specific preference-guided Pareto solutions that represent different trade-offs among multiple objectives is critical yet challenging in multi-objective problems. Existing methods are restrictive in preference definitions and/or their theoretical guarantees. In this work, we introduce a Flexible framEwork for pREfeRence-guided multi-Objective learning (FERERO) by casting it as a constrained vector optimization problem. Specifically, two types of preferences are incorporated into this formulation – the relative preference defined by the partial ordering induced by a polyhedral cone, and the absolute preference defined by constraints that are linear functions of the objectives. To solve this problem, convergent algorithms are developed with both single-loop and stochastic variants. Notably, this is the first single-loop primal algorithm for constrained vector optimization to our knowledge. The proposed algorithms adaptively adjust to both constraint and objective values, eliminating the need to solve different subproblems at different stages of constraint satisfaction. Experiments on multiple benchmarks demonstrate the proposed method is very competitive in finding preference-guided optimal solutions. Code is available at this https URL.

[LG-3] Bluetooth Low Energy Dataset Using In-Phase and Quadrature Samples for Indoor Localization

链接: https://arxiv.org/abs/2412.01767
作者: Samuel G. Leitch,Qasim Zeeshan Ahmed,Ben Van Herbruggen,Mathias Baert,Jaron Fontaine,Eli De Poorter,Adnan Shahid,Pavlos I. Lazaridis
关键词-EN: output variables, significant challenge, challenge in research, collect a large, large amount
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:One significant challenge in research is to collect a large amount of data and learn the underlying relationship between the input and the output variables. This paper outlines the process of collecting and validating a dataset designed to determine the angle of arrival (AoA) using Bluetooth low energy (BLE) technology. The data, collected in a laboratory setting, is intended to approximate real-world industrial scenarios. This paper discusses the data collection process, the structure of the dataset, and the methodology adopted for automating sample labeling for supervised learning. The collected samples and the process of generating ground truth (GT) labels were validated using the Texas Instruments (TI) phase difference of arrival (PDoA) implementation on the data, yielding a mean absolute error (MAE) at one of the heights without obstacles of 25.71°. The distance estimation on BLE was implemented using a Gaussian Process Regression algorithm, yielding an MAE of 0.174 m.
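For readers unfamiliar with phase difference of arrival, the idea can be sketched with the textbook two-antenna relation sin(θ) = Δφ·λ / (2π·d), where Δφ is the phase difference between the antennas, λ the carrier wavelength and d the antenna spacing. This is only an illustrative sketch, not the TI implementation mentioned in the abstract; the function names, the 2.44 GHz default carrier and the half-wavelength spacing in the usage lines are assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def aoa_from_phase_difference(delta_phi_rad, spacing_m, freq_hz=2.44e9):
    """Textbook two-antenna angle-of-arrival estimate, in degrees.

    delta_phi_rad: measured phase difference between the two antennas (radians).
    spacing_m: antenna spacing in metres (assumed known).
    freq_hz: carrier frequency; 2.44 GHz is an assumed mid-band BLE value.
    """
    wavelength = SPEED_OF_LIGHT / freq_hz
    # Wrap the phase difference into [-pi, pi) before inverting the relation.
    wrapped = (delta_phi_rad + math.pi) % (2 * math.pi) - math.pi
    sin_theta = wrapped * wavelength / (2 * math.pi * spacing_m)
    # Clamp so that measurement noise cannot push the value outside asin's domain.
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.degrees(math.asin(sin_theta))


def mean_absolute_error(estimates, ground_truth):
    """Mean absolute error between estimated and ground-truth values."""
    return sum(abs(e - g) for e, g in zip(estimates, ground_truth)) / len(estimates)


# Usage with an assumed half-wavelength antenna spacing.
wavelength = SPEED_OF_LIGHT / 2.44e9
angle_deg = aoa_from_phase_difference(math.pi / 2, wavelength / 2)
```

With half-wavelength spacing the mapping from phase to angle is unambiguous over the full ±90° range, which is why that spacing is a common choice in antenna arrays.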

[LG-4] Structure-Guided Input Graph for GNNs facing Heterophily

链接: https://arxiv.org/abs/2412.01757
作者: Victor M. Tenorio,Madeline Navarro,Samuel Rey,Santiago Segarra,Antonio G. Marques
关键词-EN: Graph Neural Networks, Neural Networks, handle data exhibiting, GNN architectures, Graph Neural
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Presented as a conference paper in the Asilomar Conference on Signals, Systems, and Computers 2024

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a promising tool to handle data exhibiting an irregular structure. However, most GNN architectures perform well on homophilic datasets, where the labels of neighboring nodes are likely to be the same. In recent years, an increasing body of work has been devoted to the development of GNN architectures for heterophilic datasets, where labels do not exhibit this low-pass behavior. In this work, we create a new graph in which nodes are connected if they share structural characteristics, meaning a higher chance of sharing their labels, and then use this new graph in the GNN architecture. To do this, we compute the k-nearest neighbors graph according to distances between structural features, which are either (i) role-based, such as degree, or (ii) global, such as centrality measures. Experiments show that the labels are smoother in this newly defined graph and that the performance of GNN architectures improves when using this alternative structure.
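The graph-rewiring step described in this abstract can be sketched as follows. The sketch uses node degree as the only structural feature and a symmetrized k-nearest-neighbors rule; the helper names, the tie-breaking by node id and the toy graph are assumptions made for illustration and may differ from the authors' construction.

```python
import math


def node_degrees(adjacency):
    """Degree of each node in an undirected graph given as {node: set_of_neighbors}."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def knn_graph(features, k):
    """Connect every node to its k nearest nodes in feature space.

    features: {node: tuple_of_floats}. Ties are broken by node id (an assumption).
    The result is symmetrized, so it is an undirected graph.
    """
    nodes = sorted(features)
    new_adjacency = {node: set() for node in nodes}
    for node in nodes:
        others = [other for other in nodes if other != node]
        others.sort(key=lambda other: (math.dist(features[node], features[other]), other))
        for other in others[:k]:
            new_adjacency[node].add(other)
            new_adjacency[other].add(node)
    return new_adjacency


# Toy graph: hubs 0 and 5 are linked; leaves 1, 2 hang off hub 0 and leaves 3, 4 off hub 5.
original = {
    0: {1, 2, 5},
    1: {0},
    2: {0},
    3: {5},
    4: {5},
    5: {0, 3, 4},
}
degree_features = {node: (float(deg),) for node, deg in node_degrees(original).items()}
rewired = knn_graph(degree_features, k=1)
```

In the toy graph every leaf originally neighbors only a hub, while in the rewired graph hubs connect to hubs and leaves to leaves, which is the kind of structural grouping the abstract relies on for heterophilic data.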

[LG-5] Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios NEURIPS

链接: https://arxiv.org/abs/2412.01756
作者: Sangyeon Yoon,Wonje Jeung,Albert No
关键词-EN: Stochastic Gradient Descent, Private Stochastic Gradient, Gradient Descent, Stochastic Gradient, Differentially Private Stochastic
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures, NeurIPS (SFLLM Workshop)

点击查看摘要

Abstract:Auditing Differentially Private Stochastic Gradient Descent (DP-SGD) in the final model setting is challenging and often results in empirical lower bounds that are significantly looser than theoretical privacy guarantees. We introduce a novel auditing method that achieves tighter empirical lower bounds without additional assumptions by crafting worst-case adversarial samples through loss-based input-space auditing. Our approach surpasses traditional canary-based heuristics and is effective in both white-box and black-box scenarios. Specifically, with a theoretical privacy budget of ε = 10.0, our method achieves empirical lower bounds of 6.68 in white-box settings and 4.51 in black-box settings, compared to the baseline of 4.11 for MNIST. Moreover, we demonstrate that significant privacy auditing results can be achieved using in-distribution (ID) samples as canaries, obtaining an empirical lower bound of 4.33 where traditional methods produce near-zero leakage detection. Our work offers a practical framework for reliable and accurate privacy auditing in differentially private machine learning.
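An "empirical lower bound" on ε is commonly derived from the error rates of a membership test: for an (ε, δ)-differentially private mechanism, any such test satisfies FPR + e^ε·FNR ≥ 1 − δ and FNR + e^ε·FPR ≥ 1 − δ. The sketch below inverts these inequalities using point estimates of the error rates. It is a generic illustration rather than the auditing procedure of this paper, and practical audits usually replace the point estimates with confidence bounds; the function name and default δ are assumptions.

```python
import math


def empirical_epsilon_lower_bound(fpr, fnr, delta=1e-5):
    """Lower bound on epsilon implied by a membership test's error rates.

    fpr, fnr: false positive / false negative rates of the attack (point estimates,
              which is a simplifying assumption; audits typically use confidence bounds).
    delta: the delta of the (epsilon, delta)-DP guarantee being audited.
    """
    candidates = [0.0]  # epsilon is never negative
    for first_rate, second_rate in ((fpr, fnr), (fnr, fpr)):
        numerator = 1.0 - delta - first_rate
        if second_rate > 0 and numerator > 0:
            candidates.append(math.log(numerator / second_rate))
    return max(candidates)


# An attack that errs 1% of the time in both directions implies a sizeable epsilon,
# while an attack no better than coin flipping implies nothing.
strong_attack = empirical_epsilon_lower_bound(0.01, 0.01, delta=0.0)
coin_flip = empirical_epsilon_lower_bound(0.5, 0.5, delta=0.0)
```

The gap between such an empirical value and the theoretical budget is what the abstract refers to when it calls a bound "tighter" or "looser".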

[LG-6] CBOL-Tuner: Classifier-pruned Bayesian optimization to explore temporally structured latent spaces for particle accelerator tuning

链接: https://arxiv.org/abs/2412.01748
作者: Mahindra Rautela,Alan Williams,Alexander Scheinker
关键词-EN: Complex dynamical systems, time-consuming tuning procedures, Complex dynamical, Classifier-pruned Bayesian Optimization-based, Bayesian Optimization-based Latent
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Complex dynamical systems, such as particle accelerators, often require complicated and time-consuming tuning procedures for optimal performance. It may also be required that these procedures estimate the optimal system parameters, which govern the dynamics of a spatiotemporal beam – this can be a high-dimensional optimization problem. To address this, we propose a Classifier-pruned Bayesian Optimization-based Latent space Tuner (CBOL-Tuner), a framework for efficient exploration within a temporally-structured latent space. The CBOL-Tuner integrates a convolutional variational autoencoder (CVAE) for latent space representation, a long short-term memory (LSTM) network for temporal dynamics, a dense neural network (DNN) for parameter estimation, and a classifier-pruned Bayesian optimizer (C-BO) to adaptively search and filter the latent space for optimal solutions. CBOL-Tuner demonstrates superior performance in identifying multiple optimal settings and outperforms alternative global optimization methods.

[LG-7] FSMLP: Modelling Channel Dependencies With Simplex Theory Based Multi-Layer Perceptions In Frequency Domain

链接: https://arxiv.org/abs/2412.01654
作者: Zhengnan Li,Haoxuan Li,Hao Wang,Jun Fang,Duoyin Li Yunxiao Qin
关键词-EN: energy consumption prediction, textbf, web data analysis, including web data, Time series forecasting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting (TSF) plays a crucial role in various domains, including web data analysis, energy consumption prediction, and weather forecasting. While Multi-Layer Perceptrons (MLPs) are lightweight and effective for capturing temporal dependencies, they are prone to overfitting when used to model inter-channel dependencies. In this paper, we investigate the overfitting problem in channel-wise MLPs using Rademacher complexity theory, revealing that extreme values in time series data exacerbate this issue. To mitigate this issue, we introduce a novel Simplex-MLP layer, where the weights are constrained within a standard simplex. This strategy encourages the model to learn simpler patterns, thereby reducing overfitting to extreme values. Based on the Simplex-MLP layer, we propose a novel Frequency Simplex MLP (FSMLP) framework for time series forecasting, comprising two kinds of modules: Simplex Channel-Wise MLP (SCWM) and Frequency Temporal MLP (FTM). The SCWM effectively leverages the Simplex-MLP to capture inter-channel dependencies, while the FTM is a simple yet efficient temporal MLP designed to extract temporal information from the data.
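To make the simplex constraint concrete: weights in a standard simplex are non-negative and sum to one, so each output is a convex combination of the input channels and cannot exceed the largest input value. The sketch below enforces the constraint by passing unconstrained logits through a softmax. That parameterization is an assumption chosen for illustration and is not necessarily how the Simplex-MLP layer of the paper is implemented; all names are hypothetical.

```python
import math


def softmax(logits):
    """Map unconstrained logits to a point in the standard simplex."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shifted for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def simplex_channel_mix(channel_values, logit_rows):
    """Mix input channels with simplex-constrained weights.

    channel_values: one value per input channel at a given time step.
    logit_rows: one list of unconstrained logits per output channel (assumed learnable).
    """
    mixed = []
    for logits in logit_rows:
        weights = softmax(logits)  # non-negative, sums to 1
        mixed.append(sum(w * x for w, x in zip(weights, channel_values)))
    return mixed


# The third channel holds an extreme value; the mixed outputs stay within the input range.
inputs = [1.0, 2.0, 100.0]
outputs = simplex_channel_mix(inputs, [[0.0, 0.0, 0.0], [2.0, -1.0, 0.5]])
```

Because every output is bounded by the minimum and maximum of the inputs, an extreme value in one channel cannot be amplified by the mixing step, which matches the motivation given in the abstract.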

[LG-8] Review of Mathematical Optimization in Federated Learning

Link: https://arxiv.org/abs/2412.01630
Authors: Shusen Yang, Fangyuan Zhao, Zihao Zhou, Liang Shi, Xuebin Ren, Zongben Xu
Keywords: Federated Learning, popular interdisciplinary research, interdisciplinary research area, information sciences, popular interdisciplinary
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: To appear in CSIAM Transactions on Applied Mathematics (CSIAM-AM)

Click to view abstract

Abstract:Federated Learning (FL) has become a popular interdisciplinary research area in both applied mathematics and information sciences. Mathematically, FL aims to collaboratively optimize aggregate objective functions over distributed datasets while satisfying a variety of privacy and system constraints. Unlike conventional distributed optimization methods, FL needs to address several specific issues (e.g., non-i.i.d. data distributions and differentially private noise), which pose a set of new challenges in the problem formulation, algorithm design, and convergence analysis. In this paper, we systematically review existing FL optimization research, including its assumptions, formulations, methods, and theoretical results. Potential future directions are also discussed.
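
For orientation, the "aggregate objective" mentioned in the abstract is often written in the following form. This is a commonly used formulation given here as an assumption; the symbols $N$, $p_i$, $F_i$, and $\mathcal{D}_i$ are illustrative and need not match the paper's notation.

```latex
\min_{x \in \mathbb{R}^d} \; F(x) = \sum_{i=1}^{N} p_i \, F_i(x),
\qquad
F_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i}\!\left[ f(x; \xi) \right],
\qquad
p_i \ge 0, \quad \sum_{i=1}^{N} p_i = 1 .
```

Here $\mathcal{D}_i$ is the local data distribution of client $i$. The non-i.i.d. setting corresponds to the $\mathcal{D}_i$ differing across clients.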

[LG-9] Representation and Regression Problems in Neural Networks: Relaxation Generalization and Numerics

Link: https://arxiv.org/abs/2412.01619
Authors: Kang Liu, Enrique Zuazua
Keywords: shallow neural networks, neural networks, approximate representation, regression tasks, address three non-convex
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: 33 pages, 5 figures

Click to view abstract

Abstract:In this work, we address three non-convex optimization problems associated with the training of shallow neural networks (NNs) for exact and approximate representation, as well as for regression tasks. Through a mean-field approach, we convexify these problems and, applying a representer theorem, prove the absence of relaxation gaps. We establish generalization bounds for the resulting NN solutions, assessing their predictive performance on test datasets and, analyzing the impact of key hyperparameters on these bounds, propose optimal choices. On the computational side, we examine the discretization of the convexified problems and derive convergence rates. For low-dimensional datasets, these discretized problems are efficiently solvable using the simplex method. For high-dimensional datasets, we propose a sparsification algorithm that, combined with gradient descent for over-parameterized shallow NNs, yields effective solutions to the primal problems.

MSC classes: 68T07, 68T09, 90C06, 90C26

[LG-10] FairML: A Julia Package for Fair Classification

Link: https://arxiv.org/abs/2412.01585
Authors: Jan Pablo Burgard, João Vitor Pamplona
Keywords: Julia package providing, Julia package, machine learning
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: 25 pages, 8 figures

Click to view abstract

Abstract:In this paper, we propose FairML, a Julia package providing a framework for fair classification in machine learning. In this framework, the fair learning process is divided into three stages. Each stage aims to reduce unfairness, such as disparate impact and disparate mistreatment, in the final prediction. For the preprocessing stage, we present a resampling method that addresses unfairness coming from data imbalances. The in-processing phase consists of a classification method. This can be either one coming from the this http URL package, or a user-defined one. For this phase, we incorporate fair ML methods that can handle unfairness to a certain degree through their optimization process. In the post-processing phase, we discuss the choice of the cut-off value for fair prediction. With simulations, we show the performance of the single phases and their combinations.

[LG-11] Multi-objective Deep Learning: Taxonomy and Survey of the State of the Art

Link: https://arxiv.org/abs/2412.01566
Authors: Sebastian Peitz, Sedjro Salomon Hotegni
Keywords: multicriteria hyperparameter tuning, Simultaneously considering multiple, hyperparameter tuning, multiple objectives, objectives in machine
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes:

Click to view abstract

Abstract:Simultaneously considering multiple objectives in machine learning has been a popular approach for several decades, with various benefits for multi-task learning, the consideration of secondary goals such as sparsity, or multicriteria hyperparameter tuning. However, as multi-objective optimization is significantly more costly than single-objective optimization, the recent focus on deep learning architectures poses considerable additional challenges due to the very large number of parameters, strong nonlinearities and stochasticity. This survey covers recent advancements in the area of multi-objective deep learning. We introduce a taxonomy of existing methods, based on the type of training algorithm as well as the decision maker’s needs, before listing recent advancements and successful applications. All three main learning paradigms (supervised learning, reinforcement learning, and unsupervised learning) are covered, and we also address the recently very popular area of generative modeling.

[LG-12] Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

Link: https://arxiv.org/abs/2412.01564
Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu
Keywords: SMILES and SELFIES, field of cheminformatics, application of language, SMILES, SELFIES
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Notes: 17 pages, 6 figures, preprint

Click to view abstract

Abstract:The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok shows consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
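
To make the idea of turning continuous local coordinates into discrete tokens concrete, here is a minimal sketch. It converts a Cartesian offset to spherical coordinates and quantizes each component with uniform bins. Uniform binning is a deliberately simple stand-in for the learned VQ-VAE codebook described in the abstract, and the construction of the local frame that makes the coordinates SE(3)-invariant is not shown. The function names, `max_r`, and `n_bins` are hypothetical.

```python
import numpy as np

def to_spherical(offset):
    """Convert a Cartesian offset (x, y, z) to (r, theta, phi)."""
    x, y, z = offset
    r = float(np.sqrt(x * x + y * y + z * z))
    # Polar angle in [0, pi]; clip guards against rounding slightly outside [-1, 1].
    theta = float(np.arccos(np.clip(z / r, -1.0, 1.0))) if r > 0 else 0.0
    # Azimuth in [-pi, pi].
    phi = float(np.arctan2(y, x))
    return r, theta, phi

def quantize(value, low, high, n_bins):
    """Map a value in [low, high] to an integer bin index in [0, n_bins - 1]."""
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def tokenize_offset(offset, max_r=4.0, n_bins=64):
    """Return three discrete tokens (radius, polar angle, azimuth) for one offset.

    Assumption: uniform bins replace the VQ-VAE used in the paper.
    """
    r, theta, phi = to_spherical(offset)
    return (
        quantize(r, 0.0, max_r, n_bins),
        quantize(theta, 0.0, np.pi, n_bins),
        quantize(phi, -np.pi, np.pi, n_bins),
    )

tokens = tokenize_offset((0.0, 0.0, 1.0))
```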

[LG-13] How Much Can Time-related Features Enhance Time Series Forecasting?

Link: https://arxiv.org/abs/2412.01557
Authors: Chaolv Zeng, Yuan Tian, Guanjie Zheng, Yunjun Gao
Keywords: Recent advancements, time series data, long-term time series, time series forecasting, cross-time and cross-variate
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:Recent advancements in long-term time series forecasting (LTSF) have primarily focused on capturing cross-time and cross-variate (channel) dependencies within historical data. However, a critical aspect often overlooked by many existing methods is the explicit incorporation of time-related features (e.g., season, month, day of the week, hour, minute), which are essential components of time series data. The absence of this explicit time-related encoding limits the ability of current models to capture cyclical or seasonal trends and long-term dependencies, especially with limited historical input. To address this gap, we introduce a simple yet highly efficient module designed to encode time-related features, Time Stamp Forecaster (TimeSter), thereby enhancing the backbone’s forecasting performance. By integrating TimeSter with a linear backbone, our model, TimeLinear, significantly improves the performance of a single linear projector, reducing MSE by an average of 23% on benchmark datasets such as Electricity and Traffic. Notably, TimeLinear achieves these gains while maintaining exceptional computational efficiency, delivering results that are on par with or exceed state-of-the-art models, despite using a fraction of the parameters.
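
As an illustration of what "explicit time-related features" can look like, the sketch below applies a common sine/cosine cyclical encoding to the hour, weekday, and month of a timestamp. This is a generic technique offered as an assumption; the abstract does not specify how TimeSter encodes timestamps, and `encode_timestamp` is a hypothetical helper.

```python
import numpy as np
from datetime import datetime

def encode_timestamp(ts):
    """Encode hour-of-day, day-of-week, and month as sine/cosine pairs.

    The cyclical encoding keeps 23:59 close to 00:00 and December close to
    January, which a raw integer feature would not.
    """
    parts = [
        (ts.hour + ts.minute / 60.0, 24.0),  # hour of day, with minutes
        (float(ts.weekday()), 7.0),          # day of week, Monday = 0
        (float(ts.month - 1), 12.0),         # month, January = 0
    ]
    features = []
    for value, period in parts:
        angle = 2.0 * np.pi * value / period
        features.extend([np.sin(angle), np.cos(angle)])
    return np.array(features)

midnight = encode_timestamp(datetime(2024, 12, 3, 0, 0))
late = encode_timestamp(datetime(2024, 12, 3, 23, 59))
```

Features like these could then be concatenated with, or added to, the input of a forecasting backbone.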

[LG-14] Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training

Link: https://arxiv.org/abs/2412.01523
Authors: Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui
Keywords: maximum supported sequence, supported sequence length, sequence, paramount significance, Extending the context
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity. In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.

[LG-15] ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment

Link: https://arxiv.org/abs/2412.01519
Authors: Tomer Borreda, Daniel Freedman, Or Litany
Keywords: graph transformer architecture, Neural Atoms, transformer architecture, number, hubs
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:We present ReHub, a novel graph transformer architecture that achieves linear complexity through an efficient reassignment technique between nodes and virtual nodes. Graph transformers have become increasingly important in graph learning for their ability to utilize long-range node communication explicitly, addressing limitations such as oversmoothing and oversquashing found in message-passing graph networks. However, their dense attention mechanism scales quadratically with the number of nodes, limiting their applicability to large-scale graphs. ReHub draws inspiration from the airline industry’s hub-and-spoke model, where flights are assigned to optimize operational efficiency. In our approach, graph nodes (spokes) are dynamically reassigned to a fixed number of virtual nodes (hubs) at each model layer. Recent work, Neural Atoms (Li et al., 2024), has demonstrated impressive and consistent improvements over GNN baselines by utilizing such virtual nodes; their findings suggest that the number of hubs strongly influences performance. However, increasing the number of hubs typically raises complexity, requiring a trade-off to maintain linear complexity. Our key insight is that each node only needs to interact with a small subset of hubs to achieve linear complexity, even when the total number of hubs is large. To leverage all hubs without incurring additional computational costs, we propose a simple yet effective adaptive reassignment technique based on hub-hub similarity scores, eliminating the need for expensive node-hub computations. Our experiments on LRGB indicate a consistent improvement in results over the base method, Neural Atoms, while maintaining a linear complexity. Remarkably, our sparse model achieves performance on par with its non-sparse counterpart. Furthermore, ReHub outperforms competitive baselines and consistently ranks among top performers across various benchmarks.

[LG-16] Leverage Domain-invariant assumption for regularization

Link: https://arxiv.org/abs/2412.01476
Authors: RuiZhe Jiang, Haotian Lei
Keywords: Over-parameterized neural networks, Over-parameterized neural, neural networks, notable gap, gap in performance
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Over-parameterized neural networks often exhibit a notable gap in performance between the training and test sets, a phenomenon known as overfitting. To mitigate this, various regularization techniques have been proposed, each tailored to specific tasks and model architectures. In this paper, we offer a novel perspective on overfitting: models tend to learn different representations from distinct i.i.d. datasets. Building on this insight, we introduce Sameloss, an adaptive method that regularizes models by constraining the feature differences across random subsets of the same training set. Due to its minimal prior assumptions, this approach is broadly applicable across different architectures and tasks. Our experiments demonstrate that Sameloss effectively reduces overfitting with low sensitivity to hyperparameters and minimal computational cost. It exhibits particularly strong memory suppression and fosters normal convergence, even when the model is beginning to overfit. Even in the absence of significant overfitting, our method consistently improves accuracy and lowers validation loss.

[LG-17] A Comprehensive Study of Shapley Value in Data Analytics

Link: https://arxiv.org/abs/2412.01460
Authors: Hong Lin, Shixin Wan, Zhongle Xie, Ke Chen, Meihui Zhang, Lidan Shou, Gang Chen
Keywords: cooperative game theory, found numerous applications, game theory, concept from cooperative, cooperative game
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Over the last few years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper provides the first comprehensive survey of SV used throughout the DA workflow, which involves three main steps: data fabric, data exploration, and result reporting. We summarize existing versatile forms of SV used in these steps by a unified definition and clarify the essential functionalities that SV can provide for data scientists. We categorize prior work in this field based on the technical challenges it tackles, which include computation efficiency, approximation error, privacy preservation, and appropriate interpretations. We discuss these challenges and analyze the corresponding solutions. We also implement SVBench, the first open-sourced benchmark for developing SV applications, and conduct experiments on six DA tasks to validate our analysis and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts for applying SV to DA and highlight the directions of future research and engineering.
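
For readers unfamiliar with the Shapley value itself, the following sketch computes it exactly by averaging each player's marginal contribution over all orderings. This is the textbook definition, not code from SVBench, and the utility function `toy_value` is a made-up example. The factorial cost of exact enumeration is what makes the computation-efficiency challenge mentioned in the abstract relevant.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values via enumeration of all player orderings.

    players: list of hashable player identifiers
    value:   function mapping a frozenset of players to a real-valued utility
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_value(coalition):
    # Hypothetical utility: additive scores plus a synergy bonus for a and b.
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    score = sum(base[p] for p in coalition)
    if "a" in coalition and "b" in coalition:
        score += 3.0
    return score

sv = shapley_values(["a", "b", "c"], toy_value)
```

In this toy game the synergy bonus is split equally between `a` and `b`, and the values sum to the utility of the full coalition (the efficiency property).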

[LG-18] Bio-Inspired Adaptive Neurons for Dynamic Weighting in Artificial Neural Networks

Link: https://arxiv.org/abs/2412.01454
Authors: Ashhadul Islam, Abdesselam Bouzerdoum, Samir Brahim Belhaouari
Keywords: networks employ fixed, changing input conditions, employ fixed weights, strength dynamically based, unlike biological neurons
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Traditional neural networks employ fixed weights during inference, limiting their ability to adapt to changing input conditions, unlike biological neurons that adjust signal strength dynamically based on stimuli. This discrepancy between artificial and biological neurons constrains neural network flexibility and adaptability. To bridge this gap, we propose a novel framework for adaptive neural networks, where neuron weights are modeled as functions of the input signal, allowing the network to adjust dynamically in real-time. Importantly, we achieve this within the same traditional architecture of an Artificial Neural Network, maintaining structural familiarity while introducing dynamic adaptability. In our research, we apply Chebyshev polynomials as one of the many possible decomposition methods to achieve this adaptive weighting mechanism, with polynomial coefficients learned during training. Out of the 145 datasets tested, our adaptive Chebyshev neural network demonstrated a marked improvement over an equivalent MLP in approximately 8% of cases, performing strictly better on 121 datasets. In the remaining 24 datasets, the performance of our algorithm matched that of the MLP, highlighting its ability to generalize standard neural network behavior while offering enhanced adaptability. As a generalized form of the MLP, this model seamlessly retains MLP performance where needed while extending its capabilities to achieve superior accuracy across a wide range of complex tasks. These results underscore the potential of adaptive neurons to enhance generalization, flexibility, and robustness in neural networks, particularly in applications with dynamic or non-linear data dependencies.
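
The idea of weights that are functions of the input can be sketched as follows. Each weight is written as a Chebyshev expansion of its own input, so a degree-0 expansion reduces to an ordinary linear layer. This is one plausible reading of the abstract and not the authors' implementation. The layer form, the assumption that inputs are scaled to [-1, 1], and the name `adaptive_layer` are all hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev

def adaptive_layer(x, coeffs, bias):
    """Linear layer whose weights depend on the input through Chebyshev polynomials.

    x:      inputs of shape (n_in,), assumed scaled to [-1, 1]
    coeffs: learnable coefficients of shape (n_out, n_in, degree + 1)
    bias:   shape (n_out,)
    Weight from input j to output i: w_ij(x_j) = sum_k coeffs[i, j, k] * T_k(x_j).
    """
    degree = coeffs.shape[-1] - 1
    basis = chebyshev.chebvander(x, degree)           # (n_in, degree + 1), columns T_0..T_degree
    weights = np.einsum("oik,ik->oi", coeffs, basis)  # input-dependent weight matrix
    return weights @ x + bias

x = np.array([0.5, -0.2])
fixed = np.array([[1.0, 2.0], [3.0, 4.0]])
# With only the T_0 coefficient present, the layer equals a standard linear layer.
out_linear = adaptive_layer(x, fixed[..., None], np.zeros(2))
```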

[LG-19] Task Adaptation of Reinforcement Learning-based NAS Agents through Transfer Learning

Link: https://arxiv.org/abs/2412.01420
Authors: Amber Cassimon, Siegfried Mercelis, Kevin Mets
Keywords: reinforcement learning-based NAS, learning-based NAS agents, incremental improvement, NAS agents, Recently
Categories: Machine Learning (cs.LG)
Notes: 15 Pages, 13 Figures

Click to view abstract

Abstract:Recently, a novel paradigm has been proposed for reinforcement learning-based NAS agents that revolves around the incremental improvement of a given architecture. We assess the abilities of such reinforcement learning agents to transfer between different tasks. We perform our evaluation using the Trans-NASBench-101 benchmark, and consider the efficacy of the transferred agents, as well as how quickly they can be trained. We find that pretraining an agent on one task benefits the final performance of the agent on another task in all but one task. We also show that the training procedure for an agent can be shortened significantly by pretraining it on another task. Our results indicate that these effects occur regardless of the source or target task, although they are more pronounced for some tasks than for others. Our results show that transfer learning can be an effective tool in mitigating the computational cost of the initial training procedure for reinforcement learning-based NAS agents.

[LG-20] Machine Learning Analysis of Anomalous Diffusion

Link: https://arxiv.org/abs/2412.01393
Authors: Wenjie Cai, Yi Hu, Xiang Qu, Hui Zhao, Gongyi Wang, Jing Li, Zihan Huang
Keywords: machine learning, anomalous diffusion, anomalous diffusion analysis, Anomalous Diffusion Challenge, machine learning techniques
Categories: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an)
Notes: 43 pages, 10 figures

Click to view abstract

Abstract:The rapid advancements in machine learning have made its application to anomalous diffusion analysis both essential and inevitable. This review systematically introduces the integration of machine learning techniques for enhanced analysis of anomalous diffusion, focusing on two pivotal aspects: single trajectory characterization via machine learning and representation learning of anomalous diffusion. We extensively compare various machine learning methods, including both classical machine learning and deep learning, used for the inference of diffusion parameters and trajectory segmentation. Additionally, platforms such as the Anomalous Diffusion Challenge that serve as benchmarks for evaluating these methods are highlighted. On the other hand, we outline three primary strategies for representing anomalous diffusion: the combination of predefined features, the feature vector from the penultimate layer of neural network, and the latent representation from the autoencoder, analyzing their applicability across various scenarios. This investigation paves the way for future research, offering valuable perspectives that can further enrich the study of anomalous diffusion and advance the application of artificial intelligence in statistical physics and biophysics.

[LG-21] Harnessing Preference Optimisation in Protein LMs for Hit Maturation in Cell Therapy

Link: https://arxiv.org/abs/2412.01388
Authors: Katarzyna Janocha, Annabel Ling, Alice Godson, Yulia Lampi, Simon Bornschein, Nils Y. Hammerla
Keywords: offer transformative potential, immunotherapy offer transformative, offer transformative, transformative potential, potential for treating
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Cell and immunotherapy offer transformative potential for treating diseases like cancer and autoimmune disorders by modulating the immune system. The development of these therapies is resource-intensive, with the majority of drug candidates failing to progress beyond laboratory testing. While recent advances in machine learning have revolutionised areas such as protein engineering, applications in immunotherapy remain limited due to the scarcity of large-scale, standardised datasets and the complexity of cellular systems. In this work, we address these challenges by leveraging a high-throughput experimental platform to generate data suitable for fine-tuning protein language models. We demonstrate how models fine-tuned using a preference task show surprising correlations to biological assays, and how they can be leveraged for few-shot hit maturation in CARs. This proof-of-concept presents a novel pathway for applying ML to immunotherapy and could generalise to other therapeutic modalities.

[LG-22] A deformation-based framework for learning solution mappings of PDEs defined on varying domains

Link: https://arxiv.org/abs/2412.01379
Authors: Shanshan Xiao, Pengzhan Jin, Yifa Tang
Keywords: learning solution mappings, solution mapping, mapping, varying domains, defined on varying
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:In this work, we establish a deformation-based framework for learning solution mappings of PDEs defined on varying domains. The union of functions defined on varying domains can be identified as a metric space according to the deformation. The solution mapping is then regarded as a continuous metric-to-metric mapping, which can subsequently be represented by another continuous metric-to-Banach mapping using two different strategies, referred to as the D2D framework and the D2E framework, respectively. We point out that such a metric-to-Banach mapping can be learned by neural networks, hence the solution mapping is accordingly learned. With this framework, a rigorous convergence analysis is built for the problem of learning solution mappings of PDEs on varying domains. As the theoretical framework rests on several pivotal assumptions that need to be verified for a given specific problem, we study star domains as a typical example; other situations could be verified similarly. There are three important features of this framework: (1) The domains under consideration are not required to be diffeomorphic, therefore a wide range of regions can be covered by one model provided they are homeomorphic. (2) The deformation mapping need not be continuous, thus it can be flexibly established by combining a primary identity mapping and a local deformation mapping. This capability facilitates the resolution of large systems where only local parts of the geometry undergo change. (3) If a linearity-preserving neural operator such as MIONet is adopted, this framework still preserves the linearity of the surrogate solution mapping on its source term for linear PDEs, thus it can be applied to the hybrid iterative method. We finally present several numerical experiments to validate our theoretical results.

[LG-23] Hierarchical VAE with a Diffusion-based VampPrior

Link: https://arxiv.org/abs/2412.01373
Authors: Anna Kuzina, Jakub M. Tomczak
Keywords: hierarchical variational autoencoders, Diffusion-based Variational Mixture, variable generative models, Deep hierarchical variational, powerful latent variable
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.
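
As background, the original (non-diffusion) VampPrior defines the prior as a mixture of variational posteriors evaluated at learnable pseudo-inputs. The formula below recalls that standard definition. The symbols $K$, $u_k$, and $q_\phi$ follow common usage and are not taken from this abstract, and the diffusion-based variant proposed in the paper is not specified here.

```latex
p_{\lambda}(z) = \frac{1}{K} \sum_{k=1}^{K} q_{\phi}\!\left(z \mid u_k\right),
\qquad
\lambda = \{u_1, \dots, u_K\} \ \text{(learnable pseudo-inputs)} .
```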

[LG-24] Practical Performative Policy Learning with Strategic Agents

Link: https://arxiv.org/abs/2412.01344
Authors: Qianyi Chen, Ying Chen, Bo Li
Keywords: potential outcomes, inducing an endogenous, paper studies, adjust their features, features in response
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Methodology (stat.ME); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:This paper studies the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes, inducing an endogenous distribution shift. There has been growing interest in training machine learning models in strategic environments, including strategic classification and performative prediction. However, existing approaches often rely on restrictive parametric assumptions: micro-level utility models in strategic classification and macro-level data distribution maps in performative prediction, severely limiting scalability and generalizability. We approach this problem as a complex causal inference task, relaxing parametric assumptions on both micro-level agent behavior and macro-level data distribution. Leveraging bounded rationality, we uncover a practical low-dimensional structure in distribution shifts and construct an effective mediator in the causal path from the deployed model to the shifted data. We then propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map. Our algorithm efficiently utilizes batch feedback and limited manipulation patterns. Our approach achieves high sample efficiency compared to methods reliant on bandit feedback or zero-order optimization. We also provide theoretical guarantees for algorithmic convergence. Extensive and challenging experiments on high-dimensional settings demonstrate our method’s practical efficacy.

[LG-25] A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

Link: https://arxiv.org/abs/2412.01335
Authors: Junwei Deng, Weijing Tang, Jiaqi W. Ma
Keywords: data points affect, training data points, individual training data, Influence function, data attribution
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:Influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution – quantifying how individual training data points affect a model’s predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that can be decomposed into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 10^3 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.

[LG-26] Morphological-Symmetry-Equivariant Heterogeneous Graph Neural Network for Robotic Dynamics Learning

Link: https://arxiv.org/abs/2412.01297
Authors: Fengze Xie, Sizhe Wei, Yue Song, Yisong Yue, Lu Gan
Keywords: graph neural network, single graph network, heterogeneous graph neural, integrates robotic kinematic, robotic kinematic structures
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:We present a morphological-symmetry-equivariant heterogeneous graph neural network, namely MS-HGNN, for robotic dynamics learning, that integrates robotic kinematic structures and morphological symmetries into a single graph network. These structural priors are embedded into the learning architecture as constraints, ensuring high generalizability, sample and model efficiency. The proposed MS-HGNN is a versatile and general architecture that is applicable to various multi-body dynamic systems and a wide range of dynamics learning problems. We formally prove the morphological-symmetry-equivariant property of our MS-HGNN and validate its effectiveness across multiple quadruped robot learning problems using both real-world and simulated data. Our code is made publicly available at this https URL.

[LG-27] Towards Robust Interpretable Surrogates for Optimization

Link: https://arxiv.org/abs/2412.01264
Authors: Marc Goerigk, Michael Hartisch, Sebastian Merten
Keywords: intended users, practical implementation, inherently interpretable optimization, interpretable optimization models, important factor
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:An important factor in the practical implementation of optimization models is the acceptance by the intended users. This is influenced among other factors by the interpretability of the solution process. Decision rules that meet this requirement can be generated using the framework for inherently interpretable optimization models. In practice, there is often uncertainty about the parameters of an optimization problem. An established way to deal with this challenge is the concept of robust optimization. The goal of our work is to combine both concepts: to create decision trees as surrogates for the optimization process that are more robust to perturbations and still inherently interpretable. For this purpose we present suitable models based on different variants to model uncertainty, and solution methods. Furthermore, the applicability of heuristic methods to perform this task is evaluated. Both approaches are compared with the existing framework for inherently interpretable optimization models.

[LG-28] Quantum Pointwise Convolution: A Flexible and Scalable Approach for Neural Network Enhancement

Link: https://arxiv.org/abs/2412.01241
Authors: An Ning, Tai-Yue Li, Nan-Yow Chen
Keywords: neural network framework, Quantum Pointwise Convolution, quantum neural network, Pointwise Convolution, incorporates pointwise convolution
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:In this study, we propose a novel architecture, the Quantum Pointwise Convolution, which incorporates pointwise convolution within a quantum neural network framework. Our approach leverages the strengths of pointwise convolution to efficiently integrate information across feature channels while adjusting channel outputs. By using quantum circuits, we map data to a higher-dimensional space, capturing more complex feature relationships. To address the current limitations of quantum machine learning in the Noisy Intermediate-Scale Quantum (NISQ) era, we implement several design optimizations. These include amplitude encoding for data embedding, allowing more information to be processed with fewer qubits, and a weight-sharing mechanism that accelerates quantum pointwise convolution operations, reducing the need to retrain for each input pixel. In our experiments, we applied the quantum pointwise convolution layer to classification tasks on the FashionMNIST and CIFAR10 datasets, where our model demonstrated competitive performance compared to its classical counterpart. Furthermore, these optimizations not only improve the efficiency of the quantum pointwise convolutional layer but also make it more readily deployable in various CNN-based or deep learning models, broadening its potential applications across different architectures.

[LG-29] Variational formulation based on duality to solve partial differential equations: Use of B-splines and machine learning approximants

Link: https://arxiv.org/abs/2412.01232
Authors: N. Sukumar, Amit Acharya
Keywords: partial differential equations, Stokes equations, fluid mechanics, inelastic deformation, deformation in solids
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 43 pages, 19 figures

Abstract:Many partial differential equations (PDEs) such as Navier–Stokes equations in fluid mechanics, inelastic deformation in solids, and transient parabolic and hyperbolic equations do not have an exact, primal variational structure. Recently, a variational principle based on the dual (Lagrange multiplier) field was proposed. The essential idea in this approach is to treat the given PDE as constraints, and to invoke an arbitrarily chosen auxiliary potential with strong convexity properties to be optimized. This leads to requiring a convex dual functional to be minimized subject to Dirichlet boundary conditions on dual variables, with the guarantee that even PDEs that do not possess a variational structure in primal form can be solved via a variational principle. The vanishing of the first variation of the dual functional is, up to Dirichlet boundary conditions on dual fields, the weak form of the primal PDE problem with the dual-to-primal change of variables incorporated. We derive the dual weak form for the linear, one-dimensional, transient convection-diffusion equation. A Galerkin discretization is used to obtain the discrete equations, with the trial and test functions chosen as linear combination of either RePU activation functions (shallow neural network) or B-spline basis functions; the corresponding stiffness matrix is symmetric. For transient problems, a space-time Galerkin implementation is used with tensor-product B-splines as approximating functions. Numerical results are presented for the steady-state and transient convection-diffusion equation, and transient heat conduction. The proposed method delivers sound accuracy for ODEs and PDEs and rates of convergence are established in the L^2 norm and H^1 seminorm for the steady-state convection-diffusion problem.

[LG-30] EsurvFusion: An evidential multimodal survival fusion model based on Gaussian random fuzzy numbers

Link: https://arxiv.org/abs/2412.01215
Authors: Ling Huang, Yucheng Xing, Qika Lin, Su Ruan, Mengling Feng
Keywords: heterogeneous data sources, Multimodal survival, survival, data, Multimodal
Categories: Machine Learning (cs.LG)
Comments: Multimodal survival analysis, Epistemic random fuzzy sets theory, Uncertainty

Abstract:Multimodal survival analysis aims to combine heterogeneous data sources (e.g., clinical, imaging, text, genomics) to improve the prediction quality of survival outcomes. However, this task is particularly challenging due to high heterogeneity and noise across data sources, which vary in structure, distribution, and context. Additionally, the ground truth is often censored (uncertain) due to incomplete follow-up data. In this paper, we propose a novel evidential multimodal survival fusion model, EsurvFusion, designed to combine multimodal data at the decision level through an evidence-based decision fusion layer that jointly addresses both data and model uncertainty while incorporating modality-level reliability. Specifically, EsurvFusion first models unimodal data with newly introduced Gaussian random fuzzy numbers, producing unimodal survival predictions along with corresponding aleatoric and epistemic uncertainties. It then estimates modality-level reliability through a reliability discounting layer to correct the misleading impact of noisy data modalities. Finally, a multimodal evidence-based fusion layer is introduced to combine the discounted predictions to form a unified, interpretable multimodal survival analysis model, revealing each modality’s influence based on the learned reliability coefficients. This is the first work that studies multimodal survival analysis with both uncertainty and reliability. Extensive experiments on four multimodal survival datasets demonstrate the effectiveness of our model in handling high heterogeneity data, establishing new state-of-the-art on several benchmarks.

[LG-31] Siamese Machine Unlearning with Knowledge Vaporization and Concentration

Link: https://arxiv.org/abs/2412.01207
Authors: Songjie Xie, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Keywords: removal of undesired, essential technique, machine unlearning emerges, practical demands, data
Categories: Machine Learning (cs.LG)

Abstract:In response to the practical demands of the "right to be forgotten" and the removal of undesired data, machine unlearning emerges as an essential technique to remove the learned knowledge of a fraction of data points from trained models. However, existing methods suffer from limitations such as insufficient methodological support, high computational complexity, and significant memory demands. In this work, we propose the concepts of knowledge vaporization and concentration to selectively erase learned knowledge from specific data points while maintaining representations for the remaining data. Utilizing Siamese networks, we exemplify the proposed concepts and develop an efficient method for machine unlearning. Our proposed Siamese unlearning method requires neither additional memory overhead nor full access to the remaining dataset. Extensive experiments conducted across multiple unlearning scenarios showcase the superiority of Siamese unlearning over baseline methods, illustrating its ability to effectively remove knowledge from forgetting data, enhance model utility on remaining data, and reduce susceptibility to membership inference attacks.

[LG-32] Divergent Ensemble Networks: Enhancing Uncertainty Estimation with Shared Representations and Independent Branching

Link: https://arxiv.org/abs/2412.01193
Authors: Arnav Kharbanda, Advait Chandorkar
Keywords: improving predictive performance, proven effective, effective in improving, improving predictive, predictive performance
Categories: Machine Learning (cs.LG)

Abstract:Ensemble learning has proven effective in improving predictive performance and estimating uncertainty in neural networks. However, conventional ensemble methods often suffer from redundant parameter usage and computational inefficiencies due to entirely independent network training. To address these challenges, we propose the Divergent Ensemble Network (DEN), a novel architecture that combines shared representation learning with independent branching. DEN employs a shared input layer to capture common features across all branches, followed by divergent, independently trainable layers that form an ensemble. This shared-to-branching structure reduces parameter redundancy while maintaining ensemble diversity, enabling efficient and scalable learning.
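The shared-to-branching structure described in this abstract can be sketched in a few lines. This is a minimal, untrained illustration under stated assumptions, not the authors' implementation: the class name `DivergentEnsembleSketch`, the layer sizes, the `tanh` activation, and the use of the variance across branch outputs as an uncertainty proxy are all hypothetical choices made for the example.

```python
import math
import random


def make_layer(rng, n_in, n_out):
    """Return (weights, bias) for a dense layer with random weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(weights, bias, x):
    """Apply a dense layer to the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class DivergentEnsembleSketch:
    """One shared input layer followed by independent single-output branches."""

    def __init__(self, n_in, n_hidden, n_branches, seed=0):
        rng = random.Random(seed)
        # Shared layer: its parameters are stored once for all branches.
        self.shared = make_layer(rng, n_in, n_hidden)
        # Divergent branches: each has its own parameters (assumed scalar output).
        self.branches = [make_layer(rng, n_hidden, 1) for _ in range(n_branches)]

    def predict(self, x):
        # The shared representation is computed a single time.
        h = [math.tanh(v) for v in linear(*self.shared, x)]
        outputs = [linear(w, b, h)[0] for w, b in self.branches]
        mean = sum(outputs) / len(outputs)
        # Spread across branches, used here as an assumed uncertainty proxy.
        var = sum((o - mean) ** 2 for o in outputs) / len(outputs)
        return mean, var, outputs
```

With `n_branches` fully independent networks the input layer would be duplicated `n_branches` times; here it is stored once, which is the parameter saving the abstract refers to.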

[LG-33] Rectified Flow For Structure Based Drug Design

Link: https://arxiv.org/abs/2412.01174
Authors: Daiheng Zhang, Chengyue Gong, Qiang Liu
Keywords: Deep generative models, achieved tremendous success, structure-based drug design, Deep generative, specific protein pocket
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted to ELLIS ML4Molecules 2024 Workshop

Abstract:Deep generative models have achieved tremendous success in structure-based drug design in recent years, especially for generating 3D ligand molecules that bind to a specific protein pocket. Notably, diffusion models have transformed ligand generation by providing exceptional quality and creativity. However, traditional diffusion models are restricted by their conventional learning objectives, which limit their broader applicability. In this work, we propose a new framework, FlowSBDD, which is based on the rectified flow model and allows us to flexibly incorporate additional losses to optimize specific targets and to introduce additional conditions, either as an extra input condition or by replacing the initial Gaussian distribution. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties without specifically designing the binding site, with up to -8.50 Avg. Vina Dock score and 75.0% Diversity.

[LG-34] Cross-Task Inconsistency Based Active Learning (CTIAL) for Emotion Recognition

Link: https://arxiv.org/abs/2412.01171
Authors: Yifan Xu, Xue Jiang, Dongrui Wu
Keywords: critical component, Emotion recognition, active learning, emotion recognition typically, learning
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: IEEE Trans. on Affective Computing, 2024

Abstract:Emotion recognition is a critical component of affective computing. Training accurate machine learning models for emotion recognition typically requires a large amount of labeled data. Due to the subtlety and complexity of emotions, multiple evaluators are usually needed for each affective sample to obtain its ground-truth label, which is expensive. To reduce the labeling cost, this paper proposes an inconsistency-based active learning approach for cross-task transfer between emotion classification and estimation. Affective norms are utilized as prior knowledge to connect the label spaces of categorical and dimensional emotions. Then, the prediction inconsistency on the two tasks for the unlabeled samples is used to guide sample selection in active learning for the target task. Experiments on within-corpus and cross-corpus transfers demonstrated that cross-task inconsistency could be a very valuable metric in active learning. To our knowledge, this is the first work that utilizes prior knowledge on affective norms and data in a different task to facilitate active learning for a new task, even when the two tasks are from different datasets.

[LG-35] HumekaFL: Automated Detection of Neonatal Asphyxia Using Federated Learning

Link: https://arxiv.org/abs/2412.01167
Authors: Pamely Zantou, Blessed Guda, Bereket Retta, Gladys Inabeza, Carlee Joe-Wong, Assane Gueye
Keywords: Birth Asphyxia, severe condition characterized, severe condition, condition characterized, characterized by insufficient
Categories: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Poster at ACM COMPASS 2024

Abstract:Birth Asphyxia (BA) is a severe condition characterized by an insufficient supply of oxygen to a newborn during delivery. BA is one of the primary causes of neonatal death in the world. Although there has been a decline in neonatal deaths over the past two decades, the developing world, particularly sub-Saharan Africa, continues to experience the highest under-five mortality rates. While evidence-based methods are commonly used to detect BA in African healthcare settings, they can be subject to physician errors or delays in diagnosis, preventing timely interventions. Centralized Machine Learning (ML) methods have demonstrated good performance in early detection of BA but require sensitive health data to leave their premises before training, which does not guarantee privacy and security. Healthcare institutions are therefore reluctant to adopt such solutions in Africa. To address this challenge, we suggest a federated learning (FL)-based software architecture, a distributed learning method that prioritizes privacy and security by design. We have developed a user-friendly and cost-effective mobile application embedding the FL pipeline for early detection of BA. Our Federated SVM model outperformed centralized SVM pipelines and Neural Networks (NN)-based methods in the existing literature.

[LG-36] Graph Community Augmentation with GMM-based Modeling in Latent Space ICDM

Link: https://arxiv.org/abs/2412.01163
Authors: Shintaro Fukushima, Kenji Yamanishi
Keywords: graph community augmentation, community, graph community, graph, community augmentation
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: IEEE Copyright. Accepted to the 24th IEEE International Conference on Data Mining (ICDM). 10 pages

Abstract:This study addresses the issue of graph generation with generative models. In particular, we are concerned with the graph community augmentation problem, which refers to the problem of generating unseen or unfamiliar graphs with a new community out of the probability distribution estimated with a given graph dataset. Graph community augmentation means that the generated graphs have a new community. There is a chance of discovering an unseen but important structure of graphs with a new community, for example, in a social network such as a purchaser network. Graph community augmentation may also be helpful for generalization of data mining models in cases where it is difficult to collect enough real graph data. In fact, there are many ways to generate a new community in an existing graph. It is desirable to discover a new graph with a new community beyond the given graph while keeping the structure of the original graphs to some extent, so that the generated graphs are realistic. To this end, we propose an algorithm called graph community augmentation (GCA). The key ideas of GCA are (i) to fit a Gaussian mixture model (GMM) to data points in the latent space into which the nodes in the original graph are embedded, and (ii) to add data points in the new cluster in the latent space for generating a new community based on the minimum description length (MDL) principle. We empirically demonstrate the effectiveness of GCA for generating graphs with a new community structure on synthetic and real datasets.

[LG-37] SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics

Link: https://arxiv.org/abs/2412.01124
Authors: Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng
Keywords: captures spatial gene, Spatial Transcriptomics, gene expression profiles, Implicit Neural Representations, spatial gene expression
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)

Abstract:Spatial Transcriptomics (ST) is a method that captures spatial gene expression profiles within histological sections. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs), which can improve both the spatial resolution and the gene expression. Concretely, within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative, structure-aware embeddings for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments on a wide range of common ST platforms, SUICA outperforms both conventional INR variants and SOTA methods for ST super-resolution regarding numerical fidelity, statistical correlation, and bio-conservation. The prediction by SUICA also showcases amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code is available at this https URL.

[LG-38] Dense Dynamics-Aware Reward Synthesis: Integrating Prior Experience with Demonstrations

Link: https://arxiv.org/abs/2412.01114
Authors: Cevahir Koprulu, Po-han Li, Tianyu Qiu, Ruihan Zhao, Tyler Westenbroek, David Fridovich-Keil, Sandeep Chinchali, Ufuk Topcu
Keywords: continuous control problems, sparse-reward reinforcement learning, continuous control, control problems, formulated as sparse-reward
Categories: Machine Learning (cs.LG)

Abstract:Many continuous control problems can be formulated as sparse-reward reinforcement learning (RL) tasks. In principle, online RL methods can automatically explore the state space to solve each new task. However, discovering sequences of actions that lead to a non-zero reward becomes exponentially more difficult as the task horizon increases. Manually shaping rewards can accelerate learning for a fixed task, but it is an arduous process that must be repeated for each new environment. We introduce a systematic reward-shaping framework that distills the information contained in 1) a task-agnostic prior data set and 2) a small number of task-specific expert demonstrations, and then uses these priors to synthesize dense dynamics-aware rewards for the given task. This supervision substantially accelerates learning in our experiments, and we provide analysis demonstrating how the approach can effectively guide online learning agents to faraway goals.
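To make the idea of turning a sparse reward into a dense one concrete, the sketch below uses classical potential-based reward shaping on a 1-D chain. This is a swapped-in textbook technique, not the framework proposed in the paper, which derives its dense reward from prior data and expert demonstrations. The names `potential` and `shaped_reward` and the hand-written distance potential are assumptions made for illustration.

```python
def potential(state, goal):
    """Hand-written potential (an assumption): negative distance to the goal."""
    return -abs(goal - state)


def sparse_reward(next_state, goal):
    """Original task reward: non-zero only when the goal is reached."""
    return 1.0 if next_state == goal else 0.0


def shaped_reward(state, next_state, goal, gamma=0.99):
    """Sparse reward plus the shaping term gamma * phi(s') - phi(s)."""
    shaping = gamma * potential(next_state, goal) - potential(state, goal)
    return sparse_reward(next_state, goal) + shaping
```

Far from the goal the sparse reward is zero for every step, whereas the shaped reward is positive for a step towards the goal and negative for a step away from it, so the agent receives a learning signal long before it first reaches the goal.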

[LG-39] Multi-Scale Representation Learning for Protein Fitness Prediction

Link: https://arxiv.org/abs/2412.01108
Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurélie Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang
Keywords: proteins crucially depends, crucially depends, depends on accurately, accurately modeling, Designing novel functional
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies focused solely on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements when compared to the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness (S3F) model, a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Our code is at this https URL.

[LG-40] Personalized Coupled Tensor Decomposition for Multimodal Data Fusion: Uniqueness and Algorithms

Link: https://arxiv.org/abs/2412.01102
Authors: Ricardo Augusto Borsoi, Konstantin Usevich, David Brie, Tülay Adali
Keywords: perform data fusion, data fusion, perform data, Coupled tensor decompositions, linking factors
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Coupled tensor decompositions (CTDs) perform data fusion by linking factors from different datasets. Although many CTDs have already been proposed, current works do not address important challenges of data fusion, where: 1) the datasets are often heterogeneous, constituting different “views” of a given phenomenon (multimodality); and 2) each dataset can contain personalized or dataset-specific information, constituting distinct factors that are not coupled with other datasets. In this work, we introduce a personalized CTD framework tackling these challenges. A flexible model is proposed where each dataset is represented as the sum of two components, one related to a common tensor through a multilinear measurement model, and another specific to each dataset. Both the common and distinct components are assumed to admit a polyadic decomposition. This generalizes several existing CTD models. We provide conditions for specific and generic uniqueness of the decomposition that are easy to interpret. These conditions employ uni-mode uniqueness of different individual datasets and properties of the measurement model. Two algorithms are proposed to compute the common and distinct components: a semi-algebraic one and a coordinate-descent optimization method. Experimental results illustrate the advantage of the proposed framework compared with state-of-the-art approaches.

[LG-41] Gated Parametric Neuron for Spike-based Audio Recognition

Link: https://arxiv.org/abs/2412.01087
Authors: Haoran Wang, Herui Zhang, Siyang Li, Dongrui Wu
Keywords: Spiking neural networks, biologically plausible neurons, Spiking neural, aim to simulate, biologically plausible
Categories: Machine Learning (cs.LG)

Abstract:Spiking neural networks (SNNs) aim to simulate real neural networks in the human brain with biologically plausible neurons. The leaky integrate-and-fire (LIF) neuron is one of the most widely studied SNN architectures. However, it has the vanishing gradient problem when trained with backpropagation. Additionally, its neuronal parameters are often manually specified and fixed, in contrast to the heterogeneity of real neurons in the human brain. This paper proposes a gated parametric neuron (GPN) to process spatio-temporal information effectively with the gating mechanism. Compared with the LIF neuron, the GPN has two distinguishing advantages: 1) it copes well with the vanishing gradients by improving the flow of gradient propagation; and, 2) it learns spatio-temporal heterogeneous neuronal parameters automatically. Additionally, we use the same gate structure to eliminate initial neuronal parameter selection and design a hybrid recurrent neural network-SNN structure. Experiments on two spike-based audio datasets demonstrated that the GPN network outperformed several state-of-the-art SNNs, could mitigate vanishing gradients, and had spatio-temporal heterogeneous parameters. Our work shows the ability of SNNs to handle long-term dependencies and achieve high performance simultaneously.
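For readers unfamiliar with the LIF baseline that this abstract compares against, the following is a minimal sketch of a discrete-time leaky integrate-and-fire update with a hard reset. It does not implement the proposed GPN. The function name `lif_simulate` and the default values of `decay`, `threshold`, and `v_reset` are illustrative assumptions.

```python
def lif_simulate(inputs, decay=0.9, threshold=1.0, v_reset=0.0):
    """Simulate one discrete-time leaky integrate-and-fire neuron.

    decay, threshold and v_reset are fixed by hand here, which is the kind of
    manually specified neuronal parameter the abstract contrasts with learned,
    heterogeneous parameters.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leak the previous membrane potential, then integrate the input.
        v = decay * v + x
        if v >= threshold:
            # Fire and hard-reset the membrane potential.
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return spikes
```

In an SNN trained with backpropagation through time, the gradient flows through the repeated `decay * v` recurrence, which is the usual source of the vanishing-gradient issue mentioned in the abstract.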

[LG-42] Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces

Link: https://arxiv.org/abs/2412.01079
Authors: Tianwang Jia, Lubin Meng, Siyang Li, Jiajing Liu, Dongrui Wu
Keywords: EEG-based brain-computer interface, brain-computer interface, critical consideration, requires EEG data, accurate classifier
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)

Abstract:Training an accurate classifier for EEG-based brain-computer interface (BCI) requires EEG data from a large number of users, whereas protecting their data privacy is a critical consideration. Federated learning (FL) is a promising solution to this challenge. This paper proposes Federated classification with local Batch-specific batch normalization and Sharpness-aware minimization (FedBS) for privacy protection in EEG-based motor imagery (MI) classification. FedBS utilizes local batch-specific batch normalization to reduce data discrepancies among different clients, and sharpness-aware minimization optimizer in local training to improve model generalization. Experiments on three public MI datasets using three popular deep learning models demonstrated that FedBS outperformed six state-of-the-art FL approaches. Remarkably, it also outperformed centralized training, which does not consider privacy protection at all. In summary, FedBS protects user EEG data privacy, enabling multiple BCI users to participate in large-scale machine learning model training, which in turn improves the BCI decoding accuracy.

[LG-43] A Memory-Based Reinforcement Learning Approach to Integrated Sensing and Communication

Link: https://arxiv.org/abs/2412.01077
Authors: Homa Nikbakht, Michèle Wigger, Shlomo Shamai (Shitz), H. Vincent Poor
Keywords: transmitter conveys, conveys a message, estimates the state, backscattered signals, ISAC
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:In this paper, we consider a point-to-point integrated sensing and communication (ISAC) system, where a transmitter conveys a message to a receiver over a channel with memory and simultaneously estimates the state of the channel through the backscattered signals from the emitted waveform. Using Massey’s concept of directed information for channels with memory, we formulate the capacity-distortion tradeoff for the ISAC problem when sensing is performed in an online fashion. Optimizing the transmit waveform for this system to simultaneously achieve good communication and sensing performance is a complicated task, and thus we propose a deep reinforcement learning (RL) approach to find a solution. The proposed approach enables the agent to optimize the ISAC performance by learning a reward that reflects the difference between the communication gain and the sensing loss. Since the state space in our RL model is a priori unbounded, we employ the deep deterministic policy gradient (DDPG) algorithm. Our numerical results suggest a significant performance improvement when one considers an unbounded state space as opposed to a simpler RL problem with a reduced state space. In the extreme case of a degenerate state space, only memoryless signaling strategies are possible. Our results thus emphasize the necessity of fully exploiting the memory inherent in ISAC systems.

[LG-44] MuSiCNet: A Gradual Coarse-to-Fine Framework for Irregularly Sampled Multivariate Time Series Analysis IJCAI2024

Link: https://arxiv.org/abs/2412.01063
Authors: Jiexi Liu, Meng Cao, Songcan Chen
Keywords: sampled multivariate time, Irregularly sampled multivariate, sampled time series, multivariate time series, prevalent in reality
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: IJCAI2024 AI4TS workshop best paper runner-up

Abstract:Irregularly sampled multivariate time series (ISMTS) are prevalent in reality. Most existing methods treat ISMTS as synchronized regularly sampled time series with missing values, neglecting that the irregularities are primarily attributed to variations in sampling rates. In this paper, we introduce a novel perspective: irregularity is essentially relative in some senses. With sampling rates artificially determined from low to high, an irregularly sampled time series can be transformed into a hierarchical set of relatively regular time series from coarse to fine. We observe that additional coarse-grained, relatively regular series not only mitigate the challenges of irregular sampling to some extent but also incorporate broad-view temporal information, thereby serving as a valuable asset for representation learning. Therefore, following the learning philosophy of "seeing the big picture first, then delving into the details," we present the Multi-Scale and Multi-Correlation Attention Network (MuSiCNet), which combines multiple scales to iteratively refine the ISMTS representation. Specifically, within each scale, we explore time attention and frequency correlation matrices to aggregate intra- and inter-series information, naturally enhancing the representation quality with richer and more intrinsic details. Across adjacent scales, we employ a representation rectification method containing contrastive learning and reconstruction results adjustment to further improve representation consistency. MuSiCNet is an ISMTS analysis framework that is consistently competitive with SOTA on three mainstream tasks: classification, interpolation, and forecasting.
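The coarse-to-fine view in this abstract can be illustrated with a small resampling sketch: the same irregular observations are bucketed at several bin widths, and wider bins give a more regular (less sparse) series. This is only an illustration of the perspective, not the MuSiCNet architecture. The function names, the per-bin mean aggregation, and the use of `None` for empty bins are assumptions.

```python
import math


def resample(observations, bin_width, t_end):
    """Average irregular (time, value) pairs into regular bins of bin_width.

    Assumes 0 <= time < t_end. Empty bins are reported as None.
    """
    n_bins = math.ceil(t_end / bin_width)
    buckets = [[] for _ in range(n_bins)]
    for t, value in observations:
        buckets[int(t // bin_width)].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]


def coarse_to_fine(observations, bin_widths, t_end):
    """Return one resampled series per bin width, ordered coarse to fine."""
    ordered = sorted(bin_widths, reverse=True)
    return [resample(observations, w, t_end) for w in ordered]
```

For example, with observations at times 0.2, 0.9, 3.5 and 7.1 over an 8-unit window, a bin width of 4 yields a series with no missing entries, while a bin width of 1 leaves most bins empty; the coarse series is "relatively regular" in the sense the abstract describes.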

[LG-45] Research on Optimizing Real-Time Data Processing in High-Frequency Trading Algorithms using Machine Learning

Link: https://arxiv.org/abs/2412.01062
Authors: Yuxin Fan, Zhuohuan Hu, Lei Fu, Yu Cheng, Liyang Wang, Yuxiang Wang
Keywords: intensely competitive domain, represents a pivotal, pivotal and intensely, intensely competitive, competitive domain
Categories: Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract:High-frequency trading (HFT) represents a pivotal and intensely competitive domain within the financial markets. The velocity and accuracy of data processing exert a direct influence on profitability, underscoring the significance of this field. The objective of this work is to optimise the real-time processing of data in high-frequency trading algorithms. The dynamic feature selection mechanism is responsible for monitoring and analysing market data in real time through clustering and feature weight analysis, with the objective of automatically selecting the most relevant features. This process employs an adaptive feature extraction method, which enables the system to respond and adjust its feature set in a timely manner when the data input changes, thus ensuring the efficient utilisation of data. The lightweight neural networks are designed in a modular fashion, comprising fast convolutional layers and pruning techniques that facilitate the expeditious completion of data processing and output prediction. In contrast to conventional deep learning models, the neural network architecture has been specifically designed to minimise the number of parameters and computational complexity, thereby markedly reducing the inference time. The experimental results demonstrate that the model is capable of maintaining consistent performance in the context of varying market conditions, thereby illustrating its advantages in terms of processing speed and revenue enhancement.

[LG-46] Embedded Machine Learning for Solar PV Power Regulation in a Remote Microgrid

Link: https://arxiv.org/abs/2412.01054
Authors: Yongli Zhu,Linna Xu,Jian Huang
Keywords: solar inverter power, inverter power regulation, remote microgrid, paper presents, presents a machine-learning
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: This paper has been accepted by and presented at IEEE ICPEA 2024, Taiyuan, China

Click to view abstract

Abstract:This paper presents a machine-learning study for solar inverter power regulation in a remote microgrid. Machine learning models for active and reactive power control are each trained using an ensemble learning method. Then, unlike conventional schemes that run inference on a central server in a far-end control center, the proposed scheme deploys the trained models on an embedded edge-computing device near the inverter to reduce the communication delay. Experiments on a real embedded device achieve results matching those on a desktop PC, at a time cost of about 0.1 ms per inference input.

[LG-47] TruncFormer: Private LLM Inference Using Only Truncations

Link: https://arxiv.org/abs/2412.01042
Authors: Patrick Yubeaton,Jianqiao Cambridge Mo,Karthik Garimella,Nandan Kumar Jha,Brandon Reagen,Chinmay Hegde,Siddharth Garg
Keywords: proprietary machine learning, machine learning models, Private inference, serves an important, important role
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Private inference (PI) serves an important role in guaranteeing the privacy of user data when interfacing with proprietary machine learning models such as LLMs. However, PI remains practically intractable due to the massive latency costs associated with nonlinear functions present in LLMs. Existing works have focused on improving latency of specific LLM nonlinearities (such as the Softmax, or the GeLU) via approximations. However, new types of nonlinearities are regularly introduced with new LLM architectures, and this has led to a constant game of catch-up where PI researchers attempt to optimize the newest nonlinear function. We introduce TruncFormer, a framework for taking any LLM and transforming it into a plaintext emulation of PI. Our framework leverages the fact that nonlinearities in LLMs are differentiable and can be accurately approximated with a sequence of additions, multiplications, and truncations. Further, we decouple the add/multiply and truncation operations, and statically determine where truncations should be inserted based on a given field size and input representation size. This leads to latency improvements over existing cryptographic protocols that enforce truncation after every multiplication operation. We open source our code for community use.
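To make the "decouple add/multiply from truncation" idea concrete, here is a minimal plaintext sketch of fixed-point arithmetic. It is not the TruncFormer implementation: the number of fractional bits, the helper names, and the choice of a dot product as the example are all assumptions made for illustration. It contrasts truncating after every multiplication with accumulating first and truncating once.

```python
FRAC_BITS = 8                 # hypothetical number of fractional bits
SCALE = 1 << FRAC_BITS

def encode(x):
    """Map a real number to a fixed-point integer."""
    return round(x * SCALE)

def decode(q):
    """Map a fixed-point integer back to a float."""
    return q / SCALE

def truncate(q, bits=FRAC_BITS):
    """Drop the extra fractional bits that a multiplication introduces."""
    return q >> bits          # arithmetic shift, rounds toward -infinity

def dot_truncate_each(xs, ws):
    """One truncation per multiplication: len(xs) truncations in total."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += truncate(x * w)
    return acc

def dot_truncate_once(xs, ws):
    """Accumulate at double scale, then truncate a single time.

    This only works if the accumulator never overflows the underlying
    field, which is why the placement has to be decided from the field
    size and the input representation size.
    """
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w
    return truncate(acc)

x = [encode(v) for v in (0.5, -1.25, 2.0)]
w = [encode(v) for v in (1.5, 0.75, -0.5)]
each = decode(dot_truncate_each(x, w))
once = decode(dot_truncate_once(x, w))
```

In a real private-inference protocol each truncation is an expensive interactive step, so replacing three truncations with one, as `dot_truncate_once` does here, is where the latency saving would come from.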

[LG-48] Adaptive Traffic Element-Based Streetlight Control Using Neighbor Discovery Algorithm Based on IoT Events

Link: https://arxiv.org/abs/2412.01035
Authors: Yupeng Tan,Sheng Xu,Chengyue Su
Keywords: Intelligent streetlight systems, reducing energy waste, streetlight systems divide, neighbor relationships, Intelligent streetlight
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Intelligent streetlight systems divide the streetlight network into multiple sectors, activating only the streetlights in the corresponding sectors when traffic elements pass by, rather than all streetlights, effectively reducing energy waste. This strategy requires streetlights to understand their neighbor relationships to illuminate only the streetlights in their respective sectors. However, manually configuring the neighbor relationships for a large number of streetlights in complex large-scale road streetlight networks is cumbersome and prone to errors. Due to the crisscrossing nature of roads, it is also difficult to determine the neighbor relationships using GPS or communication positioning. In response to these issues, this article proposes a systematic approach to model the streetlight network as a social network and construct a neighbor relationship probabilistic graph using IoT event records of streetlights detecting traffic elements. Based on this, a multi-objective genetic algorithm based probabilistic graph clustering method is designed to discover the neighbor relationships of streetlights. Considering the characteristic that pedestrians and vehicles usually move at a constant speed on a section of a road, speed consistency is introduced as an optimization objective, which, together with traditional similarity measures, forms a multi-objective function, enhancing the accuracy of neighbor relationship discovery. Extensive experiments on simulation datasets were conducted, comparing the proposed algorithm with other probabilistic graph clustering algorithms. The results demonstrate that the proposed algorithm can more accurately identify the neighbor relationships of streetlights compared to other algorithms, effectively achieving adaptive streetlight control for traffic elements.

[LG-49] Jacobian-Enforced Neural Networks (JENN) for Improved Data Assimilation Consistency in Dynamical Models

Link: https://arxiv.org/abs/2412.01013
Authors: Xiaoxu Tian
Keywords: numerical weather prediction, traditional numerical weather, unlike traditional numerical, shown great promise, Machine learning-based weather
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:Machine learning-based weather models have shown great promise in producing accurate forecasts but have struggled when applied to data assimilation tasks, unlike traditional numerical weather prediction (NWP) models. This study introduces the Jacobian-Enforced Neural Network (JENN) framework, designed to enhance DA consistency in neural network (NN)-emulated dynamical systems. Using the Lorenz 96 model as an example, the approach demonstrates improved applicability of NNs in DA through explicit enforcement of Jacobian relationships. The NN architecture includes an input layer of 40 neurons, two hidden layers with 256 units each employing hyperbolic tangent activation functions, and an output layer of 40 neurons without activation. The JENN framework employs a two-step training process: an initial phase using standard prediction-label pairs to establish baseline forecast capability, followed by a secondary phase incorporating a customized loss function to enforce accurate Jacobian relationships. This loss function combines root mean square error (RMSE) between predicted and true state values with additional RMSE terms for tangent linear (TL) and adjoint (AD) emulation results, weighted to balance forecast accuracy and Jacobian sensitivity. To ensure consistency, the secondary training phase uses additional pairs of TL/AD inputs and labels calculated from the physical models. Notably, this approach does not require starting from scratch or structural modifications to the NN, making it readily applicable to pretrained models such as GraphCast, NeuralGCM, Pangu, or FuXi, facilitating their adaptation for DA tasks with minimal reconfiguration. Experimental results demonstrate that the JENN framework preserves nonlinear forecast performance while significantly reducing noise in the TL and AD components, as well as in the overall Jacobian matrix. 
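The reference quantities mentioned in this abstract (tangent-linear and adjoint outputs of the physical model) and the shape of the combined loss can be sketched in a few lines. The sketch below differentiates the Lorenz 96 tendency function rather than a time-stepped forecast, and the weights `w_tl` and `w_ad` are hypothetical; it is an illustration of the ingredients, not the paper's training code.

```python
from math import sin, cos, sqrt

def l96_tendency(x, forcing=8.0):
    """Lorenz 96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F."""
    n = len(x)
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]

def l96_tangent_linear(x, dx):
    """Jacobian-vector product J(x) dx of the tendency."""
    n = len(x)
    return [(dx[(i + 1) % n] - dx[i - 2]) * x[i - 1]
            + (x[(i + 1) % n] - x[i - 2]) * dx[i - 1]
            - dx[i] for i in range(n)]

def l96_adjoint(x, y):
    """Transposed Jacobian-vector product J(x)^T y of the tendency."""
    n = len(x)
    out = [0.0] * n
    for i in range(n):
        out[(i + 1) % n] += x[i - 1] * y[i]
        out[(i - 2) % n] -= x[i - 1] * y[i]
        out[(i - 1) % n] += (x[(i + 1) % n] - x[i - 2]) * y[i]
        out[i] -= y[i]
    return out

def rmse(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

def combined_loss(state, tl, ad, w_tl=1.0, w_ad=1.0):
    """Each argument is a (predicted, reference) pair of equal-length lists."""
    return rmse(*state) + w_tl * rmse(*tl) + w_ad * rmse(*ad)

n = 40
x = [sin(0.3 * i) for i in range(n)]
dx = [cos(0.7 * i) for i in range(n)]
y = [sin(1.1 * i + 0.5) for i in range(n)]

# Dot-product test: <J dx, y> must equal <dx, J^T y>.
lhs = sum(a * b for a, b in zip(l96_tangent_linear(x, dx), y))
rhs = sum(a * b for a, b in zip(dx, l96_adjoint(x, y)))
```

The dot-product test at the end is the usual sanity check that a tangent-linear and adjoint pair are consistent; a network trained with such a loss would be compared against these references in the same way.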

[LG-50] e-Fold Cross-Validation for Recommender-System Evaluation

Link: https://arxiv.org/abs/2412.01011
Authors: Moritz Baumgart,Lukas Wegmeth,Tobias Vente,Joeran Beel
Keywords: cross validation, e-fold cross validation, rising energy consumption, cross, e-fold cross
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in [TBA], and is available online at [TBA]

Click to view abstract

Abstract:To combat the rising energy consumption of recommender systems, we implement a novel alternative to k-fold cross validation. This alternative, named e-fold cross validation, aims to minimize the number of folds to achieve a reduction in power usage while keeping the reliability and robustness of the test results high. We tested our method on 5 recommender system algorithms across 6 datasets and compared it with 10-fold cross validation. On average, e-fold cross validation needed only 41.5% of the energy that 10-fold cross validation would need, while its results differed by only 1.81%. We conclude that e-fold cross validation is a promising approach that has the potential to be an energy-efficient but still reliable alternative to k-fold cross validation.
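The abstract does not state how the number of folds is chosen, so the following is only a sketch of the general idea of stopping cross-validation early. It assumes a stopping rule based on the standard error of the fold scores; the function name, `min_folds`, and `tolerance` are hypothetical and may differ from the authors' criterion.

```python
from math import sqrt
from statistics import mean, stdev

def early_stopping_cv(fold_scores, min_folds=3, tolerance=0.01):
    """Consume per-fold scores lazily and stop once the mean looks stable.

    fold_scores: iterable yielding one evaluation score per fold.
    Returns (mean_score, folds_used).
    """
    scores = []
    for score in fold_scores:
        scores.append(score)
        if len(scores) >= min_folds:
            std_err = stdev(scores) / sqrt(len(scores))
            if std_err <= tolerance:
                break
    return mean(scores), len(scores)

evaluated = []

def run_folds(scores):
    """Stand-in for training and testing a recommender on each fold."""
    for k, s in enumerate(scores):
        evaluated.append(k)   # records which folds were actually run
        yield s

scores_10_fold = [0.80, 0.81, 0.80, 0.79, 0.82, 0.80, 0.81, 0.80, 0.79, 0.81]
estimate, used = early_stopping_cv(run_folds(scores_10_fold))
```

Because the folds are produced by a generator, the folds after the stopping point are never trained or evaluated, which is where an energy saving would come from.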

[LG-51] Provable Partially Observable Reinforcement Learning with Privileged Information NEURIPS2024

Link: https://arxiv.org/abs/2412.00985
Authors: Yang Cai,Xiangyu Liu,Argyris Oikonomou,Kaiqing Zhang
Keywords: generally presents significant, presents significant challenges, Partial observability, underlying states generally
Subjects: Machine Learning (cs.LG)
Comments: This paper has been accepted to 2024 Conference on Neural Information Processing Systems (NeurIPS 2024)

Click to view abstract

Abstract:Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., access to states from simulators, has been exploited in training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another useful empirical paradigm, asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief-weighted asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, in which one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.

[LG-52] TGTOD: A Global Temporal Graph Transformer for Outlier Detection at Scale

Link: https://arxiv.org/abs/2412.00984
Authors: Kay Liu,Jiahao Ding,MohamadAli Torkamani,Philip S. Yu
Keywords: restricted receptive fields, suboptimal generalization capability, revolutionized machine learning, graphs face limitations, temporal graph Transformers
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Preprint. Under review. Code available at this https URL

Click to view abstract

Abstract:While Transformers have revolutionized machine learning on various data, existing Transformers for temporal graphs face limitations in (1) restricted receptive fields, (2) overhead of subgraph extraction, and (3) suboptimal generalization capability beyond link prediction. In this paper, we rethink temporal graph Transformers and propose TGTOD, a novel end-to-end Temporal Graph Transformer for Outlier Detection. TGTOD employs global attention to model both structural and temporal dependencies within temporal graphs. To tackle scalability, our approach divides large temporal graphs into spatiotemporal patches, which are then processed by a hierarchical Transformer architecture comprising Patch Transformer, Cluster Transformer, and Temporal Transformer. We evaluate TGTOD on three public datasets under two settings, comparing with a wide range of baselines. Our experimental results demonstrate the effectiveness of TGTOD, achieving AP improvement of 61% on Elliptic. Furthermore, our efficiency evaluation shows that TGTOD reduces training time by 44x compared to existing Transformers for temporal graphs. To foster reproducibility, we make our implementation publicly available at this https URL.

[LG-53] Incentivizing Truthful Collaboration in Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2412.00980
Authors: Dimitar Chakarov,Nikita Tsoy,Kristian Minchev,Nikola Konstantinov
Keywords: Federated Learning, well-known that Federated, vulnerable to manipulated, Learning, Federated
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: 16 pages, 0 figures

Click to view abstract

Abstract:It is well known that Federated Learning (FL) is vulnerable to manipulated updates from clients. In this work, we study the impact of data heterogeneity on clients’ incentives to manipulate their updates. We formulate a game in which clients may upscale their gradient updates in order to “steer” the server model to their advantage. We develop a payment rule that disincentivizes sending large gradient updates and steers the clients towards truthfully reporting their gradients. We also derive explicit bounds on the clients’ payments and the convergence rate of the global model, which allows us to study the trade-off between heterogeneity, payments and convergence.

[LG-54] Hierarchical Prompt Decision Transformer: Improving Few-Shot Policy Generalization with Global and Adaptive

Link: https://arxiv.org/abs/2412.00979
Authors: Zhe Wang,Haozhu Wang,Yanjun Qi
Keywords: sequence generation problem, recast reinforcement learning, conditional sequence generation, transformers recast reinforcement, Decision transformers recast
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Decision transformers recast reinforcement learning as a conditional sequence generation problem, offering a simple but effective alternative to traditional value or policy-based methods. A recent key development in this area is the integration of prompting in decision transformers to facilitate few-shot policy generalization. However, current methods mainly use static prompt segments to guide rollouts, limiting their ability to provide context-specific guidance. Addressing this, we introduce a hierarchical prompting approach enabled by retrieval augmentation. Our method learns two layers of soft tokens as guiding prompts: (1) global tokens encapsulating task-level information about trajectories, and (2) adaptive tokens that deliver focused, timestep-specific instructions. The adaptive tokens are dynamically retrieved from a curated set of demonstration segments, ensuring context-aware guidance. Experiments across seven benchmark tasks in the MuJoCo and MetaWorld environments demonstrate the proposed approach consistently outperforms all baseline methods, suggesting that hierarchical prompting for decision transformers is an effective strategy to enable few-shot policy generalization.

[LG-55] Optimal Algorithms for Augmented Testing of Discrete Distributions NEURIPS24

Link: https://arxiv.org/abs/2412.00974
Authors: Maryam Aliakbarpour,Piotr Indyk,Ronitt Rubinfeld,Sandeep Silwal
Keywords: discrete distributions, testing, sample complexity, hypothesis testing, sample
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: To appear in NeurIPS 24

Click to view abstract

Abstract:We consider the problem of hypothesis testing for discrete distributions. In the standard model, where we have sample access to an underlying distribution p, extensive research has established optimal bounds for uniformity testing, identity testing (goodness of fit), and closeness testing (equivalence or two-sample testing). We explore these problems in a setting where a predicted data distribution, possibly derived from historical data or predictive machine learning models, is available. We demonstrate that such a predictor can indeed reduce the number of samples required for all three property testing tasks. The reduction in sample complexity depends directly on the predictor’s quality, measured by its total variation distance from p. A key advantage of our algorithms is their adaptability to the precision of the prediction. Specifically, our algorithms can self-adjust their sample complexity based on the accuracy of the available prediction, operating without any prior knowledge of the estimation’s accuracy (i.e., they are consistent). Additionally, we never use more samples than the standard approaches require, even if the predictions provide no meaningful information (i.e., they are also robust). We provide lower bounds to indicate that the improvements in sample complexity achieved by our algorithms are information-theoretically optimal. Furthermore, experimental results show that the performance of our algorithms on real data significantly exceeds our worst-case guarantees for sample complexity, demonstrating the practicality of our approach.

[LG-56] Calibration through the Lens of Interpretability

Link: https://arxiv.org/abs/2412.00943
Authors: Alireza Torabian,Ruth Urner
Keywords: frequently invoked concept, label probability estimates, frequently invoked, invoked concept, required on top
Subjects: Machine Learning (cs.LG)
Comments: Published in XAI 2024

Click to view abstract

Abstract:Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy. A calibrated model is a function whose values correctly reflect underlying label probabilities. Calibration in itself however does not imply classification accuracy, nor human interpretable estimates, nor is it straightforward to verify calibration from finite data. There is a plethora of evaluation metrics (and loss functions) that each assess a specific aspect of a calibration model. In this work, we initiate an axiomatic study of the notion of calibration. We catalogue desirable properties of calibrated models as well as corresponding evaluation metrics and analyze their feasibility and correspondences. We complement this analysis with an empirical evaluation, comparing common calibration methods to employing a simple, interpretable decision tree.
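As one concrete instance of the evaluation metrics this abstract refers to, the sketch below computes a binned expected calibration error (ECE). The abstract does not name ECE, so treating it as a representative metric is an assumption, as are the function name and the equal-width binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: predicted probabilities of the predicted label, in [0, 1].
    correct: matching sequence of truthy/falsy correctness flags.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # c == 1.0 goes to the last bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Over-confident model: says 0.95 but is right 3 times out of 4.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0, 1, 0])

# Calibrated on this sample: says 0.75 and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

Note that the second model is perfectly calibrated on this sample yet only 75% accurate, which matches the abstract's point that calibration by itself does not imply classification accuracy.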

[LG-57] Leveraging Intermediate Neural Collapse with Simplex ETFs for Efficient Deep Neural Networks

Link: https://arxiv.org/abs/2412.00884
Authors: Emily Liu
Keywords: equiangular tight frame, linear classifier weights, maximizes mutual distance, simplex equiangular tight, simplex ETF
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Neural collapse is a phenomenon observed during the terminal phase of neural network training, characterized by the convergence of network activations, class means, and linear classifier weights to a simplex equiangular tight frame (ETF), a configuration of vectors that maximizes mutual distance within a subspace. This phenomenon has been linked to improved interpretability, robustness, and generalization in neural networks. However, its potential to guide neural network training and regularization remains underexplored. Previous research has demonstrated that constraining the final layer of a neural network to a simplex ETF can reduce the number of trainable parameters without sacrificing model accuracy. Furthermore, deep fully connected networks exhibit neural collapse not only in the final layer but across all layers beyond a specific effective depth. Using these insights, we propose two novel training approaches: Adaptive-ETF, a generalized framework that enforces simplex ETF constraints on all layers beyond the effective depth, and ETF-Transformer, which applies simplex ETF constraints to the feedforward layers within transformer blocks. We show that these approaches achieve training and testing performance comparable to those of their baseline counterparts while significantly reducing the number of learnable parameters.
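For readers unfamiliar with the simplex ETF, a small sketch of its standard construction may help. It builds the canonical K-vector frame sqrt(K/(K-1)) (I - (1/K) 1 1^T) without the optional rotation into a higher-dimensional feature space; the function names are illustrative only.

```python
from math import sqrt

def simplex_etf(k):
    """Return k unit vectors in R^k with pairwise inner product -1/(k-1)."""
    scale = sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

frame = simplex_etf(5)
norms = [dot(v, v) for v in frame]
cross = [dot(frame[i], frame[j]) for i in range(5) for j in range(i + 1, 5)]
```

Every vector has unit norm and every pair has the same inner product, -1/(K-1), which is the most negative common value that K unit vectors can share. Fixing a layer's weights to such a frame is what removes them from the set of trainable parameters.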

[LG-58] Combinatorial Rising Bandit

Link: https://arxiv.org/abs/2412.00798
Authors: Seockbean Song,Youngsik Yoon,Siwei Wang,Wei Chen,Jungseul Ok
Keywords: systems providing uncertain, regret lower bound, providing uncertain rewards, combinatorial rising bandit, lower bound
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Combinatorial online learning is a fundamental task of deciding the optimal combination of base arms in sequential interactions with systems providing uncertain rewards, which is applicable to diverse domains such as robotics, social advertising, network routing and recommendation systems. In real-world scenarios, we often observe rising rewards, where the selection of a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots enhancing proficiency through practice and social influence strengthening through a history of successful recommendations. To address this, we introduce the combinatorial rising bandit problem to minimize policy regret and propose a provably efficient algorithm, called Combinatorial Rising Upper Confidence Bound (CRUCB), whose regret upper bound is close to a regret lower bound. To the best of our knowledge, previous studies do not provide a sub-linear regret lower bound, making it impossible to assess the efficiency of their algorithms. In contrast, we provide a sub-linear regret lower bound for the combinatorial rising bandit and show that CRUCB is provably efficient by showing that its regret upper bound is close to this lower bound. In addition, we empirically demonstrate the effectiveness and superiority of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.

[LG-59] Online Poisoning Attack Against Reinforcement Learning under Black-box Environments

Link: https://arxiv.org/abs/2412.00797
Authors: Jianhui Li,Bokang Zhang,Junfeng Wu
Keywords: adversary deliberately manipulates, deliberately manipulates training, manipulates training data, learning agents operating, reinforcement learning agents
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Click to view abstract

Abstract:This paper proposes an online environment poisoning algorithm tailored for reinforcement learning agents operating in a black-box setting, where an adversary deliberately manipulates training data to lead the agent toward a mischievous policy. In contrast to prior studies that primarily investigate white-box settings, we focus on a scenario in which the environment dynamics are unknown to the attacker and the targeted agent may employ a flexible reinforcement learning algorithm. We first propose an attack scheme that is capable of poisoning the reward functions and state transitions. The poisoning task is formalized as a constrained optimization problem, following the framework of Ma et al. (2019). Given that the transition probabilities are unknown to the attacker in a black-box environment, we apply a stochastic gradient descent algorithm, where the exact gradients are approximated using sample-based estimates. A penalty-based method along with a bilevel reformulation is then employed to transform the problem into an unconstrained counterpart and to circumvent the double-sampling issue. The algorithm’s effectiveness is validated in a maze environment.

[LG-60] Proper Latent Decomposition

Link: https://arxiv.org/abs/2412.00785
Authors: Daniel Kelshaw,Luca Magri
Keywords: proper orthogonal decomposition, proper latent decomposition, orthogonal decomposition, introduce the proper, proper orthogonal
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: Appears in proceedings of the Stanford Center for Turbulence Research Summer Program, 2024

Click to view abstract

Abstract:In this paper, we introduce the proper latent decomposition (PLD) as a generalization of the proper orthogonal decomposition (POD) on manifolds. PLD is a nonlinear reduced-order modeling technique for compressing high-dimensional data into nonlinear coordinates. First, we compute a reduced set of intrinsic coordinates (latent space) to accurately describe a flow with fewer degrees of freedom than the numerical discretization. The latent space, which is geometrically a manifold, is inferred by an autoencoder. Second, we leverage tools from differential geometry to develop numerical methods for operating directly on the latent space; namely, a metric-constrained Eikonal solver for distance computations. With this proposed numerical framework, we propose an algorithm to perform PLD on the manifold. Third, we demonstrate results for a laminar flow case and the turbulent Kolmogorov flow. For the laminar flow case, we are able to identify a semi-analytical expression for the solution of Navier-Stokes; in the Kolmogorov flow case, we are able to identify a dominant mode that exhibits physical structures, which are compared with POD. This work opens opportunities for analyzing autoencoders and latent spaces, nonlinear reduced-order modeling and scientific insights into the structure of high-dimensional data.

[LG-61] Learning Mamba as a Continual Learner

Link: https://arxiv.org/abs/2412.00776
Authors: Chongyang Zhao,Dong Gong
Keywords: MCL, efficiently learn, learn and accumulate, accumulate knowledge, data stream
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Continual learning (CL) aims to efficiently learn and accumulate knowledge from a data stream with different distributions. By formulating CL as a sequence prediction task, meta-continual learning (MCL) makes it possible to meta-learn an efficient continual learner based on recent advanced sequence models, e.g., Transformers. Although attention-free models (e.g., Linear Transformers) can ideally match CL’s essential objective and efficiency requirements, they usually do not perform well in MCL. Considering that the attention-free Mamba achieves excellent performance matching that of Transformers on general sequence modeling tasks, in this paper we aim to answer a question: can attention-free Mamba perform well on MCL? By formulating Mamba with a selective state space model (SSM) for MCL tasks, we propose to meta-learn Mamba as a continual learner, referred to as MambaCL. By incorporating a selectivity regularization, we can effectively train MambaCL. Through comprehensive experiments across various CL tasks, we also explore how Mamba and other models perform in different MCL scenarios. Our experiments and analyses highlight the promising performance and generalization capabilities of Mamba in MCL.

[LG-62] Bridging Fairness Gaps: A (Conditional) Distance Covariance Perspective in Fairness Learning

Link: https://arxiv.org/abs/2412.00720
Authors: Ruifan Huang,Haixia Liu
Keywords: distance covariance, sensitive attributes, distance covariance statistics, statistical perspective, perspective by selectively
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: 25 pages, 4 figures

Click to view abstract

Abstract:We bridge fairness gaps from a statistical perspective by selectively utilizing either conditional distance covariance or distance covariance statistics as measures to assess the independence between predictions and sensitive attributes. We enhance fairness by incorporating sample (conditional) distance covariance as a manageable penalty term into the machine learning process. Additionally, we present the matrix form of empirical (conditional) distance covariance for parallel calculations to enhance computational efficiency. Theoretically, we provide a proof for the convergence between empirical and population (conditional) distance covariance, establishing necessary guarantees for batch computations. Through experiments conducted on a range of real-world datasets, we have demonstrated that our method effectively bridges the fairness gap in machine learning.
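The matrix form of sample distance covariance mentioned here is short enough to sketch. The version below is the standard (unconditional, V-statistic) estimator for scalar samples, written with plain lists; the function names are illustrative and the conditional variant from the paper is not covered.

```python
def _centered_distances(xs):
    """Pairwise |x_i - x_j| matrix, double-centered by row/column/grand means."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]          # equals column means by symmetry
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_covariance_sq(xs, ys):
    """Squared sample distance covariance: mean of the elementwise product."""
    n = len(xs)
    a = _centered_distances(xs)
    b = _centered_distances(ys)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

predictions = [-2.0, -1.0, 0.0, 1.0, 2.0]
attribute = [p * p for p in predictions]   # nonlinear dependence, zero Pearson covariance
dependent = distance_covariance_sq(predictions, attribute)
unrelated = distance_covariance_sq(predictions, [5.0] * 5)
```

In this example the Pearson covariance between `predictions` and `attribute` is zero, yet the distance covariance is positive, which is why it is useful as an independence measure. Used as a fairness penalty, a term of this kind would be added to the training loss with a weight chosen by the practitioner.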

[LG-63] Towards Privacy-Preserving Medical Imaging: Federated Learning with Differential Privacy and Secure Aggregation Using a Modified ResNet Architecture NEURIPS2024

Link: https://arxiv.org/abs/2412.00687
Authors: Mohamad Haj Fares,Ahmed Mohamed Saad Emam Saad
Keywords: Secure Multi-Party Computation, medical image classification, combines local differential, Multi-Party Computation, secure aggregation
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) - MusIML Workshop

Click to view abstract

Abstract:With increasing concerns over privacy in healthcare, especially for sensitive medical data, this research introduces a federated learning framework that combines local differential privacy and secure aggregation using Secure Multi-Party Computation for medical image classification. Further, we propose DPResNet, a modified ResNet architecture optimized for differential privacy. Leveraging the BloodMNIST benchmark dataset, we simulate a realistic data-sharing environment across different hospitals, addressing the distinct privacy challenges posed by federated healthcare data. Experimental results indicate that our privacy-preserving federated model achieves accuracy levels close to non-private models, surpassing traditional approaches while maintaining strict data confidentiality. By enhancing the privacy, efficiency, and reliability of healthcare data management, our approach offers substantial benefits to patients, healthcare providers, and the broader healthcare ecosystem.
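Secure aggregation can be illustrated with pairwise additive masking, a common construction in which each pair of clients shares a random mask that one adds and the other subtracts. Whether the paper uses this exact protocol is not stated in the abstract, so the sketch below is an assumption, as are the modulus and the helper names. It also uses Python's non-cryptographic `random` module and a central mask generator purely for demonstration, and it omits the local differential privacy noise.

```python
import random

MODULUS = 2**31 - 1   # hypothetical modulus for the masked integer arithmetic

def pairwise_masks(n_clients, length, seed=0):
    """Give every client a mask vector; all masks sum to zero mod MODULUS."""
    rng = random.Random(seed)   # a real protocol derives these from shared keys
    masks = [[0] * length for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            for k in range(length):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks

def mask_update(update, mask):
    """What a client sends: its (quantized) update hidden by its mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """What the server computes: the masks cancel, leaving only the sum."""
    return [sum(col) % MODULUS for col in zip(*masked_updates)]

updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]   # three hospitals
masks = pairwise_masks(len(updates), length=3)
sent = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(sent)
```

The server only ever sees the masked vectors in `sent`, yet the aggregate equals the plain sum of the updates because every shared mask is added once and subtracted once.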

[LG-64] DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation

Link: https://arxiv.org/abs/2412.00648
Authors: Jingyang Xiang,Saiqian Zhang
Keywords: attracted significant attention, recently attracted significant, randomized Hadamard transforms, large language models, significant attention
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 38 figures, source code at this https URL

Click to view abstract

Abstract:Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms. Notably, the reason behind this phenomenon remains unknown. In this paper, we find that these transformations show substantial improvement in eliminating outliers for common tokens and achieve similar quantization error. The primary reason for the accuracy difference lies in the fact that randomized Hadamard transforms can slightly reduce the quantization error for tokens with massive activations, while randomized orthogonal transforms increase the quantization error. Due to the extreme rarity of these tokens and their critical impact on model accuracy, we consider this a long-tail optimization problem, and therefore construct a simple yet effective method: a weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that involves alternating optimization of quantization parameters while employing orthogonal Procrustes transforms to refine the rotation matrix. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. Our method enhances rotated LLMs by making them both Outlier-Free and Massive Activation-Free (“dual free”), and is therefore dubbed DFRot. Extensive experiments demonstrate the effectiveness and efficiency of DFRot. By tuning the rotation matrix using just a single sample, DFRot achieves a perplexity improvement of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-8B, a model known for its quantization challenges.
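To see why rotation helps before quantization, the sketch below applies a randomized Hadamard transform to a toy activation vector with one large outlier. It shows only the baseline rotation that the abstract compares against, not DFRot's refined rotation; the vector, the seed, and the function names are made up for illustration.

```python
import random
from math import sqrt

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / sqrt(n) for v in out]

def randomized_hadamard(vec, seed=0):
    """Flip signs at random, then apply the Hadamard rotation."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return fwht([s * v for s, v in zip(signs, vec)])

activations = [0.1] * 16
activations[3] = 20.0                      # a single massive activation
rotated = randomized_hadamard(activations)

peak_before = max(abs(v) for v in activations)
peak_after = max(abs(v) for v in rotated)
norm_before = sqrt(sum(v * v for v in activations))
norm_after = sqrt(sum(v * v for v in rotated))
```

The rotation preserves the vector's norm but spreads the outlier's energy across all coordinates, so the largest magnitude drops from 20 to roughly 5. A quantizer with a fixed number of levels then wastes far less of its range on one extreme value.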

[LG-65] Revisit Non-parametric Two-sample Testing as a Semi-supervised Learning Problem

链接: https://arxiv.org/abs/2412.00613
作者: Xunye Tian,Liuhua Peng,Zhijian Zhou,Mingming Gong,Feng Liu
关键词-EN: learning inherent representations, two-sample testing, learning discriminative representations, effective data representations, data inherent features
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

Abstract:Learning effective data representations is crucial in answering whether two samples X and Y are from the same distribution (a.k.a. the non-parametric two-sample testing problem). Existing approaches can be categorized into: i) learning discriminative representations (DRs) that distinguish between two samples in a supervised-learning paradigm, and ii) learning inherent representations (IRs) focusing on data’s inherent features in an unsupervised-learning paradigm. However, both paradigms have issues: learning DRs reduces the data points available for the two-sample testing phase, and learning purely IRs misses discriminative cues. To mitigate both issues, we propose a novel perspective to consider non-parametric two-sample testing as a semi-supervised learning (SSL) problem, introducing the SSL-based Classifier Two-Sample Test (SSL-C2ST) framework. While a straightforward implementation of SSL-C2ST might directly use existing state-of-the-art (SOTA) SSL methods to train a classifier with labeled data (with sample indexes X or Y) and unlabeled data (the remaining ones in the two samples), conventional two-sample testing data often exhibits substantial overlap between samples and violates SSL methods’ assumptions, resulting in low test power. Therefore, we propose a two-step approach: first, learn IRs using all data, then fine-tune IRs with only labeled data to learn DRs, which can both utilize information from the whole dataset and adapt the discriminative power to the given data. Extensive experiments and theoretical analysis demonstrate that SSL-C2ST outperforms traditional C2ST by effectively leveraging unlabeled data. We also offer a stronger empirically designed test achieving SOTA performance on many two-sample testing datasets.
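
For readers unfamiliar with the baseline, the following is a minimal sketch of a plain classifier two-sample test (C2ST), not of SSL-C2ST itself. It assumes 1-D data, a nearest-mean classifier, and a permutation null on the held-out labels; all of these choices, and the function name `c2st`, are illustrative. Note how the train/test split leaves only part of the data for the testing phase, which is the limitation the abstract points out.

```python
import random

def c2st(x, y, train_frac=0.5, n_perm=500, seed=0):
    """Classifier two-sample test on 1-D samples; returns (accuracy, p_value)."""
    rng = random.Random(seed)
    data = [(v, 0) for v in x] + [(v, 1) for v in y]
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    train, test = data[:cut], data[cut:]

    # "Training": estimate one mean per sample label on the training split.
    vals0 = [v for v, lab in train if lab == 0]
    vals1 = [v for v, lab in train if lab == 1]
    mu0 = sum(vals0) / len(vals0)
    mu1 = sum(vals1) / len(vals1)

    # Nearest-mean predictions on the held-out split.
    preds = [0 if abs(v - mu0) <= abs(v - mu1) else 1 for v, _ in test]
    labels = [lab for _, lab in test]

    def accuracy(labs):
        return sum(p == l for p, l in zip(preds, labs)) / len(labs)

    observed = accuracy(labels)

    # Permutation null: shuffling held-out labels breaks any real difference.
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if accuracy(perm) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

gen = random.Random(1)
x = [gen.gauss(0.0, 1.0) for _ in range(200)]  # sample from N(0, 1)
y = [gen.gauss(3.0, 1.0) for _ in range(200)]  # sample from N(3, 1)
acc, p_value = c2st(x, y)
print(f"held-out accuracy: {acc:.2f}, permutation p-value: {p_value:.3f}")
```

Held-out accuracy well above 0.5 with a small p-value is evidence that the two samples come from different distributions.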

[LG-66] Exploration and Evaluation of Bias in Cyberbullying Detection with Machine Learning

链接: https://arxiv.org/abs/2412.00609
作者: Andrew Root,Liam Jakubowski,Mounika Vanamala
关键词-EN: machine learning, resulting machine learning, machine learning model, generalize to unseen, data
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

Abstract:It is well known that the usefulness of a machine learning model is due to its ability to generalize to unseen data. This study uses three popular cyberbullying datasets to explore the effects of data, how it’s collected, and how it’s labeled, on the resulting machine learning models. The bias introduced by differing definitions of cyberbullying and by data collection is discussed in detail. Emphasis is placed on the impact of dataset expansion methods, which utilize current data points to fetch and label new ones. Furthermore, explicit testing is performed to evaluate the ability of a model to generalize to unseen datasets through cross-dataset evaluation. As hypothesized, the models show a significant drop in Macro F1 Score, with an average drop of 0.222. As such, this study highlights the importance of dataset curation and cross-dataset testing for creating models with real-world applicability. The experiments and other code can be found at this https URL.
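
The metric behind the reported drop can be sketched as follows. This is a minimal macro-F1 implementation evaluated on made-up labels; the toy predictions are purely illustrative and are not taken from the paper's datasets or results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: 1 = cyberbullying, 0 = not cyberbullying.
y_true_in = [1, 1, 1, 0, 0, 0, 0, 0]     # test split of the training dataset
y_pred_in = [1, 1, 0, 0, 0, 0, 0, 0]
y_true_cross = [1, 1, 1, 0, 0, 0, 0, 0]  # a different, unseen dataset
y_pred_cross = [1, 0, 0, 1, 1, 0, 0, 0]

f1_in = macro_f1(y_true_in, y_pred_in)
f1_cross = macro_f1(y_true_cross, y_pred_cross)
print(f"in-dataset: {f1_in:.3f}, cross-dataset: {f1_cross:.3f}, drop: {f1_in - f1_cross:.3f}")
```

Because macro F1 averages per-class scores without weighting by class frequency, errors on the minority class weigh as much as errors on the majority class.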

[LG-67] Contextual Bandits in Payment Processing: Non-uniform Exploration and Supervised Learning at Adyen WWW’25

链接: https://arxiv.org/abs/2412.00569
作者: Akhila Vangara,Alex Egg
关键词-EN: Uniform random exploration, incurs high regret, Uniform random, decision-making systems supports, systems supports off-policy
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 7 pages, 10 figures, submitted to WWW '25

Abstract:Uniform random exploration in decision-making systems supports off-policy learning via supervision but incurs high regret, making it impractical for many applications. Conversely, non-uniform exploration offers better immediate performance but lacks support for off-policy learning. Recent research suggests that regression oracles can bridge this gap by combining non-uniform exploration with supervised learning. In this paper, we analyze these approaches within a real-world industrial context at Adyen, a large global payments processor characterized by batch logged delayed feedback, short-term memory, and dynamic action spaces under the Empirical Risk Minimization (ERM) framework. Our analysis reveals that while regression oracles significantly improve performance, they introduce challenges due to rigid algorithmic assumptions. Specifically, we observe that as a policy improves, subsequent generations may perform worse due to shifts in the reward distribution and increased class imbalance in the training data. This degradation occurs despite improvements in other aspects of the training data, leading to decreased performance in successive policy iterations. We further explore the long-term impact of regression oracles, identifying a potential “oscillation effect.” This effect arises when regression oracles influence probability estimates and the realizability of subsequent policy models, leading to fluctuations in performance across iterations. Our findings highlight the need for more adaptable algorithms that can leverage the benefits of regression oracles without introducing instability in policy performance over time.
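
The abstract does not say which exploration scheme is used. As one well-known way to turn a regression oracle's predictions into a non-uniform exploration policy, the sketch below assumes inverse gap weighting, as used in SquareCB-style algorithms. The scores, the value of `gamma`, and the function name are hypothetical.

```python
def inverse_gap_weighting(scores, gamma):
    """Map predicted rewards to action probabilities.

    Actions whose predicted reward is far below the best one get less
    probability; gamma controls how greedy the policy is (gamma = 0 is uniform).
    """
    k = len(scores)
    best = max(range(k), key=lambda a: scores[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            probs[a] = 1.0 / (k + gamma * (scores[best] - scores[a]))
    probs[best] = 1.0 - sum(probs)  # remaining mass goes to the greedy action
    return probs

# Hypothetical oracle predictions for three candidate actions.
probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
uniform = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=0.0)
print([round(p, 4) for p in probs])
```

Every action keeps a nonzero probability, so the logged propensities can still be used for off-policy learning, while most traffic goes to the action the oracle currently prefers.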

[LG-68] The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning NEURIPS2024

链接: https://arxiv.org/abs/2412.00568
作者: Ruben Ohana,Michael McCabe,Lucas Meyer,Rudy Morel,Fruzsina J. Agocs,Miguel Beneitez,Marsha Berger,Blakesley Burkhart,Stuart B. Dalziel,Drummond B. Fielding,Daniel Fortunato,Jared A. Goldberg,Keiya Hirashima,Yan-Fei Jiang,Rich R. Kerswell,Suryanarayana Maddu,Jonah Miller,Payel Mukhopadhyay,Stefan S. Nixon,Jeff Shen,Romain Watteaux,Bruno Régaldo-Saint Blancard,François Rozet,Liam H. Parker,Miles Cranmer,Shirley Ho
关键词-EN: Machine learning based, accelerating simulation-based workflows, learning based surrogate, offer researchers powerful, researchers powerful tools
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

Abstract:Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite. To facilitate usage of the Well, we provide a unified PyTorch interface for training and evaluating models. We demonstrate the function of this library by introducing example baselines that highlight the new challenges posed by the complex dynamics of the Well. The code and data are available at this https URL.

[LG-69] Rank It Then Ask It: Input Reranking for Maximizing the Performance of LLMs on Symmetric Tasks

链接: https://arxiv.org/abs/2412.00546
作者: Mohsen Dehghankar,Abolfazl Asudeh
关键词-EN: Large language models, language models, range of domains, LLMs, quickly emerged
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

Abstract:Large language models (LLMs) have quickly emerged as practical and versatile tools that provide new solutions for a wide range of domains. In this paper, we consider the application of LLMs on symmetric tasks where a query is asked on an (unordered) bag of elements. Examples of such tasks include answering aggregate queries on a database table. In general, when the bag contains a large number of elements, LLMs tend to overlook some elements, leading to challenges in generating accurate responses to the query. LLMs receive their inputs as ordered sequences. However, in this problem, we leverage the fact that the symmetric input is not ordered, and reordering should not affect the LLM’s response. Observing that LLMs are less likely to miss elements at certain positions of the input, we introduce the problem of LLM input reranking: to find a ranking of the input that maximizes the LLM’s accuracy for the given query without making explicit assumptions about the query. Finding the optimal ranking requires identifying (i) the relevance of each input element for answering the query and (ii) the importance of each rank position for the LLM’s attention. We develop algorithms for estimating these values efficiently utilizing a helper LLM. We conduct comprehensive experiments on different synthetic and real datasets to validate our proposal and to evaluate the effectiveness of our proposed algorithms. Our experiments confirm that our reranking approach improves the accuracy of the LLMs on symmetric tasks, reaching up to 99% proximity to the optimum upper bound.
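
Once the two quantities named in the abstract are available, the final placement step can be sketched as a simple sort-and-match. This sketch assumes the goal is to maximize the sum of relevance times position importance, which pairing the largest values with each other achieves; the relevance and exposure numbers, and the function name `rerank`, are made up for illustration, and the estimation with a helper LLM is not shown.

```python
def rerank(elements, relevance, exposure):
    """Place the most relevant elements at the most attended positions."""
    positions = sorted(range(len(exposure)), key=lambda i: -exposure[i])
    ranked = sorted(range(len(elements)), key=lambda i: -relevance[i])
    out = [None] * len(elements)
    for pos, idx in zip(positions, ranked):
        out[pos] = elements[idx]
    return out

# Hypothetical estimates: the start and end of the prompt are attended most.
elements = ["a", "b", "c", "d"]
relevance = [0.1, 0.9, 0.5, 0.3]  # relevance of each element to the query
exposure = [0.9, 0.2, 0.3, 0.8]   # importance of each rank position
reordered = rerank(elements, relevance, exposure)
print(reordered)
```

Here the most relevant element "b" lands at the first position and the next most relevant, "c", at the last, the two positions with the highest assumed exposure.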

[LG-70] Context-Based Echo State Networks with Prediction Confidence for Human-Robot Shared Control

链接: https://arxiv.org/abs/2412.00541
作者: Negin Amirshirzad,Mehmet Arda Eren,Erhan Oztop
关键词-EN: Context-based Echo State, Echo State Network, Context-based Echo, Echo State, State Network
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
*备注:

Abstract:In this paper, we propose a novel lightweight learning from demonstration (LfD) model based on reservoir computing that can learn and generate multiple movement trajectories with prediction intervals, which we call the Context-based Echo State Network with prediction confidence (CESN+). CESN+ can generate movement trajectories that may go beyond the initial LfD training based on a desired set of conditions while providing confidence on its generated output. To assess the abilities of CESN+, we first evaluate its performance against Conditional Neural Movement Primitives (CNMP), a comparable framework that uses a conditional neural process to generate movement primitives. Our findings indicate that CESN+ not only outperforms CNMP but is also faster to train and demonstrates impressive performance in generating trajectories for extrapolation cases. In human-robot shared control applications, the confidence of the machine generated trajectory is a key indicator of how to arbitrate control sharing. To show the usability of CESN+ for human-robot adaptive shared control, we have designed a proof-of-concept human-robot shared control task and tested its efficacy in adapting the sharing weight between the human and the robot by comparing it to a fixed-weight control scheme. The simulation experiments show that with CESN+ based adaptive sharing the total human load in shared control can be significantly reduced. Overall, the developed CESN+ model is a strong lightweight LfD system with desirable properties such as fast training and the ability to extrapolate to new task parameters while producing robust prediction intervals for its output.

[LG-71] Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities

链接: https://arxiv.org/abs/2412.00538
作者: Ayush Mohanty,Jason Dekarske,Stephen K. Robinson,Sanjay Joshi,Nagi Gebraeel
关键词-EN: degrade over time, robotic manipulator Remaining, task severity, Remaining Lifetime Distribution, RUL
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
*备注:

Abstract:Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot’s end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator’s Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot’s position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches – (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
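
The Monte Carlo route to RUL mentioned in the abstract can be sketched as below. The sketch assumes a two-state severity chain, a fixed drift per severity level (the paper uses a random drift parameter), failure defined as the degradation signal crossing a threshold, and a first-order time discretization of the CTMC. All parameter values and function names are hypothetical.

```python
import math
import random

def simulate_rul(q_low_high, q_high_low, rng, drift=(0.5, 2.0), sigma=0.2,
                 threshold=10.0, dt=0.05, t_max=200.0):
    """Simulate one degradation path and return the threshold-crossing time."""
    state = 0           # 0 = low-severity task, 1 = high-severity task
    level, t = 0.0, 0.0
    while level < threshold and t < t_max:
        rate = q_low_high if state == 0 else q_high_low
        if rng.random() < rate * dt:  # approximate CTMC jump within dt
            state = 1 - state
        # Brownian motion increment with severity-dependent drift.
        level += drift[state] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return t

def mean_rul(q_low_high, q_high_low, n_runs=200, seed=0):
    """Average crossing time over n_runs simulated paths."""
    rng = random.Random(seed)
    total = sum(simulate_rul(q_low_high, q_high_low, rng) for _ in range(n_runs))
    return total / n_runs

rul_mostly_low = mean_rul(q_low_high=0.1, q_high_low=2.0)   # rarely high severity
rul_mostly_high = mean_rul(q_low_high=2.0, q_high_low=0.1)  # mostly high severity
print(f"mean RUL, mostly low severity:  {rul_mostly_low:.1f}")
print(f"mean RUL, mostly high severity: {rul_mostly_high:.1f}")
```

With these assumed parameters the fleet that spends most of its time in the high-severity state reaches the threshold sooner, matching the qualitative finding stated in the abstract.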

[LG-72] Exact Certification of (Graph) Neural Networks Against Label Poisoning

链接: https://arxiv.org/abs/2412.00537
作者: Mahalakshmi Sabanayagam,Lukas Gosch,Stephan Günnemann,Debarghya Ghoshdastidar
关键词-EN: Machine learning models, Machine learning, adversarial modification, compromise performance, label flipping
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Under review

Abstract:Machine learning models are highly vulnerable to label flipping, i.e., the adversarial modification (poisoning) of training labels to compromise performance. Thus, deriving robustness certificates is important to guarantee that test predictions remain unaffected and to understand worst-case robustness behavior. However, for Graph Neural Networks (GNNs), the problem of certifying label flipping has so far been unsolved. We change this by introducing an exact certification method, deriving both sample-wise and collective certificates. Our method leverages the Neural Tangent Kernel (NTK) to capture the training dynamics of wide networks enabling us to reformulate the bilevel optimization problem representing label flipping into a Mixed-Integer Linear Program (MILP). We apply our method to certify a broad range of GNN architectures in node classification tasks. Thereby, concerning the worst-case robustness to label flipping: (i) we establish hierarchies of GNNs on different benchmark graphs; (ii) quantify the effect of architectural choices such as activations, depth and skip-connections; and surprisingly, (iii) uncover a novel phenomenon of the robustness plateauing for intermediate perturbation budgets across all investigated datasets and architectures. While we focus on GNNs, our certificates are applicable to sufficiently wide NNs in general through their NTK. Thus, our work presents the first exact certificate to a poisoning attack ever derived for neural networks, which could be of independent interest.

[LG-73] A Self-Explainable Heterogeneous GNN for Relational Deep Learning

链接: https://arxiv.org/abs/2412.00521
作者: Francesco Ferrini,Antonio Longa,Andrea Passerini,Manfred Jaeger
关键词-EN: graph neural network, significant attention, enabling the application, neural network, technology for predictive
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

Abstract:Recently, significant attention has been given to the idea of viewing relational databases as heterogeneous graphs, enabling the application of graph neural network (GNN) technology for predictive tasks. However, existing GNN methods struggle with the complexity of the heterogeneous graphs induced by databases with numerous tables and relations. Traditional approaches either consider all possible relational meta-paths, thus failing to scale with the number of relations, or rely on domain experts to identify relevant meta-paths. A recent solution does manage to learn informative meta-paths without expert supervision, but assumes that a node’s class depends solely on the existence of a meta-path occurrence. In this work, we present a self-explainable heterogeneous GNN for relational data that supports models in which class membership depends on aggregate information obtained from multiple occurrences of a meta-path. Experimental results show that in the context of relational databases, our approach effectively identifies informative meta-paths that faithfully capture the model’s reasoning mechanisms. It significantly outperforms existing methods in both synthetic and real-world scenarios.

[LG-74] Distributed Differentially Private Data Analytics via Secure Sketching

链接: https://arxiv.org/abs/2412.00497
作者: Jakob Burkhardt,Hannah Keller,Claudio Orlandi,Chris Schwiegelshohn
关键词-EN: resulting distributed algorithm, differentially private mechanism, central model, distributed differentially private, model
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

Abstract:We explore the use of distributed differentially private computations across multiple servers, balancing the tradeoff between the error introduced by the differentially private mechanism and the computational efficiency of the resulting distributed algorithm. We introduce the linear-transformation model, where clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients. Previously, such accuracy had only been achieved in the more expressive central model. 
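
Why a public linear map is easy to distribute can be seen in a short sketch: additive secret shares of the clients' inputs can be transformed independently by each server, and the results still sum to the transform of the true total. The sketch assumes integer inputs, arithmetic modulo a public prime, three servers, and a small public matrix `S`; these values and the function names are illustrative. The differentially private noise and its calibration are not shown, so this is not a private mechanism on its own.

```python
import random

P = 2_147_483_647  # public prime modulus (2**31 - 1), an arbitrary choice

def matvec(S, v):
    """Apply the public matrix S to vector v, modulo P."""
    return [sum(a * b for a, b in zip(row, v)) % P for row in S]

def share(x, n_servers, rng):
    """Split x into n_servers additive shares that sum to x modulo P."""
    shares = [[rng.randrange(P) for _ in x] for _ in range(n_servers - 1)]
    last = [(xi - sum(s[i] for s in shares)) % P for i, xi in enumerate(x)]
    return shares + [last]

rng = random.Random(0)
S = [[1, 0, 2, 1],
     [0, 3, 1, 1]]                                   # public sketching matrix
clients = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8]]  # private client inputs
n_servers = 3

# Each server accumulates one share from every client.
server_inputs = [[0] * 4 for _ in range(n_servers)]
for x in clients:
    for j, s in enumerate(share(x, n_servers, rng)):
        server_inputs[j] = [(a + b) % P for a, b in zip(server_inputs[j], s)]

# Servers apply S locally; only the combined output is revealed.
server_outputs = [matvec(S, v) for v in server_inputs]
combined = [sum(col) % P for col in zip(*server_outputs)]

direct = matvec(S, [sum(col) for col in zip(*clients)])
print(combined, direct)
```

Each individual server only ever sees uniformly random-looking shares, yet the combined output equals the sketch of the summed inputs.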

[LG-75] Rethinking Strategic Mechanism Design In The Age Of Large Language Models: New Directions For Communication Systems

链接: https://arxiv.org/abs/2412.00495
作者: Ismail Lotfi,Nouf Alabbasi,Omar Alhussein
关键词-EN:
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: submitted to IEEE IoTM

[LG-76] Automatic Differentiation-based Full Waveform Inversion with Flexible Workflows

链接: https://arxiv.org/abs/2412.00486
作者: Feng Liu,Haipeng Li,Guangyuan Zou,Junlun Li
关键词-EN:
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
*备注: Manuscript including 14 pages supplement. Code link: this https URL

[LG-77] On autoregressive deep learning models for day-ahead wind power forecasting with irregular shutdowns due to redispatching

链接: https://arxiv.org/abs/2412.00423
作者: Stefan Meisenbacher,Silas Aaron Selzer,Mehdi Dado,Maximilian Beichter,Tim Martin,Markus Zdrallek,Peter Bretschneider,Veit Hagenmeyer,Ralf Mikut
关键词-EN:
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

[LG-78] AutoPQ: Automating Quantile estimation from Point forecasts in the context of sustainability

链接: https://arxiv.org/abs/2412.00419
作者: Stefan Meisenbacher,Kaleb Phipps,Oskar Taubert,Marie Weiel,Markus Götz,Ralf Mikut,Veit Hagenmeyer
关键词-EN:
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

[LG-79] QuAKE: Speeding up Model Inference Using Quick and Approximate Kernels for Exponential Non-Linearities

链接: https://arxiv.org/abs/2412.00408
作者: Sai Kiran Narayanaswami,Gopalakrishnan Srinivasan,Balaraman Ravindran
关键词-EN:
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:

[LG-80] PAL – Parallel active learning for machine-learned potentials

链接: https://arxiv.org/abs/2412.00401
作者: Chen Zhou,Marlen Neubert,Yuri Koide,Yumeng Zhang,Van-Quan Vuong,Tobias Schlöder,Stefanie Dehnen,Pascal Friederich
关键词-EN:
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Distributed, Parallel, and Cluster Computing (cs.DC); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
*备注: 25 pages, 4 figures, and 1 table (references and SI included)

[LG-81] ARMOR: Egocentric Perception for Humanoid Robot Collision Avoidance and Motion Planning

链接: https://arxiv.org/abs/2412.00396
作者: Daehwa Kim,Mario Srouji,Chen Chen,Jian Zhang
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

[LG-82] On Foundation Models for Dynamical Systems from Purely Synthetic Data

链接: https://arxiv.org/abs/2412.00395
作者: Martin Ziegler,Andres Felipe Posada-Moreno,Friedrich Solowjow,Sebastian Trimpe
关键词-EN:
类目: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
*备注: 10 pages

[LG-83] Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation

链接: https://arxiv.org/abs/2412.00382
作者: Chengyu Li,Debo Cheng,Guixian Zhang,Yi Li,Shichao Zhang
关键词-EN:
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

[LG-84] Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding NEURIPS2024

链接: https://arxiv.org/abs/2412.00369
作者: Daniel Severo,Ashish Khisti,Alireza Makhzani
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: Published in NeurIPS 2024

[LG-85] Mechanism design with multi-armed bandit

链接: https://arxiv.org/abs/2412.00345
作者: Takayuki Osogami,Hirota Kinoshita,Segev Wasserkrug
关键词-EN:
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 35 pages

[LG-86] Robust Table Integration in Data Lakes

链接: https://arxiv.org/abs/2412.00324
作者: Daomin Ji,Hui Luo,Zhifeng Bao,Shane Culpepper
关键词-EN:
类目: Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

[LG-87] Bandit Learning in Matching Markets: Utilitarian and Rawlsian Perspectives

链接: https://arxiv.org/abs/2412.00301
作者: Hadi Hosseini,Duohan Zhang
关键词-EN:
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

[LG-88] Robust Testing for Deep Learning using Human Label Noise

链接: https://arxiv.org/abs/2412.00244
作者: Gordon Lim,Stefan Larson,Kevin Leach
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

[LG-89] Multigraph Message Passing with Bi-Directional Multi-Edge Aggregations

链接: https://arxiv.org/abs/2412.00241
作者: H. Çağrı Bilgi,Lydia Y. Chen,Kubilay Atasu
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 19 pages, 5 figures

[LG-90] Meta-learning Loss Functions of Parametric Partial Differential Equations Using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2412.00225
作者: Michail Koumpanakis,Ricardo Vilalta
关键词-EN:
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Computational Physics (physics.comp-ph)
*备注:

[LG-91] Improving the performance of weak supervision searches using data augmentation

链接: https://arxiv.org/abs/2412.00198
作者: Zong-En Chen,Cheng-Wei Chiang,Feng-Yang Hsieh
关键词-EN:
类目: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注:

[LG-92] Spatial Clustering of Molecular Localizations with Graph Neural Networks

链接: https://arxiv.org/abs/2412.00173
作者: Jesús Pineda,Sergi Masó-Orriols,Joan Bertran,Mattias Goksör,Giovanni Volpe,Carlo Manzo
关键词-EN:
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an); Quantitative Methods (q-bio.QM)
*备注:

[LG-93] Modelling Networked Dynamical System by Temporal Graph Neural ODE with Irregularly Partial Observed Time-series Data

链接: https://arxiv.org/abs/2412.00165
作者: Mengbang Zou,Weisi Guo
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

[LG-94] Dynamic High-Order Control Barrier Functions with Diffuser for Safety-Critical Trajectory Planning at Signal-Free Intersections

链接: https://arxiv.org/abs/2412.00162
作者: Di Chen,Ruiguo Zhong,Kehua Chen,Zhiwei Shang,Meixin Zhu,Edward Chung
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 7 figures, 3 tables, 12 pages

[LG-95] Semi-Supervised Neural Processes for Articulated Object Interactions

链接: https://arxiv.org/abs/2412.00145
作者: Emily Liu,Michael Noseworthy,Nicholas Roy
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

[LG-96] Scaling Particle Collision Data Analysis

链接: https://arxiv.org/abs/2412.00129
作者: Hengkui Wu,Panpan Chi,Yongfeng Zhu,Liujiang Liu,Shuyang Hu,Yuexin Wang,Chen Zhou,Qihao Wang,Yingsi Xin,Bruce Liu,Dahao Liang,Xinglong Jia,Manqi Ruan
关键词-EN:
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

[LG-97] Streamlined Federated Unlearning: Unite as One to Be Highly Efficient

链接: https://arxiv.org/abs/2412.00126
作者: Lei Zhou,Youwen Zhu,Qiao Xue,Ji Zhang,Pengfei Zhang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

[LG-98] Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression combined with Kernel-Based Support Vector Regression

链接: https://arxiv.org/abs/2412.00123
作者: Abhinav Das,Stephan Schlüter,Lorenz Schneider
关键词-EN:
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

[LG-99] Deep Neural Network-Based Prediction of B-Cell Epitopes for SARS-CoV and SARS-CoV-2: Enhancing Vaccine Design through Machine Learning

链接: https://arxiv.org/abs/2412.00109
作者: Xinyu Shi,Yixin Tao,Shih-Chi Lin
关键词-EN:
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
*备注:

[LG-100] Act Now: A Novel Online Forecasting Framework for Large-Scale Streaming Data

链接: https://arxiv.org/abs/2412.00108
作者: Daojun Liang,Haixia Zhang,Jing Wang,Dongfeng Yuan,Minggao Zhang
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 12 pages, 8 figures

[LG-101] Predicting Extubation Failure in Intensive Care: The Development of a Novel End-to-End Actionable and Interpretable Prediction System

链接: https://arxiv.org/abs/2412.00105
作者: Akram Yoosoofsah
关键词-EN:
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: Thesis submitted in fulfilment of requirements for the degree of Master of Science in Computing - Department of Computing, Imperial College London

[LG-102] Multi-Label Contrastive Learning: A Comprehensive Study

链接: https://arxiv.org/abs/2412.00101
作者: Alexandre Audibert,Aurélien Gauffre,Massih-Reza Amini
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: 28 pages, 1 figure

[LG-103] Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators

链接: https://arxiv.org/abs/2412.00088
作者: Zekun Shi,Zheyuan Hu,Min Lin,Kenji Kawaguchi
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

[LG-104] ONION: Physics-Informed Deep Learning Model for Line Integral Diagnostics Across Fusion Devices

链接: https://arxiv.org/abs/2412.00087
作者: Cong Wang,Weizhe Yang,Haiping Wang,Renjie Yang,Jing Li,Zhijun Wang,Xinyao Yu,Yixiong Wei,Xianli Huang,Zhaoyang Liu,Changqing Zou,Zhifeng Zhao
关键词-EN:
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

[LG-105] Dynamic Non-Prehensile Object Transport via Model-Predictive Reinforcement Learning

链接: https://arxiv.org/abs/2412.00086
作者: Neel Jawale,Byron Boots,Balakumar Sundaralingam,Mohak Bhardwaj
关键词-EN:
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 pages

[LG-106] Unpacking the Individual Components of Diffusion Policy

链接: https://arxiv.org/abs/2412.00084
作者: Xiu Yuan
关键词-EN:
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

[LG-107] Task Singular Vectors: Reducing Task Interference in Model Merging

链接: https://arxiv.org/abs/2412.00081
作者: Antonio Andrea Gargiulo,Donato Crisostomi,Maria Sofia Bucarelli,Simone Scardapane,Fabrizio Silvestri,Emanuele Rodolà
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 17 figures, 6 tables

[LG-108] Deep Learning-Based Electricity Price Forecast for Virtual Bidding in Wholesale Electricity Market

链接: https://arxiv.org/abs/2412.00062
作者: Xuesong Wang,Sharaf K. Magableh,Oraib Dawaghreh,Caisheng Wang,Jiaxuan Gong,Zhongyang Zhao,Michael H. Liao
关键词-EN:
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: Submitted to 2025 IEEE PES General Meeting

[LG-109] Less is More: Efficient Model Merging with Binary Task Switch

链接: https://arxiv.org/abs/2412.00054
作者: Biqing Qi,Fangyuan Li,Zhen Wang,Junqi Gao,Dong Li,Peng Ye,Bowen Zhou
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

[LG-110] The Femininomenon of Inequality: A Data-Driven Analysis and Cluster Profiling in Indonesia

链接: https://arxiv.org/abs/2412.00012
作者: J. S. Muthmaina
关键词-EN:
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

[LG-111] The Data-Driven Censored Newsvendor Problem

链接: https://arxiv.org/abs/2412.01763
作者: Chamsi Hssaine,Sean R. Sinclair
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 67 pages, 19 tables, 7 figures

[LG-112] Characterizing Jupiter's interior using machine learning reveals four key structures

链接: https://arxiv.org/abs/2412.01611
作者: Maayan Ziv,Eli Galanti,Saburo Howard,Tristan Guillot,Yohai Kaspi
关键词-EN:
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures, 3 tables, accepted for publication in A&A

[LG-113] Kernel-Based Optimal Control: An Infinitesimal Generator Approach

链接: https://arxiv.org/abs/2412.01591
作者: Petar Bevanda,Nicolas Hosichen,Tobias Wittmann,Jan Brüdigam,Sandra Hirche,Boris Houska
关键词-EN:
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

[LG-114] Unifying AMP Algorithms for Rotationally-Invariant Models

链接: https://arxiv.org/abs/2412.01574
作者: Songbin Liu,Junjie Ma
关键词-EN:
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

[LG-115] Generative modeling assisted simulation of measurement-altered quantum criticality

Link: https://arxiv.org/abs/2412.01513
Authors: Yuchen Zhu,Molei Tao,Yuebo Jin,Xie Chen
Keywords:
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:

[LG-116] TACO: Training-free Sound Prompted Segmentation via Deep Audio-visual CO-factorization

Link: https://arxiv.org/abs/2412.01488
Authors: Hugo Malard,Michel Olvera,Stephane Lathuiliere,Slim Essid
Keywords:
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments:

[LG-117] Refined Analysis of Federated Averaging's Bias and Federated Richardson-Romberg Extrapolation

Link: https://arxiv.org/abs/2412.01389
Authors: Paul Mangold,Alain Durmus,Aymeric Dieuleveut,Sergey Samsonov,Eric Moulines
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 37 pages

[LG-118] Big data approach to Kazhdan-Lusztig polynomials

Link: https://arxiv.org/abs/2412.01283
Authors: Abel Lacabanne,Daniel Tubbenhauer,Pedro Vaz
Keywords:
Categories: Representation Theory (math.RT); Machine Learning (cs.LG); Combinatorics (math.CO)
*Comments: 22 pages, many figures, comments welcome

[LG-119] Reliable and scalable variable importance estimation via warm-start and early stopping

Link: https://arxiv.org/abs/2412.01120
Authors: Zexuan Sun,Garvesh Raskutti
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

[LG-120] Spatial Conformal Inference through Localized Quantile Regression

Link: https://arxiv.org/abs/2412.01098
Authors: Hanyang Jiang,Yao Xie
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

[LG-121] An Efficient Unsupervised Framework for Convex Quadratic Programs via Deep Unrolling

Link: https://arxiv.org/abs/2412.01051
Authors: Linxin Yang,Bingheng Li,Tian Ding,Jianghua Wu,Akang Wang,Yuyi Wang,Jiliang Tang,Ruoyu Sun,Xiaodong Luo
Keywords:
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:

[LG-122] On the Feature Learning in Diffusion Models

Link: https://arxiv.org/abs/2412.01021
Authors: Andi Han,Wei Huang,Yuan Cao,Difan Zou
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

[LG-123] Energy-Based Modelling for Discrete and Mixed Data via Heat Equations on Structured Spaces NEURIPS2024

Link: https://arxiv.org/abs/2412.01019
Authors: Tobias Schröder,Zijing Ou,Yingzhen Li,Andrew B. Duncan
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: To appear in the proceedings of NeurIPS 2024

[LG-124] A Note on Estimation Error Bound and Grouping Effect of Transfer Elastic Net

Link: https://arxiv.org/abs/2412.01010
Authors: Yui Tomo
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

[LG-125] AI Meets Antimatter: Unveiling Antihydrogen Annihilations NEURIPS

Link: https://arxiv.org/abs/2412.00961
Authors: Ashley Ferreira,Mahip Singh,Andrea Capra,Ina Carli,Daniel Duque Quiceno,Wojciech T. Fedorko,Makoto M. Fujiwara,Muyan Li,Lars Martin,Yukiya Saito,Gareth Smith,Anqi Xu
Keywords:
Categories: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG)
*Comments: 6 pages, 4 figures, submitted to Machine Learning and the Physical Sciences Workshop at the 38th conference on Neural Information Processing Systems (NeurIPS)

[LG-126] Construction of generalized samplets in Banach spaces

Link: https://arxiv.org/abs/2412.00954
Authors: Peter Balazs,Michael Multerer
Keywords:
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

[LG-127] 3D-PDR Orion dataset and NeuralPDR: Neural Differential Equations for Photodissociation Regions NEURIPS

Link: https://arxiv.org/abs/2412.00758
Authors: Gijs Vermariën,Serena Viti,Rahul Ravichandran,Thomas G. Bisbas
Keywords:
Categories: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*Comments: Accepted to NeurIPS Machine Learning and the Physical Sciences Workshop 2024

[LG-128] Invariant Measures in Time-Delay Coordinates for Unique Dynamical System Identification

Link: https://arxiv.org/abs/2412.00589
Authors: Jonah Botvinick-Greenhouse,Robert Martin,Yunan Yang
Keywords:
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
*Comments:

[LG-129] Operator learning regularization for macroscopic permeability prediction in dual-scale flow problem

Link: https://arxiv.org/abs/2412.00579
Authors: Christina Runkel,Sinan Xiao,Nicolas Boullé,Yang Chen
Keywords:
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*Comments: 23 pages, 7 figures

[LG-130] Optimal Particle-based Approximation of Discrete Distributions (OPAD)

Link: https://arxiv.org/abs/2412.00545
Authors: Hadi Mohasel Afshar,Gilad Francis,Sally Cripps
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:

[LG-131] Nonlinearity and Uncertainty Informed Moment-Matching Gaussian Mixture Splitting

Link: https://arxiv.org/abs/2412.00343
Authors: Jackson Kulik,Keith A. LeGrand
Keywords:
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

[LG-132] Differentiable High-Order Markov Models for Spectrum Prediction

Link: https://arxiv.org/abs/2412.00328
Authors: Vincent Corlay,Tatsuya Nakazato,Kanako Yamaguchi,Akinori Nakajima
Keywords:
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

[LG-133] Scaling of Stochastic Normalizing Flows in SU(3) lattice gauge theory

Link: https://arxiv.org/abs/2412.00200
Authors: Andrea Bulgarelli,Elia Cellini,Alessandro Nada
Keywords:
Categories: High Energy Physics - Lattice (hep-lat); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 14 pages, 12 figures

[LG-134] A Context-Based Numerical Format Prediction for a Text-To-Speech System

Link: https://arxiv.org/abs/2412.00028
Authors: Yaser Darwesh,Lit Wei Wern,Mumtaz Begum Mustafa
Keywords:
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*Comments: 21 pages, 6 tables, 1 figure

Information Retrieval

[IR-0] Global Estimation of Building-Integrated Facade and Rooftop Photovoltaic Potential by Integrating 3D Building Footprint and Spatio-Temporal Datasets

Link: https://arxiv.org/abs/2412.01291
Authors: Qing Yu,Kechuan Dong,Zhiling Guo,Jiaxing Li,Hongjun Tan,Yanxiu Jin,Jian Yuan,Haoran Zhang,Junwei Liu,Qi Chen,Jinyue Yan
Keywords: estimating Building-Integrated Photovoltaics, Building-Integrated Photovoltaics, spatial scales, BIPV potential, research tackles
Categories: Information Retrieval (cs.IR); Emerging Technologies (cs.ET)
*Comments: 17 pages, 5 figures

Abstract:This research tackles the challenges of estimating Building-Integrated Photovoltaics (BIPV) potential across various temporal and spatial scales, accounting for different geographical climates and urban morphology. We introduce a holistic methodology for evaluating BIPV potential, integrating 3D building footprint models with diverse meteorological data sources to account for dynamic shadow effects. The approach enables the assessment of PV potential on facades and rooftops at different levels: individual buildings, urban blocks, and cities globally. Through an analysis of 120 typical cities, we highlight the importance of 3D building forms, cityscape morphology, and geographic positioning in measuring BIPV potential at various levels. In particular, our simulation study reveals that among cities with optimal facade PV performance, the average ratio of facade PV potential to rooftop PV potential is approximately 68.2%. Additionally, approximately 17.5% of the analyzed samples demonstrate even higher facade PV potentials compared to rooftop installations. This finding underscores the strategic value of incorporating facade PV applications into urban sustainable energy systems.

[IR-1] Lossless and Privacy-Preserving Graph Convolution Network for Federated Item Recommendation

Link: https://arxiv.org/abs/2412.01141
Authors: Guowei Wu,Weike Pan,Qiang Yang,Zhong Ming
Keywords: federated recommendation methods, Graph neural network, recommendation methods, solution for item, federated recommendation
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:Graph neural network (GNN) has emerged as a state-of-the-art solution for item recommendation. However, existing GNN-based recommendation methods rely on a centralized storage of fragmented user-item interaction sub-graphs and training on an aggregated global graph, which will lead to privacy concerns. As a response, some recent works develop GNN-based federated recommendation methods by exploiting decentralized and fragmented user-item sub-graphs in order to preserve user privacy. However, due to privacy constraints, the graph convolution process in existing federated recommendation methods is incomplete compared with the centralized counterpart, causing a degradation of the recommendation performance. In this paper, we propose a novel lossless and privacy-preserving graph convolution network (LP-GCN), which fully completes the graph convolution process with decentralized user-item interaction sub-graphs while ensuring privacy. It is worth mentioning that its performance is equivalent to that of the non-federated (i.e., centralized) counterpart. Moreover, we validate its effectiveness through both theoretical analysis and empirical studies. Extensive experiments on three real-world datasets show that our LP-GCN outperforms the existing federated recommendation methods. The code will be publicly available once the paper is accepted.

[IR-2] Precision Profile Pollution Attack on Sequential Recommenders via Influence Function

Link: https://arxiv.org/abs/2412.01127
Authors: Xiaoyu Du,Yingying Chen,Yang Zhang,Jinhui Tang
Keywords: Sequential recommendation approaches, demonstrated remarkable proficiency, modeling user preferences, Sequential recommendation, approaches have demonstrated
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:Sequential recommendation approaches have demonstrated remarkable proficiency in modeling user preferences. Nevertheless, they are susceptible to profile pollution attacks (PPA), wherein items are introduced into a user’s interaction history deliberately to influence the recommendation list. Since retraining the model for each polluted item is time-consuming, recent PPAs estimate item influence based on gradient directions to identify the most effective attack candidates. However, the actual item representations diverge significantly from the gradients, resulting in disparate outcomes. To tackle this challenge, we introduce an INFluence Function-based Attack approach INFAttack that offers a more accurate estimation of the influence of polluting items. Specifically, we calculate the modifications to the original model using the influence function when generating polluted sequences by introducing specific items. Subsequently, we choose the sequence that has been most significantly influenced to substitute the original sequence, thus promoting the target item. Comprehensive experiments conducted on five real-world datasets illustrate that INFAttack surpasses all baseline methods and consistently delivers stable attack performance for both popular and unpopular items.

[IR-3] Patent-publication pairs for the detection of knowledge transfer from research to industry: reducing ambiguities with word embeddings and references

Link: https://arxiv.org/abs/2412.00978
Authors: Klaus Lippert,Konrad U. Förstner
Keywords: viewed and evaluated, economic exploitability, perspective, publication output, research
Categories: Information Retrieval (cs.IR)
*Comments: 16 pages, 8 figures

Abstract:The performance of medical research can be viewed and evaluated not only from the perspective of publication output, but also from the perspective of economic exploitability. Patents can represent the exploitation of research results and thus the transfer of knowledge from research to industry. In this study, we set out to identify publication-patent pairs in order to use patents as a proxy for the economic impact of research. To identify these pairs, we matched scholarly publications and patents by comparing the names of authors and investors. To resolve the ambiguities that arise in this name-matching process, we expanded our approach with two additional filter features, one used to assess the similarity of text content, the other to identify common references in the two document types. To evaluate text similarity, we extracted and transformed technical terms from a medical ontology (MeSH) into numerical vectors using word embeddings. We then calculated the results of the two supporting features over an example five-year period. Furthermore, we developed a statistical procedure which can be used to determine valid patent classes for the domain of medicine. Our complete data processing pipeline is freely available, from the raw data of the two document types right through to the validated publication-patent pairs.

[IR-4] Oracle-guided Dynamic User Preference Modeling for Sequential Recommendation

Link: https://arxiv.org/abs/2412.00813
Authors: Jiafeng Xia,Dongsheng Li,Hansu Gu,Tun Lu,Peng Zhang,Li Shang,Ning Gu
Keywords: user historical interactions, dynamic user preferences, user preferences, user preference modeling, user
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:Sequential recommendation methods can capture dynamic user preferences from user historical interactions to achieve better performance. However, most existing methods only use past information extracted from user historical interactions to train the models, leading to the deviations of user preference modeling. Besides past information, future information is also available during training, which contains the "oracle" user preferences in the future and will be beneficial to model dynamic user preferences. Therefore, we propose an oracle-guided dynamic user preference modeling method for sequential recommendation (Oracle4Rec), which leverages future information to guide model training on past information, aiming to learn "forward-looking" models. Specifically, Oracle4Rec first extracts past and future information through two separate encoders, then learns a forward-looking model through an oracle-guiding module which minimizes the discrepancy between past and future information. We also tailor a two-phase model training strategy to make the guiding more effective. Extensive experiments demonstrate that Oracle4Rec is superior to state-of-the-art sequential methods. Further experiments show that Oracle4Rec can be leveraged as a generic module in other sequential recommendation methods to improve their performance with a considerable margin.

[IR-5] Scaling New Frontiers: Insights into Large Recommendation Models

Link: https://arxiv.org/abs/2412.00714
Authors: Wei Guo,Hao Wang,Luankang Zhang,Jin Yao Chin,Zhongzhou Liu,Kai Cheng,Qiushi Pan,Yi Quan Lee,Wanqi Xue,Tingjia Shen,Kenan Song,Kefan Wang,Wenjia Xie,Yuyang Ye,Huifeng Guo,Yong Liu,Defu Lian,Ruiming Tang,Enhong Chen
Keywords: retrieving relevant information, large recommendation models, recommendation models, large recommendation, Recommendation
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:Recommendation systems are essential for filtering data and retrieving relevant information across various applications. Recent advancements have seen these systems incorporate increasingly large embedding tables, scaling up to tens of terabytes for industrial use. However, the expansion of network parameters in traditional recommendation models has plateaued at tens of millions, limiting further benefits from increased embedding parameters. Inspired by the success of large language models (LLMs), a new approach has emerged that scales network parameters using innovative structures, enabling continued performance improvements. A significant development in this area is Meta’s generative recommendation model HSTU, which illustrates the scaling laws of recommendation systems by expanding parameters to thousands of billions. This new paradigm has achieved substantial performance gains in online experiments. In this paper, we aim to enhance the understanding of scaling laws by conducting comprehensive evaluations of large recommendation models. Firstly, we investigate the scaling laws across different backbone architectures of the large recommendation models. Secondly, we conduct comprehensive ablation studies to explore the origins of these scaling laws. We then further assess the performance of HSTU, as the representative of large recommendation models, on complex user behavior modeling tasks to evaluate its applicability. Notably, we also analyze its effectiveness in ranking tasks for the first time. Finally, we offer insights into future directions for large recommendation models. Supplementary materials for our research are available on GitHub at this https URL.

[IR-6] Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex Natural Language Queries on Multi-modal Data

Link: https://arxiv.org/abs/2412.00639
Authors: Mahdi Erfanian,Mohsen Dehghankar,Abolfazl Asudeh
Keywords: rich information encoded, image data sets, miss the detailed, detailed descriptions, descriptions that properly
Categories: Information Retrieval (cs.IR); Databases (cs.DB)
*Comments:

Abstract:Multi-modal data, such as image data sets, often miss the detailed descriptions that properly capture the rich information encoded in them. This makes answering complex natural language queries a major challenge in these domains. In particular, unlike the traditional nearest-neighbor search, where the tuples and the query are modeled as points in a data cube, the query and the tuples are of different natures, making the traditional query answering solutions not directly applicable for such settings. Existing literature addresses this challenge for image data through vector representations jointly trained on natural language and images. This technique, however, underperforms for complex queries due to various reasons. This paper takes a step towards addressing this challenge by introducing a Generative-AI (GenAI) powered Monte Carlo method that utilizes foundation models to generate synthetic samples that capture the complexity of the natural language query and transform it to the same space of the multi-modal data. Following this method, we develop a system for image data retrieval and propose practical solutions that enable leveraging future advancements in GenAI and vector representations for improving our system’s performance. Our comprehensive experiments on various benchmark datasets verify that our system significantly outperforms state-of-the-art techniques.

[IR-7] The Impact of Generative AI on Student Churn and the Future of Formal Education

Link: https://arxiv.org/abs/2412.00605
Authors: Stephen Elbourn
Keywords: Generative Artificial Intelligence, Artificial Intelligence, presents unprecedented opportunities, Generative Artificial, Generative
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:In the contemporary educational landscape, the advent of Generative Artificial Intelligence (AI) presents unprecedented opportunities for personalised learning, fundamentally challenging the traditional paradigms of education. This research explores the emerging trend where high school students, empowered by tailored educational experiences provided by Generative AI, opt to forgo traditional university degrees to pursue entrepreneurial ventures at a younger age. To understand and predict the future of education in the age of Generative AI, we employ a comprehensive methodology to analyse social media data. Our approach includes sentiment analysis to gauge public opinion, topic modelling to identify key themes and emerging trends, and user demographic analysis to understand the engagement of different age groups and regions. We also perform influencer analysis to identify key figures shaping the discourse and engagement metrics to measure the level of interest and interaction with AI-related educational content. Content analysis helps us to determine the types of content being shared and the prevalent narratives, while hashtag analysis reveals the connectivity of discussions. The temporal analysis tracks changes over time and identifies event-based spikes in discussions. The insights derived from this analysis include the acceptance and adoption of Generative AI in education, its impact on traditional education models, the influence on students’ entrepreneurial ambitions, and the educational outcomes associated with AI-driven personalised learning. Additionally, we explore public sentiment towards policies and regulations and use predictive modelling to forecast future trends. This comprehensive social media analysis provides a nuanced understanding of the evolving educational landscape, offering valuable perspectives on the role of Generative AI in shaping the future of education.

[IR-8] CDEMapper: Enhancing NIH Common Data Element Normalization using Large Language Models

Link: https://arxiv.org/abs/2412.00491
Authors: Yan Wang,Jimin Huang,Huan He,Vincent Zhang,Yujia Zhou,Xubing Hao,Pritham Ram,Lingfei Qian,Qianqian Xie,Ruey-Ling Weng,Fongci Lin,Yan Hu,Licong Cui,Xiaoqian Jiang,Hua Xu,Na Hong
Keywords:
Categories: Information Retrieval (cs.IR)
*Comments: 11 pages, 4 figures

[IR-9] FairSort: Learning to Fair Rank for Personalized Recommendations in Two-Sided Platforms

Link: https://arxiv.org/abs/2412.00424
Authors: Guoli Wu,Zhiyong Feng,Shizhan Chen,Hongyue Wu,Xiao Xue,Jianmao Xiao,Guodong Fan,Hongqi Chen,Jingyu Li
Keywords:
Categories: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
*Comments:

Attachment Download

Click to download today's full paper list
