Today's arXiv Papers | 2024-12-05

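This post presents the latest papers retrieved from arXiv.org on 2024-12-05. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.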

Note: The daily paper data is retrieved from arXiv.org and updated automatically at about 12:00 every morning.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.

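[Quick read]: This paper addresses the high cost, poor scalability, and ethical concerns that come with traditional sociological research's reliance on human participants. The key to its approach is simulation driven by large language model (LLM) agents, which can mimic human behavior, replicate individual responses, and support interdisciplinary research. The paper divides these simulations into three types: Individual Simulation, Scenario Simulation, and Society Simulation, and discusses each type's architecture, classification of objectives, and evaluation methods, along with commonly used datasets and benchmarks. Through this taxonomy and discussion, it shows the potential and application trends of LLMs in sociological research.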

Link: https://arxiv.org/abs/2412.03563
Authors: Xinyi Mou,Xuanwen Ding,Qi He,Liang Wang,Jingcong Liang,Xinnong Zhang,Libo Sun,Jiayu Lin,Jie Zhou,Xuanjing Huang,Zhongyu Wei
Keywords: Traditional sociological research, Traditional sociological, challenging to scale, ethical concerns, sociological research
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and with ethical concerns. Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating studies on many interdisciplinary studies. In this paper, we conduct a comprehensive survey of this field, illustrating the recent progress in simulation driven by LLM-empowered agents. We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Society Simulation, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics. These simulations follow a progression, ranging from detailed individual modeling to large-scale societal phenomena. We provide a detailed discussion of each simulation type, including the architecture or key components of the simulation, the classification of objectives or scenarios and the evaluation method. Afterward, we summarize commonly used datasets and benchmarks. Finally, we discuss the trends across these three types of simulation. A repository for the related sources is at this https URL.

[NLP-1] Best-of-N Jailbreaking

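[Quick read]: This paper addresses the safety of frontier AI systems such as language models and vision language models, specifically how black-box attacks can bypass their defenses. Its key contribution is an algorithm called Best-of-N (BoN) Jailbreaking, which repeatedly samples augmented versions of an input prompt (for example, random shuffling or capitalization changes) until a harmful response is triggered. BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models such as GPT-4o and Claude 3.5 Sonnet, and is similarly effective against open-source defenses such as circuit breakers. The method also extends to other modalities, including vision language models and audio language models, and its ASR follows power-law-like growth as the number of samples increases. BoN Jailbreaking can also be combined with other black-box algorithms for stronger attacks. Overall, the work shows that language models are sensitive to small changes in their inputs and demonstrates the potential for cross-modal attacks.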

Link: https://arxiv.org/abs/2412.03556
Authors: John Hughes,Sara Price,Aengus Lynch,Rylan Schaeffer,Fazl Barez,Sanmi Koyejo,Henry Sleight,Erik Jones,Ethan Perez,Mrinank Sharma
Keywords: BoN Jailbreaking, Jailbreaking, frontier AI systems, BoN, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.

[NLP-2] Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models

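[Quick read]: This paper asks whether a model's intrinsic bias carries over to downstream tasks when pre-trained large language models (LLMs) are adapted through prompting for use in real-world decision systems. Its experiments show that intrinsic biases in pre-trained models are strongly correlated (rho = 0.94) with the same models' biases under zero-shot and few-shot prompting. The correlation remains strong even when the models are prompted to behave fairly or in a biased way (rho = 0.92), and when the length and stereotypical composition of the few-shot prompts are varied (rho = 0.97). These findings underline the importance of ensuring fairness in LLMs at the pre-training stage, especially when they are later used for downstream tasks via prompt adaptation.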

Link: https://arxiv.org/abs/2412.03537
Authors: Natalie Mackraz,Nivedha Sivakumar,Samira Khorshidi,Krishna Patel,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff
Keywords: Large language models, Large language, real-world decision systems, achieve task-specificity, task-specificity for deployment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of the fine-tuning adaptation strategy on model fairness to find that fairness in pre-trained masked language models have limited effect on the fairness of models when adapted using fine-tuning. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible, and compute-efficient way to deploy models in real-world systems. In contrast to previous works, we establish that intrinsic biases in pre-trained Mistral, Falcon and Llama models are strongly correlated (rho = 0.94) with biases when the same models are zero- and few-shot prompted, using a pronoun co-reference resolution task. Further, we find that bias transfer remains strongly correlated even when LLMs are specifically prompted to exhibit fair or biased behavior (rho = 0.92), and few-shot length and stereotypical composition are varied (rho = 0.97). Our findings highlight the importance of ensuring fairness in pre-trained LLMs, especially when they are later used to perform downstream tasks via prompt adaptation.
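
The abstract reports the strength of bias transfer as a correlation coefficient (rho) between intrinsic and prompted bias scores, but does not say which coefficient is used. As a rough illustration only, the sketch below assumes a rank correlation (Spearman's rho) and uses made-up per-model bias scores; the function names and the numbers are hypothetical and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical bias scores for five models (illustrative values only).
intrinsic_bias = [0.10, 0.25, 0.40, 0.55, 0.70]
prompted_bias = [0.12, 0.20, 0.45, 0.50, 0.80]
rho = spearman_rho(intrinsic_bias, prompted_bias)
```

A value of rho close to 1 means that models ranked as more biased before prompting are also ranked as more biased after prompting, which is the sense in which the abstract speaks of bias "transfer".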

[NLP-3] A Review on Scientific Knowledge Extraction using Large Language Models in Biomedical Sciences

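[Quick read]: This paper examines the key challenges in applying large language models (LLMs) to the biomedical domain, in particular how well they automate evidence synthesis and data extraction. Its main points are: 1) LLMs still fall short on hallucinations, contextual understanding, and generalization across diverse medical tasks; 2) unified benchmarks are needed to standardize evaluation and ensure reliability in real-world applications; 3) techniques such as retrieval-augmented generation (RAG) should be adopted to improve LLM performance in evidence synthesis. Through these measures, the paper aims to improve access to medical literature and facilitate meaningful discoveries in healthcare.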

Link: https://arxiv.org/abs/2412.03531
Authors: Gabriel Lino Garcia,João Renato Ribeiro Manesco,Pedro Henrique Paiola,Lucas Miranda,Maria Paola de Salvo,João Paulo Papa
Keywords: large language models, evidence synthesis, language models, rapid advancement, advancement of large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 1 table, 1 figure, conference paper

Abstract:The rapid advancement of large language models (LLMs) has opened new boundaries in the extraction and synthesis of medical knowledge, particularly within evidence synthesis. This paper reviews the state-of-the-art applications of LLMs in the biomedical domain, exploring their effectiveness in automating complex tasks such as evidence synthesis and data extraction from a biomedical corpus of documents. While LLMs demonstrate remarkable potential, significant challenges remain, including issues related to hallucinations, contextual understanding, and the ability to generalize across diverse medical tasks. We highlight critical gaps in the current research literature, particularly the need for unified benchmarks to standardize evaluations and ensure reliability in real-world applications. In addition, we propose directions for future research, emphasizing the integration of state-of-the-art techniques such as retrieval-augmented generation (RAG) to enhance LLM performance in evidence synthesis. By addressing these challenges and utilizing the strengths of LLMs, we aim to improve access to medical literature and facilitate meaningful discoveries in healthcare.

[NLP-4] FANAL – Financial Activity News Alerting Language Modeling Framework

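[Quick read]: This paper addresses real-time detection and classification of news events in financial markets, specifically how to assign market news accurately and promptly to twelve distinct financial categories. It introduces FANAL, a BERT-based framework that relies on several techniques: 1) silver-labeled data processed through XGBoost; 2) ORBERT, a BERT variant fine-tuned with Odds Ratio Preference Optimization (ORPO) to improve class-wise probability calibration and alignment with financial event relevance; 3) advanced fine-tuning techniques to optimize model performance. According to the paper, FANAL significantly outperforms large language models such as GPT-4o, Llama-3.1 8B, and Phi-3 in both accuracy and cost efficiency.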

Link: https://arxiv.org/abs/2412.03527
Authors: Urjitkumar Patel,Fang-Chun Yeh,Chinmay Gondhalekar,Hari Nalluri
Keywords: evolving financial sector, navigate unpredictable events, rapidly evolving financial, Odds Ratio BERT, Odds Ratio Preference
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for the IEEE International Workshop on Large Language Models for Finance, 2024. This is a preprint version

Abstract:In the rapidly evolving financial sector, the accurate and timely interpretation of market news is essential for stakeholders needing to navigate unpredictable events. This paper introduces FANAL (Financial Activity News Alerting Language Modeling Framework), a specialized BERT-based framework engineered for real-time financial event detection and analysis, categorizing news into twelve distinct financial categories. FANAL leverages silver-labeled data processed through XGBoost and employs advanced fine-tuning techniques, alongside ORBERT (Odds Ratio BERT), a novel variant of BERT fine-tuned with ORPO (Odds Ratio Preference Optimization) for superior class-wise probability calibration and alignment with financial event relevance. We evaluate FANAL’s performance against leading large language models, including GPT-4o, Llama-3.1 8B, and Phi-3, demonstrating its superior accuracy and cost efficiency. This framework sets a new standard for financial intelligence and responsiveness, significantly outstripping existing models in both performance and affordability.
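
The abstract does not describe how ORBERT applies ORPO to a BERT classifier, so the following is only a sketch of the odds-ratio term as it is usually written for ORPO: the negative log-sigmoid of the log odds ratio between a preferred and a dispreferred output. Treating the two outputs as the correct and an incorrect news category is an assumption made here for illustration, and all function names are hypothetical.

```python
import math


def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    return p / (1.0 - p)


def log_odds_ratio(p_chosen, p_rejected):
    """log(odds(p_chosen) / odds(p_rejected))."""
    return math.log(odds(p_chosen)) - math.log(odds(p_rejected))


def odds_ratio_loss(p_chosen, p_rejected):
    """-log(sigmoid(z)) with z the log odds ratio, written as log(1 + exp(-z))."""
    z = log_odds_ratio(p_chosen, p_rejected)
    return math.log1p(math.exp(-z))


# Assumed example: the model gives the correct category probability 0.9
# and a competing, incorrect category probability 0.1.
confident = odds_ratio_loss(0.9, 0.1)
undecided = odds_ratio_loss(0.5, 0.5)
```

The loss shrinks as the preferred output's odds grow relative to the dispreferred one, so `confident` is smaller than `undecided`. In ORPO this term is added to the ordinary supervised loss with a weighting factor.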

[NLP-5] KKLIP: Knowledge Distillation Exploiting K-means Clustering for Language-Image Pre-Training

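[Quick read]: This paper addresses the limited ability of CLIP's text and image encoders to extract detailed knowledge when aligning image and text information in multi-modal scenarios. It introduces KKLIP, a new knowledge distillation (KD) method that draws on Llama 2. KKLIP has three objectives: Text Embedding Distillation, Concept Learning, and Contrastive Learning. Text Embedding Distillation trains KKLIP's text encoder to emulate the teacher model, Llama 2; Concept Learning assigns a soft concept label to each image-text pair through offline k-means clustering so that KKLIP can learn from these labels; Contrastive Learning harmonizes the text and image embeddings, improving the quality of both encoders.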

Link: https://arxiv.org/abs/2412.03513
Authors: Kuei-Chun Kao
Keywords: Text Embedding Distillation, multi-modal scenarios, text, CLIP, Recently
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recently, CLIP has emerged as a valuable model for aligning image and text information in multi-modal scenarios. However, researchers have observed limitations in the ability of CLIP’s text and image encoders to extract detailed knowledge from caption-image pairs. In response, this paper introduces KKLIP, a novel approach designed to enhance the quality of CLIP by incorporating a new knowledge distillation (KD) method derived from Llama 2. Our method comprises three objectives: Text Embedding Distillation, Concept Learning, and Contrastive Learning. Firstly, Text Embedding Distillation involves training the KKLIP text encoder to emulate the teacher model, Llama 2. Secondly, Concept Learning assigns a soft concept label to each caption-image pair through offline k-means clustering of text information from Llama 2, allowing KKLIP to learn from these soft concept labels. Finally, Contrastive Learning harmonizes text and image embeddings. Our experimental results demonstrate that KKLIP enhances the quality of both text and image encoders.
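
To make the Concept Learning step more concrete, here is a minimal sketch of offline k-means followed by soft label assignment. The abstract does not say how the soft concept labels are derived from the clusters, so the softmax over negative squared distances, the temperature parameter, and the toy 2-D "embeddings" are all assumptions for illustration, not the paper's procedure.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def soft_concept_label(point, centroids, temperature=1.0):
    """Softmax over negative squared distances to each centroid (assumed form)."""
    logits = [-sq_dist(point, c) / temperature for c in centroids]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D stand-ins for teacher text embeddings (hypothetical data).
embeddings = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9]]
centroids = kmeans(embeddings, k=2)
label = soft_concept_label([0.1, 0.0], centroids)
```

Each soft label is a probability distribution over the k concepts, which a student model could then be trained to predict.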

[NLP-6] YT-30M: A multi-lingual multi-category dataset of YouTube comments

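[Quick read]: This paper targets the collection and analysis of large-scale multilingual comment data and introduces two YouTube comment datasets, YT-30M and YT-100K, both publicly released for further research. YT-30M contains more than 32 million comments, and YT-100K is a random sample of roughly 100K comments (108,694 according to the abstract) drawn from it. Each comment record includes the video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and YouTube channel category, providing a rich basis for analyzing multilingual comments.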

Link: https://arxiv.org/abs/2412.03465
Authors: Hridoy Sankar Dutta
Keywords: large-scale multilingual comment, introduces two large-scale, large-scale multilingual, paper introduces, multilingual comment datasets
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces two large-scale multilingual comment datasets, YT-30M (and YT-100K) from YouTube. The analysis in this paper is performed on a smaller sample (YT-100K) of YT-30M. Both the datasets: YT-30M (full) and YT-100K (randomly selected 100K sample from YT-30M) are publicly released for further research. YT-30M (YT-100K) contains 32236173 (108694) comments posted by YouTube channel that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commentor name, commentor channel ID, comment text, upvotes, original channel ID and category of the YouTube channel (e.g., ‘News Politics’, ‘Science Technology’, etc.).
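
The per-comment fields listed in the abstract map naturally onto a small record type. The sketch below assumes a CSV release with one column per field; the column names, the file layout, and the sample row are hypothetical, since the abstract does not describe the distribution format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Comment:
    # Field names are illustrative; they mirror the fields named in the abstract.
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    comment_text: str
    upvotes: int
    original_channel_id: str
    category: str


def load_comments(handle):
    """Read comment rows from a CSV file-like object with a header line."""
    rows = csv.DictReader(handle)
    return [Comment(**{**row, "upvotes": int(row["upvotes"])}) for row in rows]


# Made-up sample row, used only to show the shape of a record.
sample = io.StringIO(
    "video_id,comment_id,commenter_name,commenter_channel_id,"
    "comment_text,upvotes,original_channel_id,category\n"
    "vid001,cmt001,alice,chanA,Great explanation!,12,chanZ,Science Technology\n"
)
comments = load_comments(sample)
```

With records in this form, per-category or per-channel statistics reduce to simple grouping over `category` or `original_channel_id`.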

[NLP-7] RedStone: Curating General Code Math and QA Data for Large Language Models

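[Quick read]: This paper addresses the high cost of obtaining high-quality pre-training datasets for large language models (LLMs) and the difficulty of covering diverse domains. It introduces RedStone, a scalable data-processing pipeline that extracts and processes data from Common Crawl to build large and varied pre-training datasets. RedStone's flexibility lets it adapt easily to specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. By drawing on the breadth of Common Crawl, RedStone aims to improve LLM performance and generalization and opens new avenues for domain adaptation and knowledge discovery.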

Link: https://arxiv.org/abs/2412.03398
Authors: Yaoyao Chang,Lei Cui,Li Dong,Shaohan Huang,Yangyu Huang,Yupan Huang,Scarlett Li,Tengchao Lv,Shuming Ma,Qinzheng Sun,Wenhui Wang,Furu Wei,Ying Xin,Mao Yang,Qiufeng Yin,Xingxing Zhang
Keywords: Large Language Models, Pre-training Large Language, Common Crawl, Language Models, Large Language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at this https URL.

[NLP-8] DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles COLING2025

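[Quick read]: This paper addresses the one-to-many mapping from text to prosody, that is, how to generate diverse speech prosody in a reasonable and flexible way. It proposes DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance. The model hierarchically models speech prosodic features and controls different prosodic styles to guide prosody prediction. Experiments show that DiffStyleTTS outperforms all baselines in naturalness and synthesizes faster than three diffusion-based baselines, and that adjusting the guiding scale effectively controls the guidance intensity of the synthesized prosody.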

Link: https://arxiv.org/abs/2412.03388
Authors: Jiaxuan Liu,Zhaoci Liu,Yajun Hu,Yingying Gao,Shilei Zhang,Zhenhua Ling
Keywords: Human speech exhibits, speech exhibits rich, flexible prosodic variations, Human speech, exhibits rich
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: COLING 2025

Abstract:Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features, and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guiding scale, DiffStyleTTS effectively controls the guidance intensity of the synthetic prosody.
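
The abstract mentions an "improved" classifier-free guidance without giving its form, so the sketch below shows only the standard formulation in one common convention: the guided prediction moves from the unconditional output toward the conditional one by a factor called the guidance scale. The function name and the toy vectors are illustrative and do not come from the paper.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional predictions element-wise.

    A scale of 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 extrapolate past the conditional one.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-dimensional prosody feature (made-up values).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

unguided = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
conditional = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
amplified = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```

Raising the scale pushes the output further in the direction implied by the conditioning signal, which is the kind of "guidance intensity" control the abstract describes for prosodic style.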

[NLP-9] Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning

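[Quick read]: This paper addresses the limited linguistic diversity of large language model (LLM) outputs, which leads to homogenized viewpoints and perspectives and to the underrepresentation of specific demographic groups. It proposes Possibility Exploration Fine-Tuning (PEFT), a task-agnostic framework that enhances the text diversity of LLMs without increasing latency or computational cost. Given the same prompt, a model fine-tuned with PEFT can generate multiple diverse responses, each corresponding to a controllable possibility number. Experiments show that PEFT significantly increases the diversity of LLM outputs, as reflected in lower similarity between candidate responses, and notably reduces demographic bias in dialogue systems.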

Link: https://arxiv.org/abs/2412.03343
Authors: Long Mai,Julie Carson-Berndsen
Keywords: Large Language Models, Large Language, replicating human-like abilities, made significant strides, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Language Models (LLMs) have made significant strides in replicating human-like abilities, there are concerns about a reduction in the linguistic diversity of their outputs. This results in the homogenization of viewpoints and perspectives, as well as the underrepresentation of specific demographic groups. Although several fine-tuning and prompting techniques have been suggested to tackle the issue, they are often tailored to specific tasks or come with a substantial increase in computational cost and latency. This makes them challenging to apply to applications that demand very low latency, such as chatbots and virtual assistants. We propose Possibility Exploration Fine-Tuning (PEFT), a task-agnostic framework that enhances the text diversity of LLMs without increasing latency or computational cost. Given the same prompt, models fine-tuned with PEFT can simultaneously generate multiple diverse responses, each corresponding with a controllable possibility number. Experiments on dialogue and story generation tasks demonstrate that PEFT significantly enhances the diversity of LLM outputs, as evidenced by lower similarity between candidate responses. Since PEFT emphasizes semantic diversity over lexical diversity, it can also notably reduce demographic bias in dialogue systems. The implementations and datasets are available in our repository: this https URL

[NLP-10] Yankari: A Monolingual Yoruba Dataset

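[Quick read]: This paper addresses the shortage of natural language processing (NLP) resources for Yoruba. It presents Yankari, a large-scale monolingual dataset of 51,407 documents from 13 diverse sources, totaling over 30 million tokens. The paper details how the dataset was built, including careful source selection, automated quality control, and rigorous data cleaning, and emphasizes the ethical principles followed during collection, avoiding problematic sources and addressing issues common in existing datasets. Through thorough automated evaluation, it demonstrates Yankari's quality relative to existing resources, providing a foundation for more accurate NLP models, comparative linguistic studies, and better digital accessibility for Yoruba.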

Link: https://arxiv.org/abs/2412.03334
Authors: Maro Akpobi
Keywords: important West African, Natural Language Processing, West African language, West African, paper presents Yankari
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:This paper presents Yankari, a large-scale monolingual dataset for the Yoruba language, aimed at addressing the critical gap in Natural Language Processing (NLP) resources for this important West African language. Despite being spoken by over 30 million people, Yoruba has been severely underrepresented in NLP research and applications. We detail our methodology for creating this dataset, which includes careful source selection, automated quality control, and rigorous data cleaning processes. The Yankari dataset comprises 51,407 documents from 13 diverse sources, totaling over 30 million tokens. Our approach focuses on ethical data collection practices, avoiding problematic sources and addressing issues prevalent in existing datasets. We provide thorough automated evaluations of the dataset, demonstrating its quality compared to existing resources. The Yankari dataset represents a significant advancement in Yoruba language resources, providing a foundation for developing more accurate NLP models, supporting comparative linguistic studies, and contributing to the digital accessibility of the Yoruba language.

[NLP-11] LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings COLING2025

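[Quick read]: This paper addresses the poor performance of sentence embedding models on low-resource languages such as Luxembourgish, which stems from a lack of parallel data. It compiles a high-quality, human-generated cross-lingual parallel dataset and uses it to train LuxEmbedder, an enhanced sentence embedding model with strong cross-lingual capabilities. The paper also presents evidence that including low-resource languages in parallel training datasets can benefit other low-resource languages more than relying solely on high-resource language pairs. To encourage further research, it also creates a paraphrase detection benchmark for Luxembourgish, partially filling the gap in sentence embedding benchmarks for low-resource languages.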

Link: https://arxiv.org/abs/2412.03331
Authors: Fred Philippy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé
Keywords: Natural Language Processing, Topic Modeling, Document Clustering, Recommendation Systems, Clustering and Recommendation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at COLING 2025

Abstract:Sentence embedding models play a key role in various Natural Language Processing tasks, such as in Topic Modeling, Document Clustering and Recommendation Systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.

[NLP-12] Grounded Language Design for Lightweight Diagramming for Formal Methods

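[Quick read]: This paper addresses a shortcoming of model-finding tools (SAT solvers, Alloy, and the like): their default visualizers know nothing about the domain, so the diagrams they produce are unhelpful and can even violate cognitive and presentational principles. The key is a lightweight diagramming language that captures essential domain information, with a design grounded in the cognitive science literature on diagrams and in a large number of example custom visualizations. The paper distills a small set of orthogonal primitives and extends an Alloy-like tool to support them, producing diagrams that are more effective and consistent with cognitive principles. Its evaluation finds the diagrams good for reasoning, and a comparison with other drawing languages and tools shows that the work defines a new niche that is lightweight, effective, and driven by sound principles.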

Link: https://arxiv.org/abs/2412.03310
Authors: Siddhartha Prasad,Ben Greenman,Tim Nelson,Shriram Krishnamurthi
Keywords: Alloy target SAT, SAT solvers, solvers and similar, embedding settings, embodied by SAT
Categories: Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments:

Abstract:Model finding, as embodied by SAT solvers and similar tools, is used widely, both in embedding settings and as a tool in its own right. For instance, tools like Alloy target SAT to enable users to incrementally define, explore, verify, and diagnose sophisticated specifications for a large number of complex systems. These tools critically include a visualizer that lets users graphically explore these generated models. As we show, however, default visualizers, which know nothing about the domain, are unhelpful and even actively violate presentational and cognitive principles. At the other extreme, full-blown visualizations require significant effort as well as knowledge a specifier might not possess; they can also exhibit bad failure modes (including silent failure). Instead, we need a language to capture essential domain information for lightweight diagramming. We ground our language design in both the cognitive science literature on diagrams and on a large number of example custom visualizations. This identifies the key elements of lightweight diagrams. We distill these into a small set of orthogonal primitives. We extend an Alloy-like tool to support these primitives. We evaluate the effectiveness of the produced diagrams, finding them good for reasoning. We then compare this against many other drawing languages and tools to show that this work defines a new niche that is lightweight, effective, and driven by sound principles.

[NLP-13] Typologie des comportements utilisateurs : etude exploratoire des sessions de recherche complexe sur le Web

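[Quick read]: This paper addresses how to classify user behavior during Web search sessions. It proposes a typology based on generic information retrieval variables (such as the number of queries) and on the study of topics (propositions with distinct semantic content defined from the search statement). Using experimental data from 70 users performing the same task and a multidimensional analysis, the paper proposes a five-class typology of user behavior based on individual behaviors during a complex search task.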

Link: https://arxiv.org/abs/2412.03309
Authors: Claire Ibarboure(CLLE),Ludovic Tanguy(CLLE),Franck Amadieu(CLLE)
Keywords: Web search session, exploratory approach aiming, Web search, exploratory approach, approach aiming
Categories: Computation and Language (cs.CL)
Comments: in French language, CORIA (COnférence en Recherche d’Information et Applications), 2024, La Rochelle, France

Abstract:In this study, we propose an exploratory approach aiming at a typology of user behaviour during a Web search session. We describe a typology based on generic IR variables (e.g. number of queries), but also on the study of topic (propositions with distinct semantic content defined from the search statement). To this end, we gathered experimental data enabling us to study variations across users (N=70) for the same task. We performed a multidimensional analysis and propose a 5 classes typology based on the individual behaviours during the processing of a complex search task.

[NLP-14] Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

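[Quick read]: This paper addresses cultural bias in multilingual datasets, which undermines their effectiveness as global benchmarks, along with translation artifacts that distort the meaning or clarity of questions in the target language. It releases Global-MMLU, an improved MMLU covering 42 languages, for which compensated professional and community annotators verified translation quality and the cultural biases in the original dataset were rigorously evaluated. Global-MMLU also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.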

Link: https://arxiv.org/abs/2412.03304
Authors: Shivalika Singh,Angelika Romanou,Clémentine Fourrier,David I. Adelani,Jian Gang Ngui,Daniel Vila-Suero,Peerat Limkonchotiwat,Kelly Marchisio,Wei Qi Leong,Yosephine Susanto,Raymond Ng,Shayne Longpre,Wei-Yin Ko,Madeline Smith,Antoine Bosselut,Alice Oh,Andre F. T. Martins,Leshem Choshen,Daphne Ippolito,Enzo Ferrante,Marzieh Fadaee,Beyza Ermis,Sara Hooker
Keywords: datasets pose significant, pose significant challenges, global benchmarks, Cultural biases, pose significant
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning or clarity of questions in the target language. A common practice in multilingual evaluation is to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to address these challenges. In this work, we trace the impact of both of these issues on multilingual evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open and proprietary models illustrates that progress on MMLU depends heavily on learning Western-centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions. Rankings of model evaluations change depending on whether they are evaluated on the full portion or the subset of questions annotated as culturally sensitive, showing the distortion to model rankings when blindly relying on translated MMLU. We release Global-MMLU, an improved MMLU with evaluation coverage across 42 languages – with improved overall quality by engaging with compensated professional and community annotators to verify translation quality while also rigorously evaluating cultural biases present in the original dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as culturally sensitive and culturally agnostic to allow for more holistic, complete evaluation.

[NLP-15] AntLM: Bridging Causal and Masked Language Models CONLL

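[Quick read]: This paper addresses the complementary strengths and weaknesses of Causal Language Modeling (CLM) and Masked Language Modeling (MLM) on downstream tasks. It proposes AntLM, a new language modeling paradigm that combines the advantages of both by alternating between CLM and MLM training objectives and between causal and bidirectional attention masks. Experiments show that the combination improves overall training performance: under the same number of epochs, AntLM_BabyLlama improves Macro-average by 1% and AntLM_LTG-BERT by 2.2% over their baselines.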

Link: https://arxiv.org/abs/2412.03275
Authors: Xinru Yu,Bin Guo,Shiwei Luo,Jie Wang,Tao Ji,Yuanbin Wu
Keywords: Masked Language Modeling, Decoder-only and Encoder-only, Transformer networks, specifically the Decoder-only, Encoder-only architectures
Categories: Computation and Language (cs.CL)
Comments: CoNLL Shared Task BabyLM Challenge

Abstract:Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two mainstream learning paradigms based on Transformer networks, specifically the Decoder-only and Encoder-only architectures. The strengths of each paradigm in downstream tasks have shown a mix of advantages and disadvantages. In the past BabyLM Challenge 2023, although the MLM paradigm achieved the best average performance, the CLM paradigm demonstrated significantly faster convergence rates. For the BabyLM Challenge 2024, we propose a novel language modeling paradigm named AntLM, which integrates both CLM and MLM to leverage the advantages of these two classic paradigms. We chose the strict-small track and conducted experiments on two foundation models: BabyLlama, representing CLM, and LTG-BERT, representing MLM. During the training process for specific foundation models, we alternate between applying CLM or MLM training objectives and causal or bidirectional attention masks. Experimental results show that combining the two pretraining objectives leverages their strengths, enhancing overall training performance. Under the same epochs, AntLM_BabyLlama improves Macro-average by 1%, and AntLM_LTG-BERT achieves a 2.2% increase over the baselines.
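
The alternation described in the abstract pairs each training objective with its attention mask: causal for CLM, bidirectional for MLM. The sketch below shows the two mask shapes and a simple epoch-level schedule. The abstract does not state how often the objectives alternate, so the schedule, its parameters, and all function names are assumptions for illustration.

```python
def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def objective_for_epoch(epoch, clm_epochs=1, mlm_epochs=1):
    """Hypothetical schedule: clm_epochs of CLM, then mlm_epochs of MLM, repeating."""
    position = epoch % (clm_epochs + mlm_epochs)
    return "clm" if position < clm_epochs else "mlm"


def mask_for(objective, n):
    """Pick the attention mask that matches the training objective."""
    return causal_mask(n) if objective == "clm" else bidirectional_mask(n)


schedule = [objective_for_epoch(epoch) for epoch in range(4)]
first_mask = mask_for(schedule[0], 3)
second_mask = mask_for(schedule[1], 3)
```

Here the same model weights would be trained under both masks in turn; only the objective and the mask change from one phase to the next.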

[NLP-16] Intent-driven In-context Learning for Few-shot Dialogue State Tracking

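[Quick read]: This paper addresses two problems in Dialogue State Tracking (DST): user input may contain implicit information, and DST datasets are expensive to construct. It proposes Intent-driven In-context Learning for Few-shot DST (IDIC-DST). By extracting the user's intent, an Intent-driven Dialogue Information Augmentation module augments the dialogue information so that dialogue states can be tracked more effectively. An Intent-driven Examples Retrieval module masks noisy information, rewrites the user's input, and retrieves similar examples, and a pre-trained large language model then updates the dialogue state. Experiments show that IDIC-DST achieves state-of-the-art performance in few-shot settings on the MultiWOZ 2.1 and MultiWOZ 2.4 datasets.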

Link: https://arxiv.org/abs/2412.03270
Authors: Zihao Yi,Zhe Xu,Ying Shen
Keywords: task-oriented dialogue systems, DST, plays an essential, essential role, role in task-oriented
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dialogue state tracking (DST) plays an essential role in task-oriented dialogue systems. However, user’s input may contain implicit information, posing significant challenges for DST tasks. Additionally, DST data includes complex information, which not only contains a large amount of noise unrelated to the current turn, but also makes constructing DST datasets expensive. To address these challenges, we introduce Intent-driven In-context Learning for Few-shot DST (IDIC-DST). By extracting user’s intent, we propose an Intent-driven Dialogue Information Augmentation module to augment the dialogue information, which can track dialogue states more effectively. Moreover, we mask noisy information from DST data and rewrite user’s input in the Intent-driven Examples Retrieval module, where we retrieve similar examples. We then utilize a pre-trained large language model to update the dialogue state using the augmented dialogue information and examples. Experimental results demonstrate that IDIC-DST achieves state-of-the-art performance in few-shot settings on MultiWOZ 2.1 and MultiWOZ 2.4 datasets.

[NLP-17] Alignment at Pre-training! Towards Native Alignment for Arabic LLMs NEURIPS2024

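【Quick Read】: This paper addresses alignment of large language models (LLMs) at the pre-training stage: how "native alignment" can make a model aligned during pre-training itself, rather than relying on "post alignment" in the later instruction-tuning or reinforcement-learning stages. The key is to use extensively aligned pre-training data to prevent unaligned content from the start, improving the effectiveness and usability of the pre-trained model. The paper focuses on Arabic LLMs and uses experiments and ablation studies to evaluate the effect of native alignment on model performance and alignment stability.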

Link: https://arxiv.org/abs/2412.03253
Authors: Juhao Liang,Zhenyang Cai,Jianqing Zhu,Huang Huang,Kewei Zong,Bang An,Mosen Alharthi,Juncai He,Lian Zhang,Haizhou Li,Benyou Wang,Jinchao Xu
Keywords: large language models, safe language models, large language, safe language, language models
Categories: Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2024 main conference. see this https URL

Abstract:The alignment of large language models (LLMs) is critical for developing effective and safe language models. Traditional approaches focus on aligning models during the instruction tuning or reinforcement learning stages, referred to in this paper as 'post alignment'. We argue that alignment during the pre-training phase, which we term 'native alignment', warrants investigation. Native alignment aims to prevent unaligned content from the beginning, rather than relying on post-hoc processing. This approach leverages extensively aligned pre-training data to enhance the effectiveness and usability of pre-trained models. Our study specifically explores the application of native alignment in the context of Arabic LLMs. We conduct comprehensive experiments and ablation studies to evaluate the impact of native alignment on model performance and alignment stability. Additionally, we release open-source Arabic LLMs that demonstrate state-of-the-art performance on various benchmarks, providing significant benefits to the Arabic LLM community.

[NLP-18] AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

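【Quick Read】: This paper addresses the high computational demands of multi-modal LLMs when processing visual data, which limit their use in resource-constrained environments and long-context tasks. The key is a training-free adaptive inference method with two main steps: a) iterative token merging based on embedding similarity before the LLM; b) progressive token pruning within the LLM layers based on multi-modal importance. With a minimalist design, the method substantially reduces computation (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. At a similar computational cost, it also outperforms state-of-the-art methods on long video understanding (e.g., +4.6 on MLVU).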

Link: https://arxiv.org/abs/2412.03248
Authors: Yiwu Zhong,Zhuoming Liu,Yin Li,Liwei Wang
Keywords: Large language models, exhibit strong comprehension, Large language, LLMs, enabled the creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 2 figures

Abstract:Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs. Our code will be available at this https URL.
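
The two steps named in the abstract, similarity-based merging and importance-based pruning, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's algorithm: it merges only adjacent tokens by averaging, it takes the importance scores as given, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, target_len):
    """Repeatedly average the most similar adjacent pair of embeddings
    until at most target_len tokens remain (assumed merging rule)."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target_len, 1):
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        tokens = tokens[:i] + [merged] + tokens[i + 2:]
    return tokens


def prune_tokens(tokens, importance, keep):
    """Keep the `keep` highest-importance tokens, preserving their order."""
    top = sorted(range(len(tokens)), key=lambda k: importance[k], reverse=True)[:keep]
    return [tokens[k] for k in sorted(top)]


visual_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(merge_tokens(visual_tokens, 2))
print(prune_tokens(["a", "b", "c"], [0.1, 0.9, 0.5], keep=2))
```

Merging happens once before the language model, while pruning would be repeated at successive layers with a shrinking `keep` budget.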

[NLP-19] Benchmarking terminology building capabilities of ChatGPT on an English-Russian Fashion Corpus

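【Quick Read】: This paper addresses how to extract and define fashion-domain terms efficiently and accurately in a multilingual setting. The key is a comparison of three tools (SketchEngine, TBXTools, and ChatGPT) on term extraction, together with an evaluation of the definitions generated by ChatGPT. Using a corpus of English and Russian fashion magazines and a gold standard built for evaluation, the study finds that ChatGPT achieves superior precision in term extraction, maintaining or improving precision as more terms are considered. Its definitions keep a reasonable level of accuracy and cross-lingual fidelity, but sometimes miss crucial specifics or include unnecessary deviations. The results indicate that each tool has its own strengths, suited to different term extraction and application scenarios.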

Link: https://arxiv.org/abs/2412.03242
Authors: Anastasiia Bezobrazova,Miriam Seghiri,Constantin Orasan
Keywords: English and Russian, paper compares, definitions produced, terms, definitions
Categories: Computation and Language (cs.CL)
Comments: To appear in the Proceedings of Translating and the Computer 2024 (TC46)

Abstract:This paper compares the accuracy of the terms extracted using SketchEngine, TBXTools and ChatGPT. In addition, it evaluates the quality of the definitions produced by ChatGPT for these terms. The research is carried out on a comparable corpus of fashion magazines written in English and Russian collected from the web. A gold standard for the fashion terminology was also developed by identifying web pages that can be harvested automatically and contain definitions of terms from the fashion domain in English and Russian. This gold standard was used to evaluate the quality of the extracted terms and of the definitions produced. Our evaluation shows that TBXTools and SketchEngine, while capable of high recall, suffer from reduced precision as the number of terms increases, which affects their overall performance. Conversely, ChatGPT demonstrates superior performance, maintaining or improving precision as more terms are considered. Analysis of the definitions produced by ChatGPT for 60 commonly used terms in English and Russian shows that ChatGPT maintains a reasonable level of accuracy and fidelity across languages, but sometimes the definitions in both languages miss crucial specifics and include unnecessary deviations. Our research reveals that no single tool excels universally; each has strengths suited to particular aspects of terminology extraction and application.

[NLP-20] Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? NEURIPS2024

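【Quick Read】: This paper evaluates whether safety fine-tuned large language models (LLMs) remain safe on natural prompts, in particular natural prompts that are semantically related to toxic seed prompts. The key is Response Guided Question Augmentation (ReG-QA): an unaligned LLM first generates toxic answers to a seed question, and an LLM then generates questions likely to produce those answers, systematically yielding natural prompts that can jailbreak aligned LLMs. Experiments show that even safety fine-tuned LLMs such as GPT-4 are vulnerable to these naturally generated prompts, and that the method matches or exceeds existing adversarial attacks in attack success rate while being more stable against defenses.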

Link: https://arxiv.org/abs/2412.03235
Authors: Sravanti Addepalli,Yerram Varun,Arun Suggala,Karthikeyan Shanmugam,Prateek Jain
Keywords: Large Language Models, Large Language, Language Models, safety fine-tuned LLMs, aligned LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the Safe Generative AI Workshop @ NeurIPS 2024

Abstract:Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method of Response Guided Question Augmentation (ReG-QA) to evaluate the generalization of safety aligned LLMs to natural prompts, that first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.

[NLP-21] PERL: Pinyin Enhanced Rephrasing Language Model for Chinese ASR N-best Error Correction

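【Quick Read】: This paper addresses the fact that Chinese ASR error correction methods have not made effective use of Pinyin information. The key is the Pinyin Enhanced Rephrasing Language Model (PERL), designed for N-best correction scenarios, together with a length predictor module that handles the variable-length problem. Experiments show a 29.11% reduction in Character Error Rate (CER) on Aishell-1 and about a 70% CER reduction on domain-specific datasets; by using Pinyin similarity at the token level, the method outperforms the baselines.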

Link: https://arxiv.org/abs/2412.03230
Authors: Junhong Liang
Keywords: utilized Pinyin information, effectively utilized Pinyin, Rephrasing Language Model, Enhanced Rephrasing Language, Pinyin Enhanced Rephrasing
Categories: Computation and Language (cs.CL)
Comments: 2 figures, 6 tables

Abstract:ASR correction methods have predominantly focused on general datasets and have not effectively utilized Pinyin information, unique to the Chinese language. In this study, we address this gap by proposing a Pinyin Enhanced Rephrasing Language Model (PERL), specifically designed for N-best correction scenarios. Additionally, we implement a length predictor module to address the variable-length problem. We conduct experiments on the Aishell-1 dataset and our newly proposed DoAD dataset. The results show that our approach outperforms baseline methods, achieving a 29.11% reduction in Character Error Rate (CER) on Aishell-1 and around 70% CER reduction on domain-specific datasets. Furthermore, our approach leverages Pinyin similarity at the token level, providing an advantage over baselines and leading to superior performance.
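
For readers unfamiliar with the metric, Character Error Rate is the character-level edit distance between a reference and a hypothesis, divided by the reference length. The sketch below shows the standard definition; the sample strings are made up, and the abstract does not say whether its reported reductions are relative or absolute, so `relative_reduction` is only one possible reading.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, corrected_cer):
    """Fraction of the baseline error removed by correction (assumed reading)."""
    return (baseline_cer - corrected_cer) / baseline_cer


# Hypothetical example: one wrong character out of six.
print(cer("今天天气很好", "今天天汽很好"))
```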

[NLP-22] Linq-Embed-Mistral Technical Report

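【Quick Read】: This report aims to improve text retrieval performance through advanced data refinement techniques. The key is Linq-Embed-Mistral, built on the E5-mistral and Mistral-7B-v0.1 models, which applies careful data crafting, data filtering, and negative mining, each highly tailored per task, to both existing benchmark datasets and synthetic datasets generated by large language models (LLMs). Linq-Embed-Mistral achieves an average score of 68.2 on the MTEB benchmark (as of May 29, 2024) and ranks first for retrieval tasks on the MTEB leaderboard with a score of 60.2. The report also proposes homogeneous task ordering and mixed task fine-tuning to improve generalization and stability, and streamlines evaluation with 4-bit precision and a light retrieval evaluation set, which speeds up validation without sacrificing accuracy.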

Link: https://arxiv.org/abs/2412.03223
Authors: Chanyeol Choi,Junseong Kim,Seolhwa Lee,Jihoon Kwon,Sangmo Gu,Yejin Kim,Minkyung Cho,Jy-yong Sohn
Keywords: advanced data refinement, report explores, explores the enhancement, enhancement of text, text retrieval performance
Categories: Computation and Language (cs.CL)
Comments: 15 pages

Abstract:This report explores the enhancement of text retrieval performance using advanced data refinement techniques. We develop Linq-Embed-Mistral (available at this https URL) by building on the E5-mistral and Mistral-7B-v0.1 models, focusing on sophisticated data crafting, data filtering, and negative mining methods, which are highly tailored to each task, applied to both existing benchmark dataset and highly tailored synthetic dataset generated via large language models (LLMs). Linq-Embed-Mistral excels in the MTEB benchmarks (as of May 29, 2024), achieving an average score of 68.2 across 56 datasets, and ranks 1st among all models for retrieval tasks on the MTEB leaderboard with a performance score of 60.2. This performance underscores its superior capability in enhancing search precision and reliability. Our contributions include advanced data refinement methods that significantly improve model performance on benchmark and synthetic datasets, techniques for homogeneous task ordering and mixed task fine-tuning to enhance model generalization and stability, and a streamlined evaluation process using 4-bit precision and a light retrieval evaluation set, which accelerates validation without sacrificing accuracy.

[NLP-23] U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

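【Quick Read】: This paper addresses limitations in how the mathematical skills of LLMs are currently evaluated. Existing benchmarks are either small, focused mainly on elementary and high-school problems, or lacking in topic diversity, and visual elements in tasks remain under-explored. The paper introduces U-MATH, a new benchmark of 1,100 unpublished open-ended university-level problems, balanced across six core subjects, with 20% multimodal problems. Because the problems are open-ended, an LLM is used to judge the correctness of generated solutions, and the μ-MATH dataset is released to evaluate LLMs as judges. LLMs reach a maximum accuracy of only 63% on text-based tasks and 45% on visual problems, and the best LLM judge reaches an F1-score of 80% on μ-MATH.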

Link: https://arxiv.org/abs/2412.03205
Authors: Konstantin Chernyshev,Vitaliy Polshkov,Ekaterina Artemova,Alex Myasnikov,Vlad Stepanov,Alexei Miasnikov,Sergei Tilga
Keywords: diversity in topics, mathematical skills, elementary and high-school, lack diversity, primarily focus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of multimodal problems. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release μ-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on μ-MATH.

[NLP-24] Weighted-Reward Preference Optimization for Implicit Model Fusion

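【Quick Read】: This paper addresses the difficulties of fusing heterogeneous open-source large language models (LLMs), such as vocabulary alignment and merging distribution matrices, which are complex and prone to introducing noise and errors. The key is an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which uses preference optimization between the source LLMs and the target LLM to transfer capabilities effectively, without vocabulary alignment or matrix fusion, and which scales efficiently to various LLMs. To handle distributional deviations between source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Experiments show that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines on several benchmarks.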

Link: https://arxiv.org/abs/2412.03187
Authors: Ziyi Yang,Fanqi Wan,Longguang Zhong,Tianyuan Shi,Xiaojun Quan
Keywords: face significant challenges, merging distribution matrices, fusing heterogeneous open-source, heterogeneous open-source LLMs, methods face significant
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at this https URL.

[NLP-25] Automatic detection of diseases in Spanish clinical notes combining medical language models and ontologies

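【Quick Read】: This paper addresses the automatic detection of dermatological pathologies, specifically identifying from medical reports the skin pathology a patient may have. The key is combining a large language model with medical ontologies and training the model to learn the type, severity, and body location of a dermatological pathology, as well as the order in which these features are learned, which significantly increases classification accuracy. The method reports state-of-the-art results for medical text classification, with a precision of 0.84 and micro and macro F1-scores of 0.82 and 0.75.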

Link: https://arxiv.org/abs/2412.03176
Authors: Leon-Paul Schaub Torre,Pelayo Quiros,Helena Garcia Mieres
Keywords: follow-up medical report, automatic detection, medical reports, dermatological pathologies, hybrid method
Categories: Computation and Language (cs.CL)
Comments: Translation of SEPLN 2024 es paper

Abstract:In this paper we present a hybrid method for the automatic detection of dermatological pathologies in medical reports. We use a large language model combined with medical ontologies to predict, given a first appointment or follow-up medical report, the pathology a person may suffer from. The results show that teaching the model to learn the type, severity and location on the body of a dermatological pathology, as well as in which order it has to learn these three features, significantly increases its accuracy. The article presents the demonstration of state-of-the-art results for classification of medical texts with a precision of 0.84, micro and macro F1-score of 0.82 and 0.75, and makes both the method and the data set used available to the community.
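
The abstract reports both micro and macro F1, which can differ noticeably when classes are imbalanced. The sketch below computes both for single-label classification using the standard definitions; the labels are invented for illustration and have nothing to do with the paper's data.

```python
def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t, p_count, n_count):
        denom = 2 * t + p_count + n_count
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 values, so rare classes count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical labels: "a" is frequent, "b" is rare.
print(f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```

A macro score below the micro score, as in the abstract, usually indicates weaker performance on less frequent classes.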

[NLP-26] Byte BPE Tokenization as an Inverse string Homomorphism

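【Quick Read】: This paper seeks to understand how tokenization affects large language models (LLMs) during training and inference. The key is showing that tokenization acts as an inverse homomorphism between strings and tokens: the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. The paper also explores "proper tokenization", meaning an unambiguous tokenization returned by the tokenizer. The analysis shows that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.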

Link: https://arxiv.org/abs/2412.03160
Authors: Saibo Geng,Sankalp Gambhir,Chris Wendler,Robert West
Keywords: important preprocessing step, large language models, important preprocessing, preprocessing step, training and inference
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of tokenization has not been well understood. In this work, we demonstrate that tokenization, irrespective of the algorithm used, acts as an inverse homomorphism between strings and tokens. This suggests that the character space of the source language and the token space of the tokenized language are homomorphic, preserving the structural properties of the source language. Additionally, we explore the concept of proper tokenization, which refers to an unambiguous tokenization returned from the tokenizer. Our analysis reveals that the expressiveness of neural architectures in recognizing context-free languages is not affected by tokenization.
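
The homomorphism idea can be made concrete with a toy example: decoding respects concatenation, and tokenizing a string means picking an element of the inverse image of that string under decoding. The vocabulary below is hypothetical and the code is an illustration of the general notion, not the paper's formal construction.

```python
# Hypothetical toy vocabulary: token id -> string piece.
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "ba"}


def decode(token_ids):
    """Detokenization: map a token sequence to the string it spells."""
    return "".join(VOCAB[t] for t in token_ids)


def all_tokenizations(text):
    """Enumerate every token sequence that decodes to `text`,
    i.e. the inverse image of `text` under `decode`."""
    if not text:
        return [[]]
    results = []
    for token_id, piece in VOCAB.items():
        if text.startswith(piece):
            for rest in all_tokenizations(text[len(piece):]):
                results.append([token_id] + rest)
    return results


# Homomorphism property: decoding a concatenation equals
# concatenating the decodings.
assert decode([0, 1] + [3]) == decode([0, 1]) + decode([3])

# "ab" has two preimages; a tokenizer returns exactly one of them.
print(all_tokenizations("ab"))
```

A deterministic tokenizer such as BPE selects a single sequence from this set for each input string.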

[NLP-27] Multi-Level Correlation Network For Few-Shot Image Classification

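【Quick Read】: This paper addresses a weakness in few-shot image classification (FSIC): a measure computed only at the image feature level may not be enough to generalize from base classes to novel classes. The key is a Multi-Level Correlation Network (MLCN) that improves classification by effectively capturing local information. Specifically, the paper proposes a self-correlation module and a cross-correlation module to learn the semantic correspondence of local information based on learned representations, and a pattern-correlation module to capture the patterns of fine-grained images and find relevant structural patterns between base and novel classes. Experiments show that the method is effective on four widely used FSIC benchmarks.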

Link: https://arxiv.org/abs/2412.03159
Authors: Yunkai Dang,Min Zhang,Zhengyu Chen,Xinliang Zhang,Zheng Wang,Meijun Sun,Donglin Wang
Keywords: Few-shot image classification, Few-shot image, aims to recognize, Few-shot, classes
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Few-shot image classification (FSIC) aims to recognize novel classes given few labeled images from base classes. Recent works have achieved promising classification performance, especially for metric-learning methods, where a measure at only image feature level is usually used. In this paper, we argue that a measure at such a level may not be effective enough to generalize from base to novel classes when using only a few images. Instead, a multi-level descriptor of an image is taken into consideration in this paper. We propose a multi-level correlation network (MLCN) for FSIC to tackle this problem by effectively capturing local information. Concretely, we present the self-correlation module and cross-correlation module to learn the semantic correspondence relation of local information based on learned representations. Moreover, we propose a pattern-correlation module to capture the pattern of fine-grained images and find relevant structural patterns between base classes and novel classes. Extensive experiments and analysis show the effectiveness of our proposed method on four widely-used FSIC benchmarks. The code for our approach is available at: this https URL.

[NLP-28] A Measure of the System Dependence of Automated Metrics

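【Quick Read】: This paper addresses whether automated metrics for machine translation evaluate systems fairly and consistently. Although these metrics have made significant progress as measured by correlation with human judgments, the paper argues that it is equally important for them to treat all systems fairly and consistently. The key is a new method for evaluating this aspect of a metric across systems.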

Link: https://arxiv.org/abs/2412.03152
Authors: Pius von Däniken,Jan Deriu,Mark Cieliebak
Keywords: made significant progress, Machine Translation, time-consuming human evaluations, Translation have made, significant progress
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.

[NLP-29] Fine-Grained Behavior Simulation with Role-Playing Large Language Model on Social Media

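【Quick Read】: This paper asks whether large language models (LLMs) can accurately simulate user behavior in real-world scenarios such as social media. The key is FineRob, a fine-grained behavior simulation dataset that collects the complete behavioral history of 1,866 users across three social media platforms and decomposes each behavior into three fine-grained elements (object, type, and content), yielding 78.6k QA records. Based on FineRob, the paper identifies two dominant reasoning patterns in LLMs' behavior simulation and proposes the OM-CoT fine-tuning method to enhance this capability. Comprehensive experiments analyze the key factors of behavior simulation and demonstrate the effectiveness of OM-CoT.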

Link: https://arxiv.org/abs/2412.03148
Authors: Kun Li,Chenwei Dai,Wei Zhou,Songlin Hu
Keywords: Large language models, demonstrated impressive capabilities, Large language, role-playing tasks, demonstrated impressive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in role-playing tasks. However, there is limited research on whether LLMs can accurately simulate user behavior in real-world scenarios, such as social media. This requires models to effectively analyze a user's history and simulate their role. In this paper, we introduce FineRob, a novel fine-grained behavior simulation dataset. We collect the complete behavioral history of 1,866 distinct users across three social media platforms. Each behavior is decomposed into three fine-grained elements: object, type, and content, resulting in 78.6k QA records. Based on FineRob, we identify two dominant reasoning patterns in LLMs' behavior simulation processes and propose the OM-CoT fine-tuning method to enhance the capability. Through comprehensive experiments, we conduct an in-depth analysis of key factors of behavior simulation and also demonstrate the effectiveness of the OM-CoT approach. Code and dataset are available at this https URL.

[NLP-30] A surprisal oracle for when every layer counts

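【Quick Read】: This paper addresses how to build a curriculum more effectively when training a language model. The key is Active Curriculum Language Modeling (ACLM), which constructs the curriculum iteratively and dynamically and uses a model of uncertainty to guide training. Specifically, ACLM prioritizes training items that are similarly uncertain to the least certain candidate item. The paper makes the similarity model more dynamic and applies ACLM to the ELC-BERT model for the BabyLM 2024 task. The resulting models outperform the official baselines on common-sense and world-knowledge tasks but underperform on fine-grained grammatical inferences.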

Link: https://arxiv.org/abs/2412.03098
Authors: Xudong Hong,Sharid Loáiciga,Asad Sayeed
Keywords: Active Curriculum Language, Curriculum Language Modeling, Language Modeling, learner directed approach, Active Curriculum
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner directed approach to training a language model. We proposed the original version of this process in our submission to the BabyLM 2023 task, and now we propose an updated ACLM process for the BabyLM 2024 task. ACLM involves an iteratively- and dynamically-constructed curriculum informed over the training process by a model of uncertainty; other training items that are similarly uncertain to a least certain candidate item are prioritized. Our new process improves the similarity model so that it is more dynamic, and we run ACLM over the most successful model from the BabyLM 2023 task: ELC-BERT (Charpentier and Samuel, 2023). We find that while our models underperform on fine-grained grammatical inferences, they outperform the BabyLM 2024 official baselines on common-sense and world-knowledge tasks. We make our code available at https://github.com/asayeed/ActiveBaby.
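
The selection rule in the abstract, prioritizing items that are similarly uncertain to the least certain candidate, can be sketched as follows. This is a deliberately reduced illustration: it assumes uncertainty is a single surprisal number per item, whereas the paper's similarity model is richer, and the function and item names are hypothetical.

```python
def select_batch(surprisals, k):
    """Pick the k items whose surprisal is closest to that of the
    least certain (highest-surprisal) item.

    surprisals: dict mapping item id -> surprisal under the current model
    (a scalar stand-in for the paper's uncertainty model).
    """
    anchor = max(surprisals, key=surprisals.get)
    ranked = sorted(
        surprisals,
        key=lambda item: abs(surprisals[item] - surprisals[anchor]),
    )
    return ranked[:k]


# Hypothetical surprisal values for four training sentences.
pool = {"s1": 2.0, "s2": 9.5, "s3": 9.0, "s4": 4.0}
print(select_batch(pool, k=2))
```

In an iterative curriculum, the surprisals would be recomputed after each training round so that the selected items change as the model learns.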

[NLP-31] TOOL-ED: Enhancing Empathetic Response Generation with the Tool Calling Capability of LLM

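【Quick Read】: This paper addresses the heavy reliance of large language models (LLMs) on fixed knowledge bases when generating empathetic responses, and the question of how to integrate external knowledge flexibly without introducing noise. The key is the Emotional Knowledge Tool Calling (EKTC) framework, which encapsulates commonsense knowledge bases as empathetic tools so that LLMs can integrate external knowledge flexibly through tool calling. To adapt models to the new task, the authors construct a new dataset, TOOL-ED, based on the EMPATHETIC DIALOGUE (ED) dataset, and show experimentally that EKTC effectively improves the ability of LLMs to generate empathetic responses.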

Link: https://arxiv.org/abs/2412.03096
Authors: Huiying Cao,Yiqun Zhang,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang
Keywords: Large Language models, daily conversations, Large Language, crucial characteristic, characteristic in daily
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Empathetic conversation is a crucial characteristic in daily conversations between individuals. Nowadays, Large Language models (LLMs) have shown outstanding performance in generating empathetic responses. Knowledge bases like COMET can assist LLMs in mitigating illusions and enhancing the understanding of users' intentions and emotions. However, models remain heavily reliant on fixed knowledge bases and unrestricted incorporation of external knowledge can introduce noise. Tool learning is a flexible end-to-end approach that assists LLMs in handling complex problems. In this paper, we propose the Emotional Knowledge Tool Calling (EKTC) framework, which encapsulates the commonsense knowledge bases as empathetic tools, enabling LLMs to integrate external knowledge flexibly through tool calling. In order to adapt the models to the new task, we construct a novel dataset TOOL-ED based on the EMPATHETIC DIALOGUE (ED) dataset. We validate EKTC on the ED dataset, and the experimental results demonstrate that our framework can enhance the ability of LLMs to generate empathetic responses effectively.

[NLP-32] Revolve: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

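【Quick Read】: This paper addresses the difficulty of optimizing LLM-based systems for specific tasks, which often requires manual intervention such as prompt engineering and hyperparameter tuning. The key is REVOLVE, an optimization method that tracks how responses evolve across iterations in LLM systems to achieve more stable and effective optimization. By making progressive adjustments at each step, REVOLVE copes better with situations where the system's response evolves slowly or unpredictably, improving performance in prompt optimization, solution refinement, and code optimization while converging in fewer iterations and saving computation.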

Link: https://arxiv.org/abs/2412.03092
Authors: Peiyan Zhang,Haibo Jin,Leyang Hu,Xinnuo Li,Liying Kang,Man Luo,Yangqiu Song,Haohan Wang
Keywords: large language models, natural language processing, perform complex tasks, Recent advancements, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 2 figures

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system’s response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how "R"esponses “EVOLVE” across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. These advantages highlight its adaptability and efficiency, positioning REVOLVE as a valuable tool for optimizing LLM-based systems and accelerating the development of next-generation AI technologies. Code is available at: this https URL.

[NLP-33] ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

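【Quick Read】: This paper addresses error correction for automatic speech recognition (ASR), specifically for Chinese. The key is applying large language models (LLMs) to error correction under three paradigms: prompting, finetuning, and multi-modal augmentation. Experiments show that prompting is not effective, finetuning is effective only for some LLMs, and multi-modal augmentation is the most effective method, achieving state-of-the-art performance.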

Link: https://arxiv.org/abs/2412.03075
Authors: Victor Junqiu Wei,Weicheng Wang,Di Jiang,Yuanfeng Song,Lu Wang
Keywords: Automatic speech Recognition, ASR error correction, ASR error, error correction, ASR
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Automatic speech Recognition (ASR) is a fundamental and important task in the field of speech and natural language processing. It is an inherent building block in many applications such as voice assistant, speech translation, etc. Despite the advancement of ASR technologies in recent years, it is still inevitable for modern ASR systems to have a substantial number of erroneous recognition due to environmental noise, ambiguity, etc. Therefore, the error correction in ASR is crucial. Motivated by this, this paper studies ASR error correction in the Chinese language, which is one of the most popular languages and enjoys a large number of users in the world. We first create a benchmark dataset named ASR-EC that contains a wide spectrum of ASR errors generated by industry-grade ASR systems. To the best of our knowledge, it is the first Chinese ASR error correction benchmark. Then, inspired by the recent advances in large language models (LLMs), we investigate how to harness the power of LLMs to correct ASR errors. We apply LLMs to ASR error correction in three paradigms. The first paradigm is prompting, which is further categorized as zero-shot, few-shot, and multi-step. The second paradigm is finetuning, which finetunes LLMs with ASR error correction data. The third paradigm is multi-modal augmentation, which collectively utilizes the audio and ASR transcripts for error correction. Extensive experiments reveal that prompting is not effective for ASR error correction. Finetuning is effective only for a portion of LLMs. Multi-modal augmentation is the most effective method for error correction and achieves state-of-the-art performance.
zh

[NLP-34] Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model

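[Quick Read]: This paper addresses the problem of extracting speech representations from speech data that has no paired text, which is crucial for augmenting speech synthesis datasets. The key to the solution is to synthesize speech from discrete symbol representations produced by a self-supervised learning (SSL) model instead of conventional text representations. A comparative analysis shows that text representations are better at preserving semantic information, while discrete symbol representations are better at preserving acoustic content, including prosodic and intonational information.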

Link: https://arxiv.org/abs/2412.03074
Authors: Joonyong Park,Daisuke Saito,Nobuaki Minematsu
Keywords: raw audio obtained, conventional text representations, text-free speech representations, self-supervised learning, representations
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: APSIPA ASC 2024

Abstract:We examine the text-free speech representations of raw audio obtained from a self-supervised learning (SSL) model by analyzing the synthesized speech using the SSL representations instead of conventional text representations. Since raw audio does not have paired speech representations as transcribed texts do, obtaining speech representations from unpaired speech is crucial for augmenting available datasets for speech synthesis. Specifically, the proposed speech synthesis is conducted using discrete symbol representations from the SSL model in comparison with text representations, and analytical examinations of the synthesized speech have been carried out. The results empirically show that using text representations is advantageous for preserving semantic information, while using discrete symbol representations is superior for preserving acoustic content, including prosodic and intonational information.

[NLP-35] Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models

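[Quick Read]: This paper asks how linguistic feature analysis can distinguish human-written texts from texts generated by large language models (LLMs). The key to the approach is profiling texts on 250 distinct linguistic features, computed automatically with the LFTK tool and supplemented by measurements of average syntactic depth, semantic similarity, and emotional content. After a two-dimensional principal component analysis (PCA) reduction of these features, the study finds that feature variability is considerably higher in human-written texts than in LLM-generated texts, especially in text genres with less rigid linguistic style constraints. The findings underscore the need to incorporate meaningful linguistic features when studying the textual outputs of LLMs.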

Link: https://arxiv.org/abs/2412.03025
Authors: Sergio E. Zanotto,Segun Aroyehun
Keywords: large language models, generate natural language, LLMs increasingly indistinguishable, language models, large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. Recent research has predominantly focused on using LLMs to classify text as either human-written or machine-generated. In our study, we adopt a different approach by profiling texts spanning four domains based on 250 distinct linguistic features. We select the M4 dataset from the Subtask B of SemEval 2024 Task 8. We automatically calculate various linguistic features with the LFTK tool and additionally measure the average syntactic depth, semantic similarity, and emotional content for each document. We then apply a two-dimensional PCA reduction to all the calculated features. Our analyses reveal significant differences between human-written texts and those generated by LLMs, particularly in the variability of these features, which we find to be considerably higher in human-written texts. This discrepancy is especially evident in text genres with less rigid linguistic style constraints. Our findings indicate that humans write texts that are less cognitively demanding, with higher semantic content, and richer emotional content compared to texts generated by LLMs. These insights underscore the need for incorporating meaningful linguistic features to enhance the understanding of textual outputs of LLMs.
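As a rough illustration of the variability comparison described in the abstract, the sketch below computes the mean per-feature variance of two groups of documents. It is not the authors' pipeline: it skips the LFTK feature extraction and the PCA step, the function name `mean_feature_variance` is hypothetical, and the numbers are toy values invented for the example, not results from the paper.

```python
from statistics import pvariance


def mean_feature_variance(rows):
    """Average the population variance of each feature column.

    `rows` is a list of documents, each a list of feature values.
    """
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return sum(pvariance(col) for col in columns) / len(columns)


# Toy feature vectors (two features per document), invented for illustration.
human_rows = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
llm_rows = [[2.0, 12.0], [3.0, 13.0], [4.0, 14.0]]

human_var = mean_feature_variance(human_rows)
llm_var = mean_feature_variance(llm_rows)
print(human_var > llm_var)
```

In practice the features would need to be standardized first, since raw linguistic features live on very different scales and the largest-scale feature would otherwise dominate the average.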

[NLP-36] Advancing Conversational Psychotherapy: Integrating Privacy, Dual-Memory, and Domain Expertise with Large Language Models NEURIPS2024

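[Quick Read]: This paper addresses the limits that location, time, expense, and privacy concerns place on traditional psychotherapy. The key to the solution is SoulSpeak, a large language model (LLM)-enabled chatbot whose dual-memory component combines short-term and long-term context via Retrieval Augmented Generation (RAG) to give personalized responses, while a dedicated privacy module preserves user privacy and intimacy. SoulSpeak also draws on a counseling chat dataset of therapist-client interactions and various prompting techniques to align its responses with psychotherapeutic methods. Two fine-tuned BERT models are introduced to evaluate the system against existing LLMs and human therapists: the Conversational Psychotherapy Preference Model (CPPM), which simulates human preference among responses, and a second model that assesses how relevant a response is to the user input. By combining the expertise of traditional therapy with the advantages of LLMs, the work aims to offer a promising way to address the accessibility and personalization gap in current mental health services.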

Link: https://arxiv.org/abs/2412.02987
Authors: XiuYu Zhang,Zening Luo
Keywords: Retrieval Augmented Generation, Large Language Model, global issue, issue that reveals, reveals the limitations
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted as a Poster at Statistical Foundations of LLMs and Foundation Models (NeurIPS 2024 Workshop)

Abstract:Mental health has increasingly become a global issue that reveals the limitations of traditional conversational psychotherapy, constrained by location, time, expense, and privacy concerns. In response to these challenges, we introduce SoulSpeak, a Large Language Model (LLM)-enabled chatbot designed to democratize access to psychotherapy. SoulSpeak improves upon the capabilities of standard LLM-enabled chatbots by incorporating a novel dual-memory component that combines short-term and long-term context via Retrieval Augmented Generation (RAG) to offer personalized responses while ensuring the preservation of user privacy and intimacy through a dedicated privacy module. In addition, it leverages a counseling chat dataset of therapist-client interactions and various prompting techniques to align the generated responses with psychotherapeutic methods. We introduce two fine-tuned BERT models to evaluate the system against existing LLMs and human therapists: the Conversational Psychotherapy Preference Model (CPPM) to simulate human preference among responses and another to assess response relevance to user input. CPPM is useful for training and evaluating psychotherapy-focused language models independent from SoulSpeak, helping with the constrained resources available for psychotherapy. Furthermore, the effectiveness of the dual-memory component and the robustness of the privacy module are also examined. Our findings highlight the potential and challenge of enhancing mental health care by offering an alternative that combines the expertise of traditional therapy with the advantages of LLMs, providing a promising way to address the accessibility and personalization gap in current mental health services.

[NLP-37] Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

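[Quick Read]: This paper addresses the scarcity of direct comparisons among algorithms that generate synthetic data to augment natural data. The key to the solution is to evaluate algorithms by the quality, diversity, and complexity (QDC) of the synthetic data each one generates. The paper stresses the significance of these three characteristics in open-ended processes, and points out the Quality-Diversity trade-offs in training data and their downstream effects on model performance. By examining how each component of the synthetic data pipeline affects these characteristics, it taxonomizes and compares generation algorithms, and it argues that balancing QDC matters for efficient reinforcement learning and self-improvement algorithms.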

Link: https://arxiv.org/abs/2412.02980
Authors: Alex Havrilla,Andrew Dai,Laura O’Mahony,Koen Oostermeijer,Vera Zisler,Alon Albalak,Fabrizio Milo,Sharath Chandra Raparthy,Kanishk Gandhi,Baber Abbasi,Duy Phung,Maia Iyer,Dakota Mahan,Chase Blagden,Srishti Gureja,Mohammed Hamdy,Wen-Ding Li,Giovanni Paolini,Pawan Sasanka Ammanamanchi,Elliot Meyerson
Keywords: Large Language Models, Large Language, Synthetic data generation, Synthetic data, data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity trade-offs in training data and the downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion on the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, often there exist trade-offs between model output quality and output diversity which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.

[NLP-38] Curriculum-style Data Augmentation for LLM-based Metaphor Detection

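[Quick Read]: This paper addresses the high inference cost and latency that come from relying on closed-source large language models (LLMs) for metaphor detection. The key to the solution is to fine-tune open-source LLMs, which reduces inference cost and latency, and to introduce Curriculum-style Data Augmentation (CDA) to cope with data scarcity in metaphor detection. Specifically, CDA evaluates the training data before fine-tuning: correctly predicted instances are used for fine-tuning, while incorrectly predicted instances serve as seed data for augmentation, so that the model progressively acquires more complex knowledge and improves. Experiments show state-of-the-art performance compared with all baselines, and detailed ablation studies validate the effectiveness of CDA.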

Link: https://arxiv.org/abs/2412.02956
Authors: Kaidi Jia,Yanxia Wu,Rongsheng Li
Keywords: utilizing large language, utilizing large, large language models, achieved promising results, Recently
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently, utilizing large language models (LLMs) for metaphor detection has achieved promising results. However, these methods heavily rely on the capabilities of closed-source LLMs, which come with relatively high inference costs and latency. To address this, we propose a method for metaphor detection by fine-tuning open-source LLMs, effectively reducing inference costs and latency with a single inference step. Furthermore, metaphor detection suffers from a severe data scarcity problem, which hinders effective fine-tuning of LLMs. To tackle this, we introduce Curriculum-style Data Augmentation (CDA). Specifically, before fine-tuning, we evaluate the training data to identify correctly predicted instances for fine-tuning, while incorrectly predicted instances are used as seed data for data augmentation. This approach enables the model to quickly learn simpler knowledge and progressively acquire more complex knowledge, thereby improving performance incrementally. Experimental results demonstrate that our method achieves state-of-the-art performance across all baselines. Additionally, we provide detailed ablation studies to validate the effectiveness of CDA.
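The data split at the heart of CDA can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `split_by_prediction` and `toy_predict` are hypothetical names, and the keyword rule stands in for the open-source LLM whose predictions would be used in practice.

```python
from typing import Callable, List, Tuple

# (sentence, label), where 1 = metaphorical and 0 = literal.
Example = Tuple[str, int]


def split_by_prediction(
    data: List[Example], predict: Callable[[str], int]
) -> Tuple[List[Example], List[Example]]:
    """Split data into correctly predicted examples and augmentation seeds."""
    easy: List[Example] = []   # correctly predicted: used directly for fine-tuning
    seeds: List[Example] = []  # incorrectly predicted: seed data for augmentation
    for text, label in data:
        if predict(text) == label:
            easy.append((text, label))
        else:
            seeds.append((text, label))
    return easy, seeds


def toy_predict(text: str) -> int:
    """Hypothetical stand-in for the model: a crude keyword rule."""
    return int("sea of" in text)


data = [
    ("a sea of troubles", 1),
    ("the sea of Japan", 0),
    ("time is a thief", 1),
    ("he stole a car", 0),
]
easy, seeds = split_by_prediction(data, toy_predict)
print(len(easy), len(seeds))
```

In the curriculum described in the abstract, the `seeds` list would then drive augmentation, and the split would presumably be repeated as the model improves; the abstract does not spell out the schedule.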

[NLP-39] Dynamic Graph Neural Ordinary Differential Equation Network for Multi-modal Emotion Recognition in Conversation

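[Quick Read]: This paper addresses two weaknesses of existing graph convolutional network (GCN) methods for Multimodal Emotion Recognition in Conversation (MERC): they are prone to overfitting, and they cannot capture the temporal dependency of a speaker's emotions. The key to the solution is the Dynamic Graph Neural Ordinary Differential Equation Network (DGODE), which uses the dynamic changes of emotions to capture that temporal dependency and alleviates the overfitting problem of GCNs. Specifically, DGODE uses an adaptive mixhop mechanism to improve the generalization ability of GCNs, and a graph ODE evolution network to characterize the continuous dynamics of node representations over time.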

Link: https://arxiv.org/abs/2412.02935
Authors: Yuntao Shou,Tao Meng,Wei Ai,Keqin Li
Keywords: Multimodal emotion recognition, classifying human emotional, human emotional states, Multimodal emotion, existing multimodal emotion
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 6 figures

Abstract:Multimodal emotion recognition in conversation (MERC) refers to identifying and classifying human emotional states by combining data from multiple different modalities (e.g., audio, images, text, video, etc.). Most existing multimodal emotion recognition methods use GCN to improve performance, but existing GCN methods are prone to overfitting and cannot capture the temporal dependency of the speaker’s emotions. To address the above problems, we propose a Dynamic Graph Neural Ordinary Differential Equation Network (DGODE) for MERC, which combines the dynamic changes of emotions to capture the temporal dependency of speakers’ emotions, and effectively alleviates the overfitting problem of GCNs. Technically, the key idea of DGODE is to utilize an adaptive mixhop mechanism to improve the generalization ability of GCNs and use the graph ODE evolution network to characterize the continuous dynamics of node representations over time and capture temporal dependencies. Extensive experiments on two publicly available multimodal emotion recognition datasets demonstrate that the proposed DGODE model has superior performance compared to various baselines. Furthermore, the proposed DGODE can also alleviate the over-smoothing problem, thereby enabling the construction of a deep GCN network.
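To make the idea of "continuous dynamics of node representations" concrete, here is a minimal sketch of a graph ODE integrated with explicit Euler steps. It assumes the simple linear dynamics dH/dt = (Â - I)H with a row-normalized adjacency matrix Â, which is an illustrative choice only: the abstract does not give DGODE's actual formulation, the adaptive mixhop mechanism is omitted, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def euler_graph_ode(adj_norm, h0, t_end=1.0, steps=10):
    """Integrate dH/dt = (A_hat - I) H from t = 0 to t_end with Euler steps.

    adj_norm: row-normalized adjacency matrix (assumed, each row sums to 1).
    h0: initial node features, one row per node.
    """
    dt = t_end / steps
    h = [row[:] for row in h0]  # copy so the input is not modified
    for _ in range(steps):
        ah = matmul(adj_norm, h)  # neighbourhood average of each node
        h = [
            [h[i][j] + dt * (ah[i][j] - h[i][j]) for j in range(len(h[0]))]
            for i in range(len(h))
        ]
    return h


# Two fully connected nodes (with self-loops) and one feature per node.
adj = [[0.5, 0.5], [0.5, 0.5]]
one_step = euler_graph_ode(adj, [[0.0], [2.0]], t_end=0.5, steps=1)
print(one_step)
```

Each step moves a node's features part of the way toward its neighbourhood average, so depth becomes a continuous time variable rather than a fixed number of stacked layers.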

[NLP-40] Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

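[Quick Read]: This paper addresses the automation of cell type annotation in single-cell genomics, a process that is usually labor-intensive and requires extensive expert knowledge. The key idea is to use large language models (LLMs), which can efficiently process and synthesize vast corpora of text to extract essential biological knowledge such as marker genes, to make annotation more efficient. The paper introduces SOAR, a comprehensive benchmarking study of LLMs for cell type annotation, and assesses 8 instruction-tuned LLMs across 11 datasets spanning multiple cell types and species. The study explores how accurately LLMs classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, and extends their application to multiomics data through cross-modality translation. It also evaluates chain-of-thought (CoT) prompting for generating detailed biological insights during annotation. The results indicate that LLMs can provide robust interpretations of single-cell data without additional fine-tuning.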

Link: https://arxiv.org/abs/2412.02915
Authors: Junhao Liu,Siwei Xu,Lei Zhang,Jing Zhang
Keywords: underlying disease mechanisms, simultaneous molecular profiling, uncover underlying disease, cell type, cell type annotation
Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments:

Abstract:Over the past decade, the revolution in single-cell sequencing has enabled the simultaneous molecular profiling of various modalities across thousands of individual cells, allowing scientists to investigate the diverse functions of complex tissues and uncover underlying disease mechanisms. Among all the analytical steps, assigning individual cells to specific types is fundamental for understanding cellular heterogeneity. However, this process is usually labor-intensive and requires extensive expert knowledge. Recent advances in large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract essential biological knowledge, such as marker genes, potentially promoting more efficient and automated cell type annotations. To thoroughly evaluate the capability of modern instruction-tuned LLMs in automating the cell type identification process, we introduce SOAR, a comprehensive benchmarking study of LLMs for cell type annotation tasks in single-cell genomics. Specifically, we assess the performance of 8 instruction-tuned LLMs across 11 datasets, spanning multiple cell types and species. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data, while extending their application to multiomics data through cross-modality translation. Additionally, we evaluate the effectiveness of chain-of-thought (CoT) prompting techniques in generating detailed biological insights during the annotation process. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning, advancing the automation of cell type annotation in genomics research.

[NLP-41] Does Few-Shot Learning Help LLM Performance in Code Synthesis?

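[Quick Read]: This paper asks how optimizing the few-shot examples in a prompt can improve the coding ability of large language models (LLMs) on code generation tasks. The key to the solution is two methods for selecting few-shot examples: a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The two methods offer a trade-off between improved performance and reliance on training data and interpretability, and both significantly improve CodeLlama's results on the HumanEval+ coding benchmark.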

Link: https://arxiv.org/abs/2412.02906
Authors: Derek Xu,Tong Xie,Botao Xia,Haoyu Li,Yunsheng Bai,Yizhou Sun,Wei Wang
Keywords: Large language models, made significant strides, Large language, improved model design, model design
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have made significant strides at code generation through improved model design, training, and chain-of-thought. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study on whether few-shot examples improve LLM’s coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers 2 approaches for selecting few-shot examples, a model-free method, CODEEXEMPLAR-FREE, and a model-based method, CODEEXEMPLAR-BASED. The 2 methods offer a trade-off between improved performance and reliance on training data and interpretability. Both methods significantly improve CodeLlama’s coding ability across the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.

[NLP-42] Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning

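[Quick Read]: This paper addresses hallucinations in large language models (LLMs), that is, generated information that sounds credible but is incorrect. The key to the solution is an uncertainty-aware fine-tuning approach built on a novel uncertainty-aware causal language modeling loss function grounded in decision theory, which strengthens the model's uncertainty estimation in natural language generation. The approach yields better-calibrated uncertainty estimates on free-form generation tasks and significantly improves the model's ability to detect hallucinations and identify out-of-domain prompts.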

Link: https://arxiv.org/abs/2412.02904
Authors: Ranganath Krishnan,Piyush Khanna,Omesh Tickoo
Keywords: Large language models, Large language, revolutionized the field, impressive reasoning, natural language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have revolutionized the field of natural language processing with their impressive reasoning and question-answering capabilities. However, these models are sometimes prone to generating credible-sounding but incorrect information, a phenomenon known as LLM hallucinations. Reliable uncertainty estimation in LLMs is essential for fostering trust in their generated responses and serves as a critical tool for the detection and prevention of erroneous or hallucinated outputs. To achieve reliable and well-calibrated uncertainty quantification in open-ended and free-form natural language generation, we propose an uncertainty-aware fine-tuning approach for LLMs. This approach enhances the model’s ability to provide reliable uncertainty estimates without compromising accuracy, thereby guiding them to produce more trustworthy responses. We introduce a novel uncertainty-aware causal language modeling loss function, grounded in the principles of decision theory. Through rigorous evaluation on multiple free-form question-answering datasets and models, we demonstrate that our uncertainty-aware fine-tuning approach yields better calibrated uncertainty estimates in natural language generation tasks than fine-tuning with the standard causal language modeling loss. Furthermore, the experimental results show that the proposed method significantly improves the model’s ability to detect hallucinations and identify out-of-domain prompts.

[NLP-43] MLD-EA: Check and Complete Narrative Coherence by Introducing Emotions and Actions

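[Quick Read]: This paper addresses the lack of logical coherence in narrative understanding and story generation tasks in natural language processing (NLP). The key to the solution is the Missing Logic Detector by Emotion and Action (MLD-EA) model, which uses large language models (LLMs) to identify gaps in a narrative and generate coherent sentences that fit the story's emotional and logical flow. Experiments show that MLD-EA enhances narrative understanding and story generation, highlighting the potential of LLMs as effective logic checkers in story writing.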

Link: https://arxiv.org/abs/2412.02897
Authors: Jinming Zhang,Yunfei Long
Keywords: natural language processing, existing research focused, question-answering tasks, critical challenges, challenges in natural
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Narrative understanding and story generation are critical challenges in natural language processing (NLP), with much of the existing research focused on summarization and question-answering tasks. While previous studies have explored predicting plot endings and generating extended narratives, they often neglect the logical coherence within stories, leaving a significant gap in the field. To address this, we introduce the Missing Logic Detector by Emotion and Action (MLD-EA) model, which leverages large language models (LLMs) to identify narrative gaps and generate coherent sentences that integrate seamlessly with the story’s emotional and logical flow. The experimental results demonstrate that the MLD-EA model enhances narrative understanding and story generation, highlighting LLMs’ potential as effective logic checkers in story writing with logical coherence and emotional consistency. This work fills a gap in NLP research and advances broader goals of creating more sophisticated and reliable story-generation systems.

[NLP-44] Removing Spurious Correlation from Neural Network Interpretations

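[Quick Read]: This paper addresses the fact that existing algorithms for identifying the neurons responsible for undesired behaviors ignore the effect of confounders, in particular the topic of the conversation. The key to the solution is a new causal mediation approach that controls for the impact of the topic and thereby reduces spurious correlations. Experiments show that, once the effect of conversation topic is adjusted for, toxic behavior turns out to be less localized in the network.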

Link: https://arxiv.org/abs/2412.02893
Authors: Milad Fotouhi,Mohammad Taha Bahadori,Oluwaseyi Feyisetan,Payman Arabshahi,David Heckerman
Keywords: existing algorithms, algorithms for identification, identification of neurons, neurons responsible, responsible for undesired
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments:

Abstract:The existing algorithms for identification of neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that adjusting for the effect of conversation topic, toxicity becomes less localized.

[NLP-45] TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?

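[Quick Read]: This paper addresses automated test generation for test-driven development (TDD). The key contribution is TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories and filtered both by human judges and by execution in the evaluation harness. The paper also presents Auto-TDD, a large language model (LLM)-based solution that takes an issue description and the codebase prior to issue resolution and produces a test for validating the fix. Auto-TDD achieves a better fail-to-pass rate than the strongest prior work while also yielding high coverage adequacy, which the authors hope will make developers more productive at resolving issues and lead to more robust fixes.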

Link: https://arxiv.org/abs/2412.02883
Authors: Toufique Ahmed,Martin Hirzel,Rangeet Pan,Avraham Shinnar,Saurabh Sinha
Keywords: Test-driven development, numerous benefits, TDD expound, expound its numerous, TDD
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Test-driven development (TDD) is the practice of writing tests first and coding later, and the proponents of TDD expound its numerous benefits. For instance, given an issue on a source code repository, tests can clarify the desired behavior among stake-holders before anyone writes code for the agreed-upon fix. Although there has been a lot of work on automated test generation for the practice “write code first, test later”, there has been little such automation for TDD. Ideally, tests for TDD should be fail-to-pass (i.e., fail before the issue is resolved and pass after) and have good adequacy with respect to covering the code changed during issue resolution. This paper introduces TDD-Bench Verified, a high-quality benchmark suite of 449 issues mined from real-world GitHub code repositories. The benchmark’s evaluation harness runs only relevant tests in isolation for simple yet accurate coverage measurements, and the benchmark’s dataset is filtered both by human judges and by execution in the harness. This paper also presents Auto-TDD, an LLM-based solution that takes as input an issue description and a codebase (prior to issue resolution) and returns as output a test that can be used to validate the changes made for resolving the issue. Our evaluation shows that Auto-TDD yields a better fail-to-pass rate than the strongest prior work while also yielding high coverage adequacy. Overall, we hope that this work helps make developers more productive at resolving issues while simultaneously leading to more robust fixes.

[NLP-46] CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural Networks

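[Quick Read]: This paper addresses the limitations of single-vector search in traditional Retrieval-Augmented Generation (RAG) and proposes a novel hierarchical approach called CAISSON. The key to the solution is a multi-view clustering framework built from dual Self-Organizing Maps (SOMs): one view processes combined text and metadata embeddings, and the other operates on metadata enriched with concept embeddings, so that both fine-grained semantic relationships and high-level conceptual patterns among documents are captured. By combining evidence from these two organizational views, the approach enables more nuanced document discovery and substantially improves retrieval for complex multi-entity queries, while keeping response times suitable for interactive applications.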

Link: https://arxiv.org/abs/2412.02835
Authors: Igor Halperin
Keywords: transforms traditional single-vector, traditional single-vector search, Retrieval-Augmented Generation, transforms traditional, traditional single-vector
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures, 8 tables

Abstract:We present CAISSON, a novel hierarchical approach to Retrieval-Augmented Generation (RAG) that transforms traditional single-vector search into a multi-view clustering framework. At its core, CAISSON leverages dual Self-Organizing Maps (SOMs) to create complementary organizational views of the document space, where each view captures different aspects of document relationships through specialized embeddings. The first view processes combined text and metadata embeddings, while the second operates on metadata enriched with concept embeddings, enabling a comprehensive multi-view analysis that captures both fine-grained semantic relationships and high-level conceptual patterns. This dual-view approach enables more nuanced document discovery by combining evidence from different organizational perspectives. To evaluate CAISSON, we develop SynFAQA, a framework for generating synthetic financial analyst notes and question-answer pairs that systematically tests different aspects of information retrieval capabilities. Drawing on HotPotQA’s methodology for constructing multi-step reasoning questions, SynFAQA generates controlled test cases where each question is paired with the set of notes containing its ground-truth answer, progressing from simple single-entity queries to complex multi-hop retrieval tasks involving multiple entities and concepts. Our experimental results demonstrate substantial improvements over both basic and enhanced RAG implementations, particularly for complex multi-entity queries, while maintaining practical response times suitable for interactive applications.

[NLP-47] RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models

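[Quick Read]: This paper addresses the limited reasoning accuracy and factual integrity of large language models (LLMs) on complex, knowledge-intensive tasks such as commonsense and medical reasoning. The key to the solution is RARE (Retrieval-Augmented Reasoning Enhancement), a versatile extension to the mutual reasoning framework (rStar). RARE adds two actions to the Monte Carlo Tree Search (MCTS) framework: A6 generates search queries from the initial problem statement, retrieves information with those queries, and uses the retrieved data to formulate the final answer; A7 applies retrieval specifically to generated sub-questions and re-answers them with the relevant contextual information. The paper also proposes a Retrieval-Augmented Factuality Scorer to replace the original discriminator, prioritizing reasoning paths that meet high standards of factuality. Experiments show that RARE lets open-source LLMs achieve performance competitive with top models such as GPT-4 and GPT-4o, which positions RARE as a scalable way to improve LLMs in domains where logical coherence and factual integrity are critical.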

Link: https://arxiv.org/abs/2412.02830
Authors: Hieu Tran,Zonghai Yao,Junda Wang,Yifan Zhang,Zhichao Yang,Hong Yu
Keywords: Retrieval-Augmented Reasoning Enhancement, work introduces RARE, Monte Carlo Tree, enhancing reasoning accuracy, Carlo Tree Search
Categories: Computation and Language (cs.CL)
Comments: 24 pages

Abstract:This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a versatile extension to the mutual reasoning framework (rStar), aimed at enhancing reasoning accuracy and factual integrity across large language models (LLMs) for complex, knowledge-intensive tasks such as commonsense and medical reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree Search (MCTS) framework: A6, which generates search queries based on the initial problem statement, performs information retrieval using those queries, and augments reasoning with the retrieved data to formulate the final answer; and A7, which leverages information retrieval specifically for generated sub-questions and re-answers these sub-questions with the relevant contextual information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed to replace the original discriminator, prioritizing reasoning paths that meet high standards of factuality. Experimental results with LLaMA 3.1 show that RARE enables open-source LLMs to achieve competitive performance with top proprietary models like GPT-4 and GPT-4o. This research establishes RARE as a scalable solution for improving LLMs in domains where logical coherence and factual integrity are critical.

[NLP-48] Minimization of Boolean Complexity in In-Context Concept Learning

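[Quick Read]: This paper asks which factors contribute to the relative success, and the corresponding difficulties, of in-context learning in large language models (LLMs). The key to the approach is a set of carefully designed concept learning tasks, on which task performance correlates highly with the Boolean complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity, in a way similar to humans.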

Link: https://arxiv.org/abs/2412.02823
Authors: Leroy Z. Wang,R. Thomas McCoy,Shane Steinert-Threlkeld
Keywords: Large Language Models, Language Models, Large Language, factors contribute, relative success
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:What factors contribute to the relative success and corresponding difficulties of in-context learning for Large Language Models (LLMs)? Drawing on insights from the literature on human concept learning, we test LLMs on carefully designed concept learning tasks, and show that task performance highly correlates with the Boolean complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity in a way similar to humans.

[NLP-49] CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels

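[Quick Read]: This paper addresses the scarcity of high-quality summarization datasets for long-context tasks. The key contribution is CNNSum, a multi-scale Chinese long-context novel summarization benchmark with four subsets, lengths from 16k to 128k, 695 samples in total, and human-driven annotations. The paper evaluates commercial and open-source models on CNNSum and analyzes the results in detail. It finds that long-context summarization currently relies mainly on memory ability, and that small LLMs with stable longer context lengths are the most cost-effective. Training on long data concatenated from short-context summaries brings a significant improvement. Prompt templates can cause a large performance gap, which fine-tuning can mitigate. The paper also notes that models with RoPE base scaling may vary significantly in performance when combined with other interpolation methods, so these need careful selection. CNNSum is released to advance research in this field.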

Link: https://arxiv.org/abs/2412.02819
Authors: Lingxiao Wei,He Yan,Xiangju Lu,Junmin Zhu,Jun Wang,Wei Zhang
Keywords: Large Language Models, Large Language, Language Models, Language, long-context tasks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have been well-researched in many long-context tasks. However, due to high annotation costs, high-quality long-context summary datasets for training or evaluation are scarce, limiting further research. In this work, we introduce CNNSum, a new multi-scale Chinese long-context novel summarization benchmark, including four subsets, length covering 16k\textasciitilde128k, 695 samples in total, the annotations are human-driven. We evaluate commercial and open-source models on CNNSum and conduct a detailed analysis. Based on the observations, we further conduct fine-tuning exploration with short-context summary data. In our study: (1) GPT-4o underperformed, due to excessive subjective commentary. (2) Currently, long-context summarization mainly relies on memory ability, small LLMs with stable longer context lengths are the most cost-effective. Using long data concatenated from short-context summaries makes a significant improvement. (3) Prompt templates may cause a large performance gap but can be mitigated through fine-tuning. (4) Fine-tuned Chat or Instruction versions may harm the Base model and further fine-tuning cannot bridge performance gap. (5) while models with RoPE base scaling exhibit strong extrapolation potential, their performance may vary significantly when combined with other interpolation methods and need careful selection. (6) CNNSum provides more reliable and insightful evaluation results than other benchmarks. We release CNNSum to advance research in this field.

[NLP-50] An Evolutionary Large Language Model for Hallucination Mitigation

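[Quick read]: This paper addresses hallucination in large language models (LLMs), the confident presentation of inaccurate or fabricated information, which is especially serious in specialized domains such as healthcare and law where accuracy is essential. The key to the solution is EvoLLMs, a framework inspired by Evolutionary Computation that uses genetic algorithms, mimicking evolutionary processes such as selection, variation, and mutation, to automate the generation of high-quality question-answering (QA) datasets while minimizing hallucinations. EvoLLMs outperforms human-generated datasets on key metrics such as Depth, Relevance, and Coverage, nearly matches human performance in mitigating hallucinations, and significantly reduces the time and resources required for manual curation.

Link: https://arxiv.org/abs/2412.02790
Authors: Abdennour Boulesnane,Abdelhakim Souilah
Keywords (EN): artificial intelligence applications, intelligence applications characterized, applications generating text, ChatGPT and Gemini, high-impact applications generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: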

Abstract:The emergence of LLMs, like ChatGPT and Gemini, has marked the modern era of artificial intelligence applications characterized by high-impact applications generating text, images, and videos. However, these models usually ensue with one critical challenge called hallucination: confident presentation of inaccurate or fabricated information. This problem attracts serious concern when these models are applied to specialized domains, including healthcare and law, where the accuracy and preciseness of information are absolute conditions. In this paper, we propose EvoLLMs, an innovative framework inspired by Evolutionary Computation, which automates the generation of high-quality Question-answering (QA) datasets while minimizing hallucinations. EvoLLMs employs genetic algorithms, mimicking evolutionary processes like selection, variation, and mutation, to guide LLMs in generating accurate, contextually relevant question-answer pairs. Comparative analysis shows that EvoLLMs consistently outperforms human-generated datasets in key metrics such as Depth, Relevance, and Coverage, while nearly matching human performance in mitigating hallucinations. These results highlight EvoLLMs as a robust and efficient solution for QA dataset generation, significantly reducing the time and resources required for manual curation.
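
For readers unfamiliar with the selection, variation, and mutation loop the abstract refers to, here is a generic genetic algorithm sketch. It is not EvoLLMs itself: the genome is a bit string, and the fitness function is a placeholder standing in for whatever quality score a real system would assign to a candidate question-answer pair. All names and default values are assumptions.

```python
import random
from typing import Callable, List

Genome = List[int]


def evolve(fitness: Callable[[Genome], float], length: int = 16,
           pop_size: int = 20, generations: int = 30,
           mutation_rate: float = 0.05, seed: int = 0) -> Genome:
    """Return the fittest genome found by a simple genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def select(population: List[Genome]) -> Genome:
        # Selection: a two-way tournament keeps the fitter individual.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Elitism: carry the current best over unchanged.
        children = [max(pop, key=fitness)[:]]
        while len(children) < pop_size:
            p1, p2 = select(pop), select(pop)
            # Variation: one-point crossover of the two parents.
            cut = rng.randint(1, length - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)
```

Calling `evolve(sum)` maximizes the number of ones in the genome. Because of elitism, the best fitness never decreases from one generation to the next.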

[NLP-51] Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset

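[Quick read]: This paper addresses a limitation of scholarly question answering (QA) systems when information spans heterogeneous sources: existing methods typically rely on a single type of source, either text or Knowledge Graphs (KGs). The key is Hybrid-SQuAD, a new large-scale QA dataset of 10.5K question-answer pairs generated by a large language model from the DBLP and SemOpenAlex KGs together with corresponding Wikipedia text. The paper also proposes a baseline hybrid QA model based on RAG (Retrieval-Augmented Generation), which reaches an exact match score of 69.65 on the Hybrid-SQuAD test set.

Link: https://arxiv.org/abs/2412.02788
Authors: Tilahun Abedissa Taffa,Debayan Baneerje,Yaregal Assabie,Ricardo Usbeck
Keywords (EN): Existing Scholarly Question, Knowledge Graphs, methods typically target, Scholarly Question Answering, typically target homogeneous
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: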

Abstract:Existing Scholarly Question Answering (QA) methods typically target homogeneous data sources, relying solely on either text or Knowledge Graphs (KGs). However, scholarly information often spans heterogeneous sources, necessitating the development of QA systems that can integrate information from multiple heterogeneous data sources. To address this challenge, we introduce Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale QA dataset designed to facilitate answering questions incorporating both text and KG facts. The dataset consists of 10.5K question-answer pairs generated by a large language model, leveraging the KGs - DBLP and SemOpenAlex alongside corresponding text from Wikipedia. In addition, we propose a RAG-based baseline hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD test set.
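
The exact match score quoted above is a standard QA metric. The sketch below shows one common way to compute it, using the answer normalization popularized by SQuAD-style evaluation (lowercasing, stripping punctuation and English articles). Whether the authors normalize in exactly this way is an assumption; the function names are illustrative.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition, a score of 69.65 would mean that roughly seven in ten predicted answers match the reference string after normalization.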

[NLP-52] Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training

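[Quick read]: This paper aims to improve the accuracy of Turkish language models in few-shot and zero-shot settings. The key is a new approach to corpus selection and training: (1) adapting datasets generated by large language models; (2) translating English datasets into Turkish; and (3) integrating these resources into training. Merging the adapted models further improved performance markedly. Human evaluation indicates that the adapted models are better at understanding Turkish and answering logic-based queries, underscoring the importance of refining corpus selection strategies for multilingual models, particularly for under-resourced languages such as Turkish.

Link: https://arxiv.org/abs/2412.02775
Authors: H. Toprak Kesgin,M. Kaan Yuce,Eren Dogan,M. Egemen Uzun,Atahan Uz,Elif Ince,Yusuf Erdem,Osama Shbib,Ahmed Zeer,M. Fatih Amasyali
Keywords (EN): develop and assess, Large Language Model, Turkish language, adapted Large Language, Turkish
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 2024 Innovations in Intelligent Systems and Applications Conference (ASYU)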

Abstract:In this study, we develop and assess new corpus selection and training methodologies to improve the effectiveness of Turkish language models. Specifically, we adapted Large Language Model generated datasets and translated English datasets into Turkish, integrating these resources into the training process. This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios. Furthermore, the merging of these adapted models was found to markedly improve their performance. Human evaluative metrics, including task-specific performance assessments, further demonstrated that these adapted models possess a greater aptitude for comprehending the Turkish language and addressing logic-based queries. This research underscores the importance of refining corpus selection strategies to optimize the performance of multilingual models, particularly for under-resourced languages like Turkish.

[NLP-53] Cosmos-LLaVA: Chatting with the Visual Cosmos-LLaVA: Görselle Sohbet Etmek

[Quick read]: This paper aims to improve the performance of a Turkish visual instruction model by developing and analyzing different model architectures and dataset combinations. The key is the Cosmos-LLaVA model, built by combining different large language models and image coders and designed to overcome deficiencies in the Turkish language. Experiments on fine-tuning with various datasets show that model architecture and dataset selection have a significant impact on performance.

Link: https://arxiv.org/abs/2412.02760
Authors: Ahmed Zeer,Eren Dogan,Yusuf Erdem,Elif Ince,Osama Shbib,M. Egemen Uzun,Atahan Uz,M. Kaan Yuce,H. Toprak Kesgin,M. Fatih Amasyali
Keywords (EN): Turkish visual instruction, visual instruction model, Turkish visual, Turkish language, visual instruction
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: in Turkish language, 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP)

Abstract:In this study, a Turkish visual instruction model was developed and various model architectures and dataset combinations were analysed to improve the performance of this model. The Cosmos-LLaVA model, which is built by combining different large language models and image coders, is designed to overcome the deficiencies in the Turkish language. In the experiments, the effects of fine-tuning with various datasets on the model performance are analysed in detail. The results show that model architecture and dataset selection have a significant impact on performance. Bu çalışmada bir Türkçe görsel talimat modeli geliştirilerek bu modelin performansını artırmaya yönelik çeşitli model mimarileri ve veri kümesi kombinasyonları derinlemesine incelenmiştir. Farklı büyük dil modelleri ve görüntü kodlayıcılarının bir araya getirilmesiyle oluşturulan Cosmos-LLaVA modeli, Türkçe dilindeki eksiklikleri gidermeye yönelik olarak tasarlanmıştır. Yapılan deneylerde, çeşitli veri kümeleri ile yapılan ince ayarların model performansını nasıl etkilediği detaylı olarak ele alınmıştır. Sonuçlar, model mimarisi ve veri kümesi seçiminin performans üzerinde önemli bir etkiye sahip olduğunu göstermektedir.
Related DOI: https://doi.org/10.1109/IDAP64064.2024.10710874

Computer Vision

[CV-0] Navigation World Models

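[Quick read]: This paper addresses navigation for agents with visual-motor capabilities. The key is the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations from past observations and navigation actions. NWM uses a Conditional Diffusion Transformer (CDiT) trained on a diverse collection of egocentric videos of human and robotic agents and scaled up to 1 billion parameters. In familiar environments it plans by simulating navigation trajectories and evaluating whether they reach the goal, and it can incorporate constraints dynamically during planning. Using its learned visual priors, NWM can also imagine trajectories in unfamiliar environments from a single input image.

Link: https://arxiv.org/abs/2412.03572
Authors: Amir Bar,Gaoyue Zhou,Danny Tran,Trevor Darrell,Yann LeCun
Keywords (EN): Conditional Diffusion Transformer, visual-motor capabilities, Navigation World Model, NWM, fundamental skill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: project page: this https URL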

Abstract:Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.
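
Planning by simulating candidate trajectories with a world model and keeping the one that best reaches the goal can be sketched in a few lines. This toy version is only an illustration of that loop: the "world model" is exact 2D kinematics rather than a video diffusion model, the cost is the final distance to the goal, and every name and default value is an assumption.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, float]
Action = Tuple[float, float]


def toy_world_model(state: State, action: Action) -> State:
    # Stand-in for a learned predictor: the next state is state + action.
    return (state[0] + action[0], state[1] + action[1])


def plan(start: State, goal: State,
         model: Callable[[State, Action], State],
         horizon: int = 5, num_candidates: int = 64,
         constraint: Optional[Callable[[Sequence[State]], bool]] = None,
         seed: int = 0) -> List[Action]:
    """Sample action sequences, roll each out in the model, keep the best."""
    rng = random.Random(seed)
    best_actions: Optional[List[Action]] = None
    best_cost = math.inf
    for _ in range(num_candidates):
        actions = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
                   for _ in range(horizon)]
        states = [start]
        for action in actions:
            states.append(model(states[-1], action))
        # Constraints are checked on the imagined rollout, at planning time.
        if constraint is not None and not constraint(states):
            continue
        cost = math.dist(states[-1], goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    if best_actions is None:
        raise ValueError("no candidate satisfied the constraint")
    return best_actions
```

Passing a different `constraint` at call time changes which rollouts are admissible without retraining anything, which is the contrast with fixed-behavior policies that the abstract draws.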

[CV-1] Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation

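[Quick read]: This paper addresses generating stylized 3D objects from a content image and a style image. The key is Style3D, which decomposes the task into two processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. The paper introduces MultiFusion Attention, in which query features from the content image preserve geometric consistency across views while key and value features from the style image guide the style transfer, maintaining spatial coherence and stylistic fidelity across multi-view images. A large 3D reconstruction model then produces a coherent stylized 3D object. By establishing an interplay between structural and stylistic features across views, the approach enables a holistic 3D stylization process, and the authors report that it surpasses existing methods in both computational efficiency and visual quality.

Link: https://arxiv.org/abs/2412.03571
Authors: Bingjie Song,Xin Huang,Ruting Xie,Xue Wang,Qing Wang
Keywords (EN): object stylization, stylization, objects, object, image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: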

Abstract:We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique to achieve multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across multiple views, while the key and value features from the style image are used to guide the stylistic transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments demonstrate that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets, surpassing existing methods in both computational efficiency and visual quality.

[CV-2] Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis NEURIPS2024

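[Quick read]: This paper addresses the degradation of 3D reconstruction quality when inferring 3D structure from multi-view images with inaccurate camera pose estimates and sparse observations. The key is SparseAGS, which makes 3D reconstruction and pose estimation more robust through two strategies: (a) combining novel-view-synthesis-based generative priors with photometric objectives to improve the quality of the inferred 3D; and (b) explicitly reasoning about outliers and correcting them with a discrete search combined with continuous optimization. SparseAGS significantly improves the pose accuracy of existing pose estimation systems and yields high-quality 3D reconstructions that outperform current multi-view reconstruction baselines.

Link: https://arxiv.org/abs/2412.03570
Authors: Qitao Zhao,Shubham Tulsiani
Keywords (EN): requires precise camera, images typically requires, typically requires solving, precise camera poses, predicting camera poses
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024. Project website: this https URL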

Abstract:Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks – accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems’ pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines.

[CV-3] Streaming Detection of Queried Event Start

Link: https://arxiv.org/abs/2412.03567
Authors: Cristobal Eyzaguirre,Eric Tang,Shyamal Buch,Adrien Gaidon,Jiajun Wu,Juan Carlos Niebles
Keywords (EN):
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-4] FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

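[Quick read]: This paper addresses the severe degradation in rendering quality that camera simulation methods for autonomous driving suffer at viewpoints beyond the recorded trajectories. The key is a generative enhancement model, paired with a matched data construction strategy, that generates high-quality images at viewpoints slightly off the recorded trajectory. The paper also proposes a progressive reconstruction strategy that gradually adds generated images of unrecorded views to the reconstruction, starting from slightly off-trajectory viewpoints and moving progressively farther away, which enables high-quality off-trajectory view synthesis under large deviations of more than 3 meters.

Link: https://arxiv.org/abs/2412.03566
Authors: Lue Fan,Hao Zhang,Qitai Wang,Hongsheng Li,Zhaoxiang Zhang
Keywords (EN): camera simulation method, autonomous driving, camera simulation, simulation method, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL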

Abstract:We propose FreeSim, a camera simulation method for autonomous driving. FreeSim emphasizes high-quality rendering from viewpoints beyond the recorded ego trajectories. In such viewpoints, previous methods have unacceptable degradation because the training data of these viewpoints is unavailable. To address such data scarcity, we first propose a generative enhancement model with a matched data construction strategy. The resulting model can generate high-quality images in a viewpoint slightly deviated from the recorded trajectories, conditioned on the degraded rendering of this viewpoint. We then propose a progressive reconstruction strategy, which progressively adds generated images of unrecorded views into the reconstruction process, starting from slightly off-trajectory viewpoints and moving progressively farther away. With this progressive generation-reconstruction pipeline, FreeSim supports high-quality off-trajectory view synthesis under large deviations of more than 3 meters.

[CV-5] Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

[Quick read]: This paper addresses the weakness of Large Multimodal Models (LMMs) in instance-level understanding, which requires more nuanced comprehension of and alignment with specific elements. The key is an automated annotation pipeline assisted by GPT-4o that extracts instance-level information from images and videos through explicit visual prompting, together with Inst-IT, which enhances the instance understanding of LMMs via explicit visual prompt instruction tuning. Inst-IT consists of a benchmark for diagnosing multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that strengthens the spatial-temporal instance understanding of existing LMMs.

Link: https://arxiv.org/abs/2412.03565
Authors: Wujian Peng,Lingchen Meng,Yitong Chen,Yiweng Xie,Yang Liu,Tao Gui,Hang Xu,Xipeng Qiu,Zuxuan Wu,Yu-Gang Jiang
Keywords (EN): Large Multimodal Models, Large Multimodal, made significant breakthroughs, instruction tuning, understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding that requires a more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that the state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we proposed Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.

[CV-6] FLAIR: VLM with Fine-grained Language-informed Image Representations

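[Quick read]: This paper addresses the limited ability of CLIP to capture detailed visual features, which stems from matching images and texts at a global level. The key is FLAIR (Fine-grained Language-informed Image Representations), which uses long, detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details of an image, the vision-language model is trained to produce text-specific image representations in addition to global embeddings. The method introduces text-conditioned attention pooling over local image tokens to produce fine-grained image representations that excel at retrieving detailed image content, and it achieves state-of-the-art performance on fine-grained retrieval tasks.

Link: https://arxiv.org/abs/2412.03561
Authors: Rui Xiao,Sanghwan Kim,Mariana-Iuliana Georgescu,Zeynep Akata,Stephan Alaniz
Keywords (EN): shown impressive results, CLIP matches images, Image, shown impressive, impressive results
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: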

Abstract:CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both, existing multimodal retrieval benchmarks, as well as, our newly introduced fine-grained retrieval task which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs. Code is available at this https URL .
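
Text-conditioned attention pooling can be read as single-query attention in which the text embedding is the query and the local image tokens serve as keys and values. The sketch below shows only that core computation in plain Python. It omits the learned projections, multiple heads, and normalization a real model would use, and the function name is an assumption rather than anything taken from the FLAIR code.

```python
import math
from typing import List, Sequence, Tuple


def text_conditioned_pool(text_emb: Sequence[float],
                          image_tokens: Sequence[Sequence[float]]
                          ) -> Tuple[List[float], List[float]]:
    """Pool local image tokens with softmax weights given by the text query.

    Returns (pooled_embedding, attention_weights).
    """
    dim = len(text_emb)
    # Scaled dot-product score between the text query and every image token.
    scores = [sum(t * q for t, q in zip(token, text_emb)) / math.sqrt(dim)
              for token in image_tokens]
    # Numerically stable softmax over the tokens.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the tokens gives the text-specific image representation.
    pooled = [sum(w * token[j] for w, token in zip(weights, image_tokens))
              for j in range(dim)]
    return pooled, weights
```

Because the weights depend on the caption, two different sub-captions for the same image produce two different pooled vectors, which is what distinguishes this from a single global image embedding.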

[CV-7] MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

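[Quick read]: This paper addresses compositional 3D scene generation from a single image. The key is MIDI, a multi-instance diffusion approach that extends pre-trained image-to-3D object generation models so that multiple 3D instances are generated simultaneously with accurate spatial relationships and high generalizability. Its core innovation is a multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, without complex multi-step processing. MIDI takes partial object images and global scene context as inputs and models object completion directly during 3D generation. Training supervises the interactions between 3D instances with a limited amount of scene-level data and uses single-object data for regularization, which preserves the generalization ability of the pre-trained model.

Link: https://arxiv.org/abs/2412.03558
Authors: Zehuan Huang,Yuan-Chen Guo,Xingqiao An,Yunhan Yang,Yangguang Li,Zi-Xin Zou,Ding Liang,Xihui Liu,Yan-Pei Cao,Lu Sheng
Keywords (EN): paper introduces MIDI, paradigm for compositional, paper introduces, introduces MIDI, MIDI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL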

Abstract:This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism, that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.

[CV-8] PaliGemma 2: A Family of Versatile VLMs for Transfer

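[Quick read]: This paper upgrades the open PaliGemma Vision-Language Model (VLM) to improve transfer to downstream tasks via fine-tuning. The key is combining the SigLIP-So400m vision encoder with the full range of Gemma 2 language models, from 2B to 27B, and training them in multiple stages at three resolutions (224px, 448px, and 896px). The resulting family of base models lets the authors study how factors such as learning rate, task type, model size, and resolution affect transfer performance. PaliGemma 2 obtains state-of-the-art results on a broader set of transfer tasks, including OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation.

Link: https://arxiv.org/abs/2412.03555
Authors: Andreas Steiner,André Susano Pinto,Michael Tschannen,Daniel Keysers,Xiao Wang,Yonatan Bitton,Alexey Gritsenko,Matthias Minderer,Anthony Sherbondy,Shangbang Long,Siyang Qin,Reeve Ingle,Emanuele Bugliarello,Sahar Kazemzadeh,Thomas Mesnard,Ibrahim Alabdulmohsin,Lucas Beyer,Xiaohua Zhai
Keywords (EN): open Vision-Language Model, PaliGemma open Vision-Language, open Vision-Language, Vision-Language Model, language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: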

Abstract:PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
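
To get a feel for why resolution matters for compute, the sketch below counts image tokens per resolution and enumerates the model-size by resolution grid. Two things here are assumptions not stated in the abstract: a ViT patch size of 14 with one token per patch, and the intermediate 9B language model size (the abstract names only the 2B and 27B end points).

```python
from itertools import product
from typing import List, Tuple

PATCH_SIZE = 14                 # assumption: 14x14-pixel patches, one token each
RESOLUTIONS = (224, 448, 896)   # from the abstract
LM_SIZES = ("2B", "9B", "27B")  # assumption: 9B is not named in the abstract


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of the given side length."""
    if resolution % patch_size:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


def model_grid() -> List[Tuple[str, int, int]]:
    """All (language model size, resolution, image tokens) combinations."""
    return [(lm, res, num_image_tokens(res))
            for lm, res in product(LM_SIZES, RESOLUTIONS)]
```

Under these assumptions, doubling the resolution quadruples the number of image tokens the language model has to process, which is one reason the interplay between task, model size, and resolution is worth studying.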

[CV-9] Imagine360: Immersive 360 Video Generation from Perspective Anchor

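[Quick read]: This paper addresses lifting standard perspective videos into high-quality 360° equirectangular videos, to make 360° content creation more user-friendly and personalized. The key is the Imagine360 framework, which relies on three designs: (1) a dual-branch design, with a perspective video denoising branch and a panorama video denoising branch, that provides local and global constraints; (2) an antipodal mask that captures long-range motion dependencies and enhances the reversed camera motion between antipodal pixels across hemispheres; and (3) elevation-aware designs that adapt to the varying video masking caused by elevation changes across frames in diverse perspective video inputs. With these designs, Imagine360 achieves better graphics quality and motion coherence than existing 360° video generation methods.

Link: https://arxiv.org/abs/2412.03552
Authors: Jing Tan,Shuai Yang,Tong Wu,Jingwen He,Yuwei Guo,Ziwei Liu,Dahua Lin
Keywords (EN): circ, video, scene from full, offer a hyper-immersive, hyper-immersive experience
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL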

Abstract:360° videos offer a hyper-immersive experience that allows the viewers to explore a dynamic scene from full 360 degrees. To achieve more user-friendly and personalized content creation in 360° video format, we seek to lift standard perspective videos into 360° equirectangular videos. To this end, we introduce Imagine360, the first perspective-to-360° video generation framework that creates high-quality 360° videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited 360° video data with several key designs. 1) Firstly we adopt the dual-branch design, including a perspective and a panorama video denoising branch to provide local and global constraints for 360° video generation, with motion module and spatial LoRA layers fine-tuned on extended web 360° videos. 2) Additionally, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. 3) To handle diverse perspective video inputs, we propose elevation-aware designs that adapt to varying video masking due to changing elevations across frames. Extensive experiments show Imagine360 achieves superior graphics quality and motion coherence among state-of-the-art 360° video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive 360° video creation.
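
The notion of antipodal pixels has a simple form in an equirectangular image: the antipode of a point lies 180° away in longitude and at the negated latitude. The sketch below computes that correspondence for pixel indices. It illustrates the geometry only, not the paper's antipodal mask, and it assumes pixel centers sit at half-integer offsets and that the image width is even.

```python
from typing import Tuple


def pixel_to_lonlat(row: int, col: int, height: int, width: int) -> Tuple[float, float]:
    """Longitude and latitude (degrees) of a pixel center in an equirectangular image."""
    lon = (col + 0.5) / width * 360.0 - 180.0
    lat = 90.0 - (row + 0.5) / height * 180.0
    return lon, lat


def antipodal_pixel(row: int, col: int, height: int, width: int) -> Tuple[int, int]:
    """Pixel on the opposite side of the sphere: flip latitude, shift longitude by 180°."""
    if width % 2:
        raise ValueError("width must be even so that 180° is a whole number of columns")
    return height - 1 - row, (col + width // 2) % width
```

Applying `antipodal_pixel` twice returns the original pixel, as expected for a point reflection through the center of the sphere.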

[CV-10] Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

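[Quick read]: This paper addresses the limitations of multimodal language models (MLMs) on visual perception tasks, especially tasks that benefit from depth estimation or object detection, where MLMs cannot produce intermediate depth maps or bounding boxes to reason over. The key is Perception Tokens, intrinsic image representations designed to assist reasoning where language is insufficient. They act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. The proposed AURORA training method uses a VQVAE to convert intermediate image representations such as depth maps into a tokenized format, combines them with bounding box tokens in a multi-task training framework, and notably improves reasoning over visual inputs on counting benchmarks and relative depth estimation.

Link: https://arxiv.org/abs/2412.03548
Authors: Mahtab Bigverdi,Zelun Luo,Cheng-Yu Hsieh,Ethan Shen,Dongping Chen,Linda G. Shapiro,Ranjay Krishna
Keywords (EN): Multimodal language models, Multimodal language, Perception Tokens, perception, face challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: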

Abstract:Multimodal language models (MLMs) still face challenges in fundamental visual perception tasks where specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet, MLMs can not produce intermediate depth or boxes to reason over. Finetuning MLMs on relevant data doesn’t generalize well and outsourcing computation to specialized vision tools is too compute-intensive and memory-inefficient. To address this, we introduce Perception Tokens, intrinsic image representations designed to assist reasoning tasks where language is insufficient. Perception tokens act as auxiliary reasoning tokens, akin to chain-of-thought prompts in language models. For example, in a depth-related task, an MLM augmented with perception tokens can reason by generating a depth map as tokens, enabling it to solve the problem effectively. We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. AURORA leverages a VQVAE to transform intermediate image representations, such as depth maps into a tokenized format and bounding box tokens, which is then used in a multi-task training framework. AURORA achieves notable improvements across counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, outperforming finetuning approaches in generalization across datasets. It also improves on relative depth: over +6% on BLINK. With perception tokens, AURORA expands the scope of MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
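
The step that turns a continuous representation into discrete tokens in a VQVAE is a nearest-neighbor lookup in a codebook. The sketch below shows that lookup and a hypothetical way to render the resulting indices as vocabulary items. The codebook values, the token naming scheme, and the function names are all illustrative assumptions, not details from AURORA.

```python
from typing import List, Sequence


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
            for vec in vectors]


def to_tokens(indices: Sequence[int], prefix: str = "DEPTH") -> List[str]:
    # Hypothetical naming: one special token per codebook index.
    return [f"<{prefix}_{i}>" for i in indices]
```

Once a depth map is expressed as such a token sequence, a language model can emit it as an intermediate step before giving its final answer, in the same way it would emit chain-of-thought text.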

[CV-11] Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

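[Quick read]: This paper addresses the weaknesses of static feed-forward scene reconstruction models, which struggle to generalize across diverse environments and to handle dynamic content. The key is BTimer (short for BulletTimer), a motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. It reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ("bullet") timestamp by aggregating information from all context frames. This formulation improves scalability and generalization by leveraging both static and dynamic scene datasets. BTimer reconstructs a bullet-time scene within 150 ms and reaches state-of-the-art performance, even compared with optimization-based approaches.

Link: https://arxiv.org/abs/2412.03526
Authors: Hanxue Liang,Jiawei Ren,Ashkan Mirzaei,Antonio Torralba,Ziwei Liu,Igor Gilitschenski,Sanja Fidler,Cengiz Oztireli,Huan Ling,Zan Gojcic,Jiahui Huang
Keywords (EN): demonstrated significant progress, Recent advancements, demonstrated significant, significant progress, progress in high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project website: this https URL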

Abstract:Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for BulletTimer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target (‘bullet’) timestamp by aggregating information from all the context frames. Such a formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.

[CV-12] Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

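[Quick read]: This paper addresses cross-view and cross-frame consistency when generating multi-view videos for autonomous driving training. The key is CogDriving, a network that uses a Diffusion Transformer architecture with holistic-4D attention modules to associate information simultaneously across the spatial, temporal, and viewpoint dimensions. The paper also introduces Micro-Controller, a lightweight controller that uses only 1.1% of the parameters of the standard ControlNet and enables precise control over Bird's-Eye-View layouts. To improve the generation of object instances, it proposes a re-weighted learning objective that dynamically adjusts the learning weights of object instances during training. CogDriving achieves an FVD score of 37.8 on the nuScenes validation set, highlighting its ability to generate realistic driving videos.

Link: https://arxiv.org/abs/2412.03520
Authors: Hannan Lu,Xiaohe Wu,Shudong Wang,Xiameng Qin,Xinyu Zhang,Junyu Han,Wangmeng Zuo,Ji Tao
Keywords (EN): Generating multi-view videos, Generating multi-view, recently gained, challenge of addressing, addressing both cross-view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: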

Abstract:Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird’s-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at this https URL.
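
A re-weighted objective that emphasizes object instances can be illustrated with a weighted mean squared error in which positions covered by an instance mask count more than background. The abstract does not give the weighting rule or say how the weights change during training, so the fixed `instance_weight` and the function below are assumptions for illustration only.

```python
from typing import Sequence


def reweighted_mse(pred: Sequence[float], target: Sequence[float],
                   instance_mask: Sequence[bool],
                   instance_weight: float = 2.0) -> float:
    """Weighted MSE: masked (object-instance) positions get `instance_weight`,
    all other positions get weight 1. Normalized by the total weight."""
    if not (len(pred) == len(target) == len(instance_mask)):
        raise ValueError("inputs must have the same length")
    weights = [instance_weight if m else 1.0 for m in instance_mask]
    weighted_error = sum(w * (p - t) ** 2
                         for w, p, t in zip(weights, pred, target))
    return weighted_error / sum(weights)
```

A dynamic variant would recompute `instance_weight` per training step, for example from the current error on instance regions, but that schedule is left unspecified here because the abstract does not describe it.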

[CV-13] Dense Scene Reconstruction from Light-Field Images Affected by Rolling Shutter

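[Quick read]: This paper addresses inaccurate depth estimation in light-field (LF) images caused by strong rolling shutter (RS) effects. The key to the solution is a two-stage method that uses 2D Gaussian Splatting to implement a "render and compare" strategy with a point cloud formulation. The first stage uses a subset of sub-aperture images to estimate an RS-agnostic 3D shape that is related to the scene target shape; the second stage computes the deformation of that 3D shape by estimating an admissible camera motion. Experiments on different scenes and motion types demonstrate the method's effectiveness and advantages, and the authors also design a new synthetic dataset of RS light-field images for evaluation.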

Link: https://arxiv.org/abs/2412.03518
Authors: Hermes McGriff,Renato Martins,Nicolas Andreff,Cedric Demonceaux
Keywords: strong rolling shutter, dense depth estimation, depth estimation approach, rolling shutter, depth estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents a dense depth estimation approach from light-field (LF) images that is able to compensate for strong rolling shutter (RS) effects. Our method estimates RS compensated views and dense RS compensated disparity maps. We present a two-stage method based on 2D Gaussian Splatting that allows for a "render and compare" strategy with a point cloud formulation. In the first stage, a subset of sub-aperture images is used to estimate an RS agnostic 3D shape that is related to the scene target shape "up to a motion". In the second stage, the deformation of the 3D shape is computed by estimating an admissible camera motion. We demonstrate the effectiveness and advantages of this approach through several experiments conducted for different scenes and types of motions. Due to the lack of suitable datasets for evaluation, we also present a new carefully designed synthetic dataset of RS LF images. The source code, trained models and dataset will be made publicly available at: this https URL

[CV-14] NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

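[Quick read]: This paper addresses the reliance of existing novel view synthesis (NVS) methods for multi-view data on external multi-view alignment processes, which become unstable when views overlap too little or occlude each other. The key to the solution is NVComposer, an approach that needs no explicit external alignment. It introduces two core components: 1) an image-pose dual-stream diffusion model that simultaneously generates the target novel views and the condition camera poses; and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. These components let the generative model implicitly infer spatial and geometric relationships among multiple views, so it achieves high-quality novel view synthesis without external alignment and becomes more flexible and accessible.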

Link: https://arxiv.org/abs/2412.03517
Authors: Lingen Li,Zhaoyang Zhang,Yaowei Li,Jiale Xu,Xiaoyu Li,Wenbo Hu,Weihao Cheng,Jinwei Gu,Tianfan Xue,Ying Shan
Keywords: Recent advancements, significantly improved, Recent, multi-view data, alignment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL

Abstract:Recent advancements in generative models have significantly improved novel view synthesis (NVS) from multi-view data. However, existing methods depend on external multi-view alignment processes, such as explicit pose estimation or pre-reconstruction, which limits their flexibility and accessibility, especially when alignment is unstable due to insufficient overlap or occlusions between views. In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. NVComposer enables the generative model to implicitly infer spatial and geometric relationships between multiple conditional views by introducing two key components: 1) an image-pose dual-stream diffusion model that simultaneously generates target novel views and condition camera poses, and 2) a geometry-aware feature alignment module that distills geometric priors from dense stereo models during training. Extensive experiments demonstrate that NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks, removing the reliance on external alignment and thus improving model accessibility. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases, highlighting its potential for more flexible and accessible generative NVS systems.

[CV-15] Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

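[Quick read]: This paper addresses the slow sampling speed of diffusion-based 3D LiDAR scene completion models in practical use. The key to the solution is ScoreLiDAR, a distillation method that lets the distilled model sample in significantly fewer steps while keeping completion quality high. The paper also introduces a Structural Loss with scene-wise and point-wise constraints that encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. On SemanticKITTI, ScoreLiDAR cuts completion time from 30.55 to 5.37 seconds per frame and outperforms existing state-of-the-art models.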

Link: https://arxiv.org/abs/2412.03515
Authors: Shengyuan Zhang,An Zhao,Ling Yang,Zejian Li,Chenye Meng,Haoran Xu,Tianrun Chen,AnYang Wei,Perry Pengyun GU,Lingyun Sun
Keywords: LiDAR scene completion, strong training stability, scene completion models, scene completion, LiDAR scene
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel Structural Loss, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame (5×) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models. Our code is publicly available at this https URL.

[CV-16] Distillation of Diffusion Features for Semantic Correspondence WACV2025

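[Quick read]: This paper addresses the high computational cost and low efficiency of semantic correspondence methods that combine multiple large models. The key to the solution is a knowledge distillation technique that distills the capabilities of two large vision foundation models into one smaller model, keeping accuracy high at a much lower computational cost. Incorporating 3D data augmentation improves performance further without human-annotated correspondences. The resulting model outperforms current state-of-the-art methods while significantly reducing computational load, which makes it more practical for real-world applications such as semantic video correspondence.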

Link: https://arxiv.org/abs/2412.03512
Authors: Frank Fundel,Johannes Schusterbauer,Vincent Tao Hu,Björn Ommer
Keywords: visual place recognition, object tracking, place recognition, task of determining, determining relationships
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025, Page: this https URL

Abstract:Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.

[CV-17] A Bidirectional Siamese Recurrent Neural Network for Accurate Gait Recognition Using Body Landmarks

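[Quick read]: This paper addresses the accuracy and reliability of gait recognition in practical applications. The key to the solution is to extract sequential gait landmarks with the Mediapipe pose estimation model and align them with Procrustes analysis. The paper then proposes a Siamese biGRU-dualStack neural network architecture to capture temporal dependencies. Experiments on the large-scale cross-view datasets CASIA-B, SZU RGB-D, OU-MVLP, and Gait3D report recognition accuracies of 95.7%, 94.44%, 87.71%, and 86.6%, respectively.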

Link: https://arxiv.org/abs/2412.03498
Authors: Proma Hossain Progga,Md. Jobayer Rahman,Swapnil Biswas,Md. Shakil Ahmed,Arif Reza Anwary,Swakkhar Shatabda
Keywords: significant biometric technique, physiological biometrics, person identification, impractical or ineffective, Gait recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Gait recognition is a significant biometric technique for person identification, particularly in scenarios where other physiological biometrics are impractical or ineffective. In this paper, we address the challenges associated with gait recognition and present a novel approach to improve its accuracy and reliability. The proposed method leverages advanced techniques, including sequential gait landmarks obtained through the Mediapipe pose estimation model, Procrustes analysis for alignment, and a Siamese biGRU-dualStack Neural Network architecture for capturing temporal dependencies. Extensive experiments were conducted on large-scale cross-view datasets to demonstrate the effectiveness of the approach, achieving high recognition accuracy compared to other models. The model demonstrated accuracies of 95.7%, 94.44%, 87.71%, and 86.6% on CASIA-B, SZU RGB-D, OU-MVLP, and Gait3D datasets respectively. The results highlight the potential applications of the proposed method in various practical domains, indicating its significant contribution to the field of gait recognition.
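The Procrustes alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes 2D landmarks and a single reference pose; the abstract does not say how the authors configure the alignment (dimensionality, choice of reference, per-frame or per-sequence), so the function names and the setup below are hypothetical.

```python
import math


def normalise(points):
    """Centre 2D points on their centroid and scale them to unit norm."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]
    scale = math.sqrt(sum(x * x + y * y for x, y in centred))
    if scale == 0.0:
        raise ValueError("all points coincide; the shape has no scale")
    return [(x / scale, y / scale) for x, y in centred]


def procrustes_align_2d(points, reference):
    """Align `points` to `reference` by translation, scaling, and rotation.

    Returns (aligned_points, normalised_reference). Hypothetical helper,
    not the paper's implementation.
    """
    if len(points) != len(reference):
        raise ValueError("landmark sets must have the same length")
    p = normalise(points)
    r = normalise(reference)
    # Closed-form optimal rotation angle for the 2D least-squares fit.
    num = sum(px * ry - py * rx for (px, py), (rx, ry) in zip(p, r))
    den = sum(px * rx + py * ry for (px, py), (rx, ry) in zip(p, r))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    aligned = [(c * x - s * y, s * x + c * y) for x, y in p]
    return aligned, r
```

After this step, two landmark sets that differ only in position, size, and in-plane rotation map to the same coordinates, so a downstream sequence model sees pose shape rather than camera placement.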

[CV-18] Data Fusion of Semantic and Depth Information in the Context of Object Detection

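[Quick read]: For autonomous driving systems, this paper addresses the classification of surrounding objects (such as pedestrians), the estimation of their position in the ego vehicle's 3D coordinate system, and the measurement of the distance between the ego vehicle and the object. The key to the solution is to classify objects with Faster R-CNN (Region-based Convolutional Neural Network) using the Inception v2 architecture, and to compute each object's 3D reference point and distance through a series of computer vision steps, starting with a disparity map generated by stereo vision.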

Link: https://arxiv.org/abs/2412.03490
Authors: Md Abu Yusuf,Md Rezaul Karim Khan,Partha Pratim Saha,Mohammed Mahbubur Rahaman
Keywords: Considerable study, autonomous driving, autonomous driving system, modern era, Region-based Convolution Neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Considerable study has already been conducted regarding autonomous driving in the modern era. An autonomous driving system must be extremely good at detecting objects surrounding the car to ensure safety. In this paper, classification and estimation of an object's (pedestrian) position (concerning an ego 3D coordinate system) are studied, and the distance between the ego vehicle and the object in the context of autonomous driving is measured. To classify the object, a Faster Region-based Convolutional Neural Network (Faster R-CNN) with Inception v2 is utilized. First, a network is trained with a customized dataset to estimate the reference position of objects as well as the distance from the vehicle. From camera calibration to computing the distance, cutting-edge computer vision algorithms are applied in a series of processes to generate a 3D reference point of the region of interest. The foremost step in this process is generating a disparity map using the concept of stereo vision.
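The step from a disparity map to a 3D reference point and a distance follows the standard rectified-stereo relation Z = f·B/d. The sketch below is a textbook illustration under assumed camera parameters (focal length in pixels, baseline in metres, principal point), not the authors' pipeline; all function names are hypothetical.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth (metres) of a point in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


def backproject(u, v, depth_m, focal_px, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a known depth to camera coordinates."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


def distance_to(point):
    """Euclidean distance from the camera origin to a 3D point."""
    return math.sqrt(sum(c * c for c in point))
```

For example, with an assumed focal length of 700 px and a 0.5 m baseline, a disparity of 35 px at the centre of a detected bounding box corresponds to a depth of 10 m; back-projecting that pixel gives the 3D reference point whose norm is the distance to the object.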

[CV-19] Urban4D: Semantic-Guided 4D Gaussian Splatting for Urban Scene Reconstruction

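[Quick read]: This paper addresses the geometric structure and spatiotemporal dynamics involved in reconstructing dynamic urban scenes. The key to the solution is Urban4D, a framework that uses 2D semantic maps to classify Gaussians as dynamic or static and explicitly models dynamic objects with a 4D Gaussian splatting (4DGS) representation. Specifically, Urban4D distinguishes potentially dynamic objects through a semantic-guided decomposition strategy, then uses learnable time embeddings and a multilayer perceptron (MLP) to predict the deformation of each Gaussian at a given timestamp. To make static reconstruction more accurate, the paper also designs a k-nearest neighbor (KNN)-based consistency regularization for the low-texture ground surface.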

Link: https://arxiv.org/abs/2412.03473
Authors: Ziwen Li,Jiaxin Huang,Runnan Chen,Yunlong Che,Yandong Guo,Tongliang Liu,Fakhri Karray,Mingming Gong
Keywords: presents significant challenges, intrinsic geometric structures, Reconstructing dynamic urban, scenes presents significant, significant challenges due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic urban scenes presents significant challenges due to their intrinsic geometric structures and spatiotemporal dynamics. Existing methods that attempt to model dynamic urban scenes without leveraging priors on potentially moving regions often produce suboptimal results. Meanwhile, approaches based on manual 3D annotations yield improved reconstruction quality but are impractical due to labor-intensive labeling. In this paper, we revisit the potential of 2D semantic maps for classifying dynamic and static Gaussians and integrating spatial and temporal dimensions for urban scene representation. We introduce Urban4D, a novel framework that employs a semantic-guided decomposition strategy inspired by advances in deep 2D semantic map generation. Our approach distinguishes potentially dynamic objects through reliable semantic Gaussians. To explicitly model dynamic objects, we propose an intuitive and effective 4D Gaussian splatting (4DGS) representation that aggregates temporal information through learnable time embeddings for each Gaussian, predicting their deformations at desired timestamps using a multilayer perceptron (MLP). For more accurate static reconstruction, we also design a k-nearest neighbor (KNN)-based consistency regularization to handle the ground surface due to its low-texture characteristic. Extensive experiments on real-world datasets demonstrate that Urban4D not only achieves comparable or better quality than previous state-of-the-art methods but also effectively captures dynamic objects while maintaining high visual fidelity for static elements.
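To make the idea of a KNN-based consistency regularization concrete, here is a small sketch. It assumes the regularized quantity is the height of ground points relative to their k nearest neighbours in the ground plane; the abstract does not state which Gaussian attributes Urban4D actually constrains, so this choice and the function name are hypothetical.

```python
import math


def knn_height_consistency(points, k):
    """Mean squared gap between each point's height and the mean height
    of its k nearest neighbours (neighbours are found in the x-y plane).

    `points` is a list of (x, y, z) tuples. Illustrative only: the
    regularized attribute is an assumption, not taken from the paper.
    """
    if k < 1 or k >= len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    total = 0.0
    for i, (x, y, z) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        # Stable sort by planar distance; ties keep the input order.
        others.sort(key=lambda p: math.hypot(p[0] - x, p[1] - y))
        mean_z = sum(p[2] for p in others[:k]) / k
        total += (z - mean_z) ** 2
    return total / len(points)
```

A perfectly flat patch yields a penalty of zero, while a single point lifted off the ground raises the penalty for itself and for the neighbours that include it, which is the kind of pressure that smooths a low-texture surface.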

[CV-20] Measure Anything: Real-time Multi-stage Vision-based Dimensional Measurement using Segment Anything

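[Quick read]: This paper addresses vision-based dimensional measurement of objects, especially those with circular cross-sections, with a focus on measuring the diameter of canola stems in agriculture, a phenotypic trait that correlates with crop health and yield. The key to the solution is a vision pipeline built on the Segment Anything Model (SAM) that combines segmentation, mask processing, skeleton construction, and 2D-3D transformation to estimate geometric features such as diameter, length, and volume. Integrating intelligent models such as keypoint detection lets the framework automate measurement for high-throughput applications, and the paper also shows its versatility in robotic grasping, where the extracted geometric features identify optimal grasp points.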

Link: https://arxiv.org/abs/2412.03472
Authors: Yongkyu Lee,Shivam Kumar Panda,Wei Wang,Mohammad Khalid Jawed
Keywords: comprehensive vision-based framework, circular cross-sections, comprehensive vision-based, objects with circular, Segment Anything Model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Measure Anything, a comprehensive vision-based framework for dimensional measurement of objects with circular cross-sections, leveraging the Segment Anything Model (SAM). Our approach estimates key geometric features, including diameter, length, and volume, for rod-like geometries with varying curvature and general objects with constant skeleton slope. The framework integrates segmentation, mask processing, skeleton construction, and 2D-3D transformation, packaged in a user-friendly interface. We validate our framework by estimating the diameters of Canola stems, collected from agricultural fields in North Dakota, which are thin and non-uniform, posing challenges for existing methods. Measuring stem diameter is critical, as it is a phenotypic trait that correlates with the health and yield of Canola crops. This application also exemplifies the potential of Measure Anything, where integrating intelligent models, such as keypoint detection, extends its scalability to fully automate the measurement process for high-throughput applications. Furthermore, we showcase its versatility in robotic grasping, leveraging extracted geometric features to identify optimal grasp points.
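Once a skeleton and per-point diameters are available in metric 3D coordinates, length and volume follow from simple geometry. The sketch below assumes the skeleton is an ordered polyline and approximates each segment as a cylinder with the mean diameter of its two endpoints; this approximation and the function names are assumptions for illustration, not the framework's actual computation.

```python
import math


def rod_length(skeleton):
    """Arc length of an ordered 3D polyline given as (x, y, z) tuples."""
    return sum(math.dist(a, b) for a, b in zip(skeleton, skeleton[1:]))


def rod_volume(skeleton, diameters):
    """Volume of a rod with circular cross-sections along the skeleton.

    Each segment is treated as a cylinder whose diameter is the mean of
    the diameters measured at its two endpoints (an assumed approximation).
    """
    if len(skeleton) != len(diameters):
        raise ValueError("one diameter is required per skeleton point")
    volume = 0.0
    for i in range(len(skeleton) - 1):
        segment = math.dist(skeleton[i], skeleton[i + 1])
        radius = (diameters[i] + diameters[i + 1]) / 4.0
        volume += math.pi * radius ** 2 * segment
    return volume
```

For a straight rod sampled at three points one unit apart with a constant diameter of 2, this gives a length of 2 and a volume of 2π, matching the closed-form cylinder.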

[CV-21] Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning

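[Quick read]: This paper studies how multimodal instruction tuning affects the language reasoning ability of language models such as Vicuna and Mistral. It compares the original language models with their multimodal-adapted counterparts on eight language reasoning tasks and finds that the effect differs by model: language reasoning degrades for Mistral but improves for Vicuna on most tasks. Multimodal instruction tuning also degrades performance on mathematical reasoning tasks (e.g., GSM8K) while improving it on commonsense reasoning tasks (e.g., CommonsenseQA). Finally, the paper shows that a training-free model merging technique effectively mitigates the language reasoning degradation of multimodal-adapted Mistral and even improves its performance on visual tasks.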

Link: https://arxiv.org/abs/2412.03467
Authors: Neale Ratzlaff,Man Luo,Xin Su,Vasudev Lal,Phillip Howard
Keywords: powerful large language, models typically combine, Multimodal, language reasoning, typically combine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear whether this adaptation compromises their original language reasoning capabilities. In this work, we explore the effects of multimodal instruction tuning on language reasoning performance. We focus on LLaVA, a leading multimodal framework that integrates LLMs such as Vicuna or Mistral with the CLIP vision encoder. We compare the performance of the original LLMs with their multimodal-adapted counterparts across eight language reasoning tasks. Our experiments yield several key insights. First, the impact of multimodal learning varies between Vicuna and Mistral: we observe a degradation in language reasoning for Mistral but improvements for Vicuna across most tasks. Second, while multimodal instruction learning consistently degrades performance on mathematical reasoning tasks (e.g., GSM8K), it enhances performance on commonsense reasoning tasks (e.g., CommonsenseQA). Finally, we demonstrate that a training-free model merging technique can effectively mitigate the language reasoning degradation observed in multimodal-adapted Mistral and even improve performance on visual tasks.
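The abstract does not name the merging technique, so the following is only a generic illustration of what "training-free model merging" can look like: a linear interpolation between the weights of the original LLM and those of its multimodal-adapted counterpart. The dictionary layout, parameter names, and the `alpha` coefficient are all hypothetical.

```python
def merge_weights(base, tuned, alpha):
    """Linearly interpolate two sets of weights without any training.

    `base` and `tuned` map parameter names to flat lists of floats.
    alpha = 0 returns the base weights, alpha = 1 the tuned weights.
    Linear interpolation is an assumed technique, not confirmed by the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("both models must have the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base[name], tuned[name])
        ]
    return merged
```

Because no gradient step is involved, sweeping `alpha` only costs evaluation runs, which is what makes this family of techniques attractive for recovering a capability lost during fine-tuning.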

[CV-22] Gesture Classification in Artworks Using Contextual Image Features

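[Quick read]: This paper addresses the recognition of smell gestures in historical artworks, which can deepen art understanding and highlight the role of the sense of smell in cultural heritage. The key to the solution is to combine local features with global image context, which notably improves classification performance across different backbones.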

Link: https://arxiv.org/abs/2412.03456
Authors: Azhar Hussian,Mathias Zinnen,Thi My Hang Tran,Andreas Maier,Vincent Christlein
Keywords: Recognizing gestures, cultural heritage, add a valuable, valuable dimension, dimension to art
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recognizing gestures in artworks can add a valuable dimension to art understanding and help to acknowledge the role of the sense of smell in cultural heritage. We propose a method to recognize smell gestures in historical artworks. We show that combining local features with global image context improves classification performance notably on different backbones.

[CV-23] Pre-trained Multiple Latent Variable Generative Models are good defenders against Adversarial Attacks

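[Quick read]: This paper addresses adversarial attacks, in which an attacker adds subtle noise to a classifier's input to alter its final prediction. The key to the solution is to use Multiple Latent Variable Generative Models (MLVGMs) for adversarial purification. These models have multiple latent variables that naturally disentangle coarse from fine features. The method exploits this property to autoencode images, keeping class-relevant information while discarding and re-sampling details, including adversarial noise. The procedure is entirely training-free and relies on the generalization ability of pre-trained MLVGMs on the adversarial purification downstream task. Although these models are small and were not trained on billions of samples, the paper shows that they are already competitive with traditional methods and can serve as foundation models.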

Link: https://arxiv.org/abs/2412.03453
Authors: Dario Serez,Marco Cristani,Alessio Del Bue,Vittorio Murino,Pietro Morerio
Keywords: altering final predictions, deliberately perturb classifiers’, Attackers can deliberately, perturb classifiers’ input, altering final
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Attackers can deliberately perturb classifiers’ input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose specific generators, defined Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploring the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Despite the lack of large models, trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods, and can be used as foundation models. Official code released at this https URL.

[CV-24] PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes

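[Quick read]: This paper addresses fast and accurate surface reconstruction from multiview indoor images. The key to the solution is PlanarSplatting, which takes 3D planes as its main objective because of their compactness and structural expressiveness, and uses an explicit optimization framework that splats the 3D planes into 2.5D depth and normal maps to fit the surfaces of indoor scenes. PlanarSplatting operates directly on 3D plane primitives, so it does not depend on 2D/3D plane detection or on plane matching and tracking; together with a CUDA-based implementation of the planar splatting functions, this significantly improves reconstruction speed and geometric accuracy. Large-scale quantitative evaluation on the ScanNet and ScanNet++ datasets demonstrates these advantages, and the authors expect the method to be applied to structured data curation for surface reconstruction.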

Link: https://arxiv.org/abs/2412.03451
Authors: Bin Tan,Rui Yu,Yujun Shen,Nan Xue
Keywords: multiview indoor images, paper presents PlanarSplatting, surface reconstruction approach, planar surface reconstruction, surface reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:This paper presents PlanarSplatting, an ultra-fast and accurate surface reconstruction approach for multiview indoor images. We take the 3D planes as the main objective due to their compactness and structural expressiveness in indoor scenes, and develop an explicit optimization framework that learns to fit the expected surface of indoor scenes by splatting the 3D planes into 2.5D depth and normal maps. As our PlanarSplatting operates directly on the 3D plane primitives, it eliminates the dependencies on 2D/3D plane detection and plane matching and tracking for planar surface reconstruction. Furthermore, thanks to the essential merits of the plane-based representation and the CUDA-based implementation of planar splatting functions, PlanarSplatting reconstructs an indoor scene in 3 minutes while having significantly better geometric accuracy. Thanks to our ultra-fast reconstruction speed, the largest quantitative evaluation on the ScanNet and ScanNet++ datasets over hundreds of scenes clearly demonstrated the advantages of our method. We believe that our accurate and ultrafast planar surface reconstruction method will be applied in the structured data curation for surface reconstruction in the future. The code of our CUDA implementation will be publicly available. Project page: this https URL
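The geometric core of turning a 3D plane into a 2.5D depth value is a ray-plane intersection. The sketch below handles one pixel and one infinite plane written as n·X = offset in camera coordinates; the paper's bounded plane primitives, differentiable blending, and CUDA kernels are not reproduced here, and the function names and camera parameters are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def pixel_ray(u, v, focal_px, cx, cy):
    """Camera-frame ray through pixel (u, v) for a pinhole camera, with z = 1."""
    return ((u - cx) / focal_px, (v - cy) / focal_px, 1.0)


def plane_depth(ray_dir, normal, offset):
    """Depth (z in the camera frame) where a ray from the origin hits the plane.

    The plane is the set of points X with dot(normal, X) == offset.
    Returns None if the ray is parallel to the plane or the hit is behind
    the camera.
    """
    denom = dot(normal, ray_dir)
    if abs(denom) < 1e-12:
        return None
    t = offset / denom
    if t <= 0.0:
        return None
    return t * ray_dir[2]
```

Evaluating `plane_depth` for every pixel ray gives the depth map of that plane, and the plane's own normal fills the corresponding normal map; comparing both against observations is one way such a representation can be fitted.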

[CV-25] CleanDIFT: Diffusion Features without Noise

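[Quick read]: This paper addresses the dependence of large-scale pre-trained diffusion models on image noise when they are used to extract semantic features. The key to the solution is a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality semantic features on noise-free images. These features outperform previous diffusion features by a wide margin across many downstream tasks, and even beat ensemble-based methods at a fraction of the computational cost.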

Link: https://arxiv.org/abs/2412.03439
Authors: Nick Stracke,Stefan Andreas Baumann,Kolja Bauer,Frank Fundel,Björn Ommer
Keywords: powerful semantic descriptors, large-scale pre-trained diffusion, Internal features, pre-trained diffusion models, large-scale pre-trained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: for the project page and code, view this https URL

Abstract:Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

[CV-26] SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

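[Quick read]: This paper addresses the poor performance of existing models in singing video generation, which stems from the differences between singing audio and talking audio. The key to the solution is two core modules: a multi-scale spectral module and a spectral-filtering module. The multi-scale spectral module helps the model learn singing patterns in the spectral domain, while the spectral-filtering module helps it capture the human behaviors associated with singing audio. Both modules are integrated into a diffusion model to form SINGER, which improves the quality of generated singing videos. The paper also collects a high-quality in-the-wild audio-visual singing dataset to offset the lack of data in this area and support further research.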

Link: https://arxiv.org/abs/2412.03430
Authors: Yan Li,Ziya Zhou,Zhiqiang Wang,Wei Xue,Wenhan Luo,Yike Guo
Keywords: Recent advancements, significantly enhanced talking, talking face video, generation remains underexplored, face video generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The differences between human talking and singing limit the performance of existing talking face video generation models when applied to singing. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the effectiveness of existing models. We observe that the differences between singing and talking audios manifest in terms of frequency and amplitude. To address this, we have designed a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.

[CV-27] 2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction

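[Quick read]: This paper addresses high-fidelity indoor scene reconstruction, in particular the challenges posed by complex spatial structures and textureless regions. The key to the solution is 2DGS-Room, a method that uses 2D Gaussian Splatting for high-fidelity indoor reconstruction. It uses a seed-guided mechanism to control the distribution of 2D Gaussians and dynamically optimizes the density of seed points through adaptive growth and pruning. To improve geometric accuracy, it incorporates monocular depth and normal priors, which constrain details and textureless regions respectively. Multi-view consistency constraints further reduce artifacts and improve reconstruction quality. Experiments show state-of-the-art indoor scene reconstruction performance on the ScanNet and ScanNet++ datasets.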

Link: https://arxiv.org/abs/2412.03428
Authors: Wanting Zhang,Haodong Xiang,Zhichao Liao,Xiansong Lai,Xinghui Li,Long Zeng
Keywords: remains challenging due, scenes remains challenging, Gaussian Splatting, indoor scenes remains, remains challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The reconstruction of indoor scenes remains challenging due to the inherent complexity of spatial structures and the prevalence of textureless regions. Recent advancements in 3D Gaussian Splatting have improved novel view synthesis with accelerated processing but have yet to deliver comparable performance in surface reconstruction. In this paper, we introduce 2DGS-Room, a novel method leveraging 2D Gaussian Splatting for high-fidelity indoor scene reconstruction. Specifically, we employ a seed-guided mechanism to control the distribution of 2D Gaussians, with the density of seed points dynamically optimized through adaptive growth and pruning mechanisms. To further improve geometric accuracy, we incorporate monocular depth and normal priors to provide constraints for details and textureless regions respectively. Additionally, multi-view consistency constraints are employed to mitigate artifacts and further enhance reconstruction quality. Extensive experiments on ScanNet and ScanNet++ datasets demonstrate that our method achieves state-of-the-art performance in indoor scene reconstruction.

[CV-28] Deep Learning for Sea Surface Temperature Reconstruction under Cloud Occlusion

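[Quick read]: This paper addresses the observational gaps that cloud cover causes when Sea Surface Temperature (SST) is monitored by satellite infrared radiation detection. The key to the solution is to use deep neural networks to reconstruct the cloud-covered portions of satellite imagery while preserving the observed values in cloud-free areas. Using SST observations from the MODIS satellite, the best-performing architecture shows substantially lower error metrics than established methods, which improves the completeness and reliability of the data for environmental assessments, data-driven model training, climate research, and integration into model data assimilation workflows.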

Link: https://arxiv.org/abs/2412.03413
Authors: Andrea Asperti,Ali Aydogdu,Emanuela Clementi,Angelo Greco,Lorenzo Mentaschi,Fabio Merizzi,Pietro Miraglio,Paolo Oddo,Nadia Pinardi,Alessandro Testa
Keywords: Sea Surface Temperature, understanding Earth oceans, marine ecosystem health, significantly influencing weather, global energy balance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sea Surface Temperature (SST) is crucial for understanding Earth’s oceans and climate, significantly influencing weather patterns, ocean currents, marine ecosystem health, and the global energy balance. Large-scale SST monitoring relies on satellite infrared radiation detection, but cloud cover presents a major challenge, creating extensive observational gaps and hampering our ability to fully capture large-scale ocean temperature patterns. Efforts to address these gaps in existing L4 datasets have been made, but they often exhibit notable local and seasonal biases, compromising data reliability and accuracy. To tackle this challenge, we employed deep neural networks to reconstruct cloud-covered portions of satellite imagery while preserving the integrity of observed values in cloud-free areas, using MODIS satellite derived observations of SST. Our best-performing architecture showed significant skill improvements over established methodologies, achieving substantial reductions in error metrics when benchmarked against widely used approaches and datasets. These results underscore the potential of advanced AI techniques to enhance the completeness of satellite observations in Earth-science remote sensing, providing more accurate and reliable datasets for environmental assessments, data-driven model training, climate research, and seamless integration into model data assimilation workflows.
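The requirement to fill cloud-covered pixels while leaving cloud-free observations untouched can be expressed as a mask-based composite. The sketch below stands in for the network output with a plain grid called `predicted`; the grid layout, the use of `None` for missing observations, and the function names are assumptions, and the scoring function simply evaluates RMSE on the cells that were hidden.

```python
import math


def composite(observed, predicted, cloud_mask):
    """Keep observed values where the sky is clear, use predictions under clouds.

    All three arguments are 2D lists of the same shape. `cloud_mask` holds
    True for cloud-covered cells, where `observed` may be None.
    """
    return [
        [pred if cloudy else obs for obs, pred, cloudy in zip(o_row, p_row, m_row)]
        for o_row, p_row, m_row in zip(observed, predicted, cloud_mask)
    ]


def masked_rmse(truth, estimate, mask):
    """Root-mean-square error restricted to cells where `mask` is True."""
    errors = [
        (t - e) ** 2
        for t_row, e_row, m_row in zip(truth, estimate, mask)
        for t, e, m in zip(t_row, e_row, m_row)
        if m
    ]
    if not errors:
        raise ValueError("mask selects no cells")
    return math.sqrt(sum(errors) / len(errors))
```

Scoring only the masked cells matters because the composite reproduces the observations exactly everywhere else, so an error averaged over the full grid would look better than the reconstruction really is.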

[CV-29] PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

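[Quick read]: This paper addresses the significant computational and memory overhead that large vision-language models (LVLMs) incur during inference because of long input and output sequences. The key to the solution is PrefixKV, which reframes the challenge of determining the KV cache size of every layer as a search for the optimal global prefix configuration. PrefixKV uses an adaptive layer-wise KV retention recipe based on binary search so that each layer preserves as much contextual information as possible, improving inference efficiency while maintaining generation quality. Experiments show that the method offers a superior trade-off between inference efficiency and generation quality compared with other methods, which suggests potential for practical applications.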

Link: https://arxiv.org/abs/2412.03409
Authors: Ao Wang,Hui Chen,Jianchao Tan,Kefeng Zhang,Xunliang Cai,Zijia Lin,Jungong Han,Guiguang Ding
Keywords: rapidly gained popularity, large vision-language models, diverse multimodal inputs, large vision-language, rapidly gained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures;

Abstract:Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders the efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during the next token prediction. This results in the significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves the state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at this https URL.
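One way to read "searching for the optimal global prefix configuration" with binary search is sketched below. It assumes each layer has a list of non-negative importance scores for its cached entries, that a layer keeps the shortest importance-sorted prefix covering a shared fraction of its total importance, and that the shared fraction is found by binary search under a total cache budget. The scoring, the shared-fraction formulation, and the function names are assumptions for illustration rather than the paper's exact recipe.

```python
def prefix_sizes(layer_scores, share):
    """Per layer, the shortest importance-sorted prefix whose cumulative
    importance reaches `share` (a fraction in [0, 1]) of the layer total."""
    sizes = []
    for scores in layer_scores:
        ordered = sorted(scores, reverse=True)
        total = sum(ordered)
        running, kept = 0.0, 0
        for score in ordered:
            if running >= share * total:
                break
            running += score
            kept += 1
        sizes.append(kept)
    return sizes


def search_share(layer_scores, budget, iterations=30):
    """Binary-search the largest shared fraction whose prefixes fit the budget.

    Returns (share, per_layer_sizes). The retained total is monotone in
    `share`, which is what makes binary search applicable.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if sum(prefix_sizes(layer_scores, mid)) <= budget:
            lo = mid  # feasible: try to keep more context
        else:
            hi = mid  # over budget: keep less
    return lo, prefix_sizes(layer_scores, lo)
```

In this toy setting a layer whose importance is concentrated in a few entries ends up with a small cache, while a layer with a flat importance distribution receives more of the budget, which mirrors the abstract's point that a uniform per-layer cache size ignores differences between layers.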

[CV-30] Skel3D: Skeleton Guided Novel View Synthesis

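[Quick read]: This paper tackles monocular open-set novel view synthesis (NVS), aiming to improve the pose accuracy and multi-view consistency of synthesized views without relying on explicit 3D representations. The key to the solution is a skeleton guide layer placed after the existing ray conditioning normalization (RCN) layer; it uses the animated objects with bone structures in the Objaverse dataset to give the generative model detailed structural information. The approach improves the quality of synthesized views, shows significantly better consistency and accuracy across object categories in Objaverse, and outperforms existing state-of-the-art NVS techniques.

Link: https://arxiv.org/abs/2412.03407
Authors: Aron Fóthi, Bence Fazekas, Natabara Máté Gyöngyössy, Kristian Fenech
Keywords: underlying diffusion model, leverages object skeletons, monocular open-set, underlying diffusion, Objaverse dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: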

Abstract:In this paper, we present an approach for monocular open-set novel view synthesis (NVS) that leverages object skeletons to guide the underlying diffusion model. Building upon a baseline that utilizes a pre-trained 2D image generator, our method takes advantage of the Objaverse dataset, which includes animated objects with bone structures. By introducing a skeleton guide layer following the existing ray conditioning normalization (RCN) layer, our approach enhances pose accuracy and multi-view consistency. The skeleton guide layer provides detailed structural information for the generative model, improving the quality of synthesized views. Experimental results demonstrate that our skeleton-guided method significantly enhances consistency and accuracy across diverse object categories within the Objaverse dataset. Our method outperforms existing state-of-the-art NVS techniques both quantitatively and qualitatively, without relying on explicit 3D representations.

[CV-31] Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy

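[Quick read]: This paper addresses the loss of spatial orientation that novice surgeons face in robot-assisted minimally invasive esophagectomy (RAMIE). The key to the solution is a comprehensive semantic segmentation dataset to support surgical navigation. It contains the largest collection of vital anatomical structures and surgical instruments to date and is intended to help recognize complex structures such as nerves. The paper benchmarks eight real-time deep learning models, covering both traditional and attention-based networks, and finds that attention-based models are better at capturing global patterns and handling occlusion: SegNeXt and Mask2Former achieve higher Dice scores, and Mask2Former additionally excels in average symmetric surface distance. Pretraining on ADE20k also proves more effective than pretraining on ImageNet.

Link: https://arxiv.org/abs/2412.03401
Authors: Ronald L.P.D. de Jong, Yasmina al Khalil, Tim J.M. Jaspers, Romy C. van Jaarsveld, Gino M. Kuiper, Yiping Li, Richard van Hillegersberg, Jelle P. Ruurda, Marcel Breeuwer, Fons van der Sommen
Keywords: Esophageal cancer, cancer worldwide, common types, Esophageal, types of cancer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the SPIE Medical Imaging Conference, 2025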

Abstract:Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.
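For readers unfamiliar with the Dice score used in this benchmark, here is a minimal sketch of the standard definition for binary masks. The flat 0/1 list representation and the convention of returning 1.0 for two empty masks are assumptions made for illustration; the paper's evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Prediction covers two pixels, ground truth one; they overlap on one pixel.
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))  # 2*1 / (2+1)
```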

[CV-32] Implicit Priors Editing in Stable Diffusion via Targeted Token Adjustment

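[Quick read]: This paper addresses the fact that the implicit assumptions and priors used in text-to-image generation can carry outdated concepts, inaccuracies, or societal bias. The key to the solution is Embedding-only Editing (Embedit), which fine-tunes only the word token embedding (WTE) of the target object to optimize the last hidden state of the text encoder in Stable Diffusion, adjusting the model's implicit assumptions and priors efficiently while leaving the remaining model weights and the embeddings of unrelated objects unchanged. The method modifies only 768 parameters (Stable Diffusion 1.4) or 2048 parameters (XL) per edit and is easy to revert; experiments show it outperforms previous methods by at least 6.01 percentage points (87.17% to 93.18%).

Link: https://arxiv.org/abs/2412.03400
Authors: Feng He, Chao Zhang, Zhixue Zhao
Keywords: lack sufficient context, textual prompts lack, prompts lack sufficient, sufficient context, lack sufficient
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: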

Abstract: Implicit assumptions and priors are often necessary in text-to-image generation tasks, especially when textual prompts lack sufficient context. However, these assumptions can sometimes reflect outdated concepts, inaccuracies, or societal bias embedded in the training data. We present Embedding-only Editing (Embedit), a method designed to efficiently adjust implicit assumptions and priors in the model without affecting its interpretation of unrelated objects or overall performance. Given a “source” prompt (e.g., “rose”) that elicits an implicit assumption (e.g., rose is red) and a “destination” prompt that specifies the desired attribute (e.g., “blue rose”), Embedit fine-tunes only the word token embedding (WTE) of the target object (“rose”) to optimize the last hidden state of text encoder in Stable Diffusion, a SOTA text-to-image model. This targeted adjustment prevents unintended effects on other objects in the model’s knowledge base, as the WTEs for unrelated objects and the model weights remain unchanged. Consequently, when a prompt does not contain the edited object, all representations and the model outputs are identical to those of the original, unedited model. Our method is highly efficient, modifying only 768 parameters for Stable Diffusion 1.4 and 2048 for XL in a single edit, matching the WTE dimension of each respective model. This minimal scope, combined with rapid execution, makes Embedit highly practical for real-world applications. Additionally, changes are easily reversible by restoring the original WTE layers. Our experimental results demonstrate that Embedit consistently outperforms previous methods across various models, tasks, and editing scenarios (both single and sequential multiple edits), achieving at least a 6.01% improvement (from 87.17% to 93.18%).
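The idea of updating a single embedding row while everything else stays frozen can be sketched with a toy example. Everything here is a stand-in chosen for illustration: the "encoder" simply averages token embeddings (a real text encoder is far more complex), the loss is a squared distance to the frozen destination representation, and the names `encode` and `edit_embedding` are hypothetical. It is not the Embedit implementation.

```python
import copy


def encode(prompt, table):
    """Toy stand-in for a text encoder: average the token embeddings."""
    dim = len(next(iter(table.values())))
    return [sum(table[tok][i] for tok in prompt) / len(prompt) for i in range(dim)]


def edit_embedding(table, token, source, destination, lr=0.1, steps=200):
    """Gradient descent on one embedding row so encode(source) approaches a frozen target."""
    target = encode(destination, table)        # computed once with the original table
    edited = copy.deepcopy(table)              # the original table is left untouched
    share = source.count(token) / len(source)  # d(encode)/d(row) for the averaging encoder
    for _ in range(steps):
        hidden = encode(source, edited)
        grad = [2.0 * (h - t) * share for h, t in zip(hidden, target)]
        edited[token] = [w - lr * g for w, g in zip(edited[token], grad)]
    return edited


table = {"rose": [1.0, 0.0], "blue": [0.0, 1.0], "car": [0.5, 0.5]}
edited = edit_embedding(table, "rose", ["rose"], ["blue", "rose"])
print(edited["rose"], edited["car"] == table["car"])
```

Only the row for the edited token changes, so any prompt that does not contain it is encoded exactly as before, and restoring the original row undoes the edit.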

[CV-33] Mapping using Transformers for Volumes – Network for Super-Resolution with Long-Range Interactions

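[Quick read]: This paper addresses the difficulty volumetric super-resolution has in exploiting recent advances in transformer-based models. The memory required for self-attention in 3D volumes limits the receptive field, so long-range interactions are used far less in 3D than in 2D and the strength of transformers goes unrealized. The key to the solution is a multi-scale transformer-based model that combines hierarchical attention blocks with carrier tokens at multiple scales. With transformer layers at each resolution, information is passed step by step from coarse to fine resolution, which limits the number of tokens at each scale and enables attention over larger regions than before. Experiments show that the method (MTVNet) outperforms state-of-the-art models on five 3D datasets, with the advantage most pronounced for images larger than those in commonly used 3D datasets.

Link: https://arxiv.org/abs/2412.03379
Authors: August Leander Høeg, Sophia W. Bardenfleth, Hans Martin Kjer, Tim B. Dyrby, Vedrana Andersen Dahl, Anders Dahl
Keywords: utilize the recent, recent advances, receptive field, Abstract, volumetric super-resolution
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 14 pages, 8 figures with supplementary material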

Abstract:Until now, it has been difficult for volumetric super-resolution to utilize the recent advances in transformer-based models seen in 2D super-resolution. The memory required for self-attention in 3D volumes limits the receptive field. Therefore, long-range interactions are not used in 3D to the extent done in 2D and the strength of transformers is not realized. We propose a multi-scale transformer-based model based on hierarchical attention blocks combined with carrier tokens at multiple scales to overcome this. Here information from larger regions at coarse resolution is sequentially carried on to finer-resolution regions to predict the super-resolved image. Using transformer layers at each resolution, our coarse-to-fine modeling limits the number of tokens at each scale and enables attention over larger regions than what has previously been possible. We experimentally compare our method, MTVNet, against state-of-the-art volumetric super-resolution models on five 3D datasets demonstrating the advantage of an increased receptive field. This advantage is especially pronounced for images that are larger than what is seen in popularly used 3D datasets. Our code is available at this https URL
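A quick back-of-the-envelope calculation shows why full self-attention is so much more expensive in 3D than in 2D. The side length and patch size below are illustrative assumptions, not values from the paper, and `attention_entries` is a hypothetical helper.

```python
def attention_entries(side, patch, dims):
    """Token count and number of pairwise attention scores for a square or cubic input."""
    tokens = (side // patch) ** dims
    return tokens, tokens * tokens


# Same side length and patch size, 2D image versus 3D volume.
print(attention_entries(128, 4, dims=2))  # (1024, 1048576)
print(attention_entries(128, 4, dims=3))  # (32768, 1073741824)
```

The token count grows cubically with side length and the attention matrix grows with its square, which is why limiting the number of tokens per scale matters for widening the receptive field.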

[CV-34] Volumetrically Consistent 3D Gaussian Rasterization

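[Quick read]: This paper addresses the reduced physical accuracy of 3D Gaussian Splatting (3DGS) in view synthesis, caused by the approximations in its splatting-based rendering model. The key to the solution is to volumetrically integrate 3D Gaussians directly and compute transmittance analytically, which yields more physically accurate alpha values than 3DGS. The method follows the volume rendering equation more closely (similar to ray tracing) while keeping the speed benefits of rasterization. It represents opaque surfaces with higher accuracy and fewer points, outperforms 3DGS for view synthesis (measured in SSIM and LPIPS), and works out of the box for tomography, matching the state-of-the-art 3DGS-based tomography method with fewer points.

Link: https://arxiv.org/abs/2412.03378
Authors: Chinmay Talegaonkar, Yash Belhe, Ravi Ramamoorthi, Nicholas Antipa
Keywords: high inference speeds, enabled photorealistic view, Gaussian Splatting, enabled photorealistic, high inference
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: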

Abstract: Recently, 3D Gaussian Splatting (3DGS) has enabled photorealistic view synthesis at high inference speeds. However, its splatting-based rendering model makes several approximations to the rendering equation, reducing physical accuracy. We show that splatting and its approximations are unnecessary, even within a rasterizer; we instead volumetrically integrate 3D Gaussians directly to compute the transmittance across them analytically. We use this analytic transmittance to derive more physically-accurate alpha values than 3DGS, which can directly be used within their framework. The result is a method that more closely follows the volume rendering equation (similar to ray-tracing) while enjoying the speed benefits of rasterization. Our method represents opaque surfaces with higher accuracy and fewer points than 3DGS. This enables it to outperform 3DGS for view synthesis (measured in SSIM and LPIPS). Being volumetrically consistent also enables our method to work out of the box for tomography. We match the state-of-the-art 3DGS-based tomography method with fewer points.
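To make "integrate the Gaussian along the ray analytically" concrete, here is a sketch under stated assumptions. It assumes the density is `peak_density * exp(-0.5 * (x - mean)^T P (x - mean))` with precision matrix `P`, integrates along an infinite line `origin + t * direction`, and defines alpha as one minus the transmittance. Completing the square in `t` gives the closed form used below. The paper's exact formulation may differ; the function names are hypothetical.

```python
import math


def quad(u, P, v):
    """u^T P v for 3-vectors u, v and a 3x3 matrix P."""
    return sum(u[i] * P[i][j] * v[j] for i in range(3) for j in range(3))


def gaussian_alpha(origin, direction, mean, precision, peak_density):
    """Opacity 1 - exp(-optical depth) of one Gaussian along an infinite line."""
    delta = [o - m for o, m in zip(origin, mean)]
    a = quad(direction, precision, direction)
    b = quad(direction, precision, delta)
    c = quad(delta, precision, delta)
    # Exponent -0.5*(a t^2 + 2 b t + c) = -0.5*a*(t + b/a)^2 - 0.5*(c - b^2/a),
    # so the line integral is sqrt(2*pi/a) * exp(-0.5*(c - b^2/a)).
    depth = peak_density * math.sqrt(2.0 * math.pi / a) * math.exp(-0.5 * (c - b * b / a))
    return 1.0 - math.exp(-depth)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha = gaussian_alpha([1.0, 0.0, -5.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0], identity, 0.5)
print(alpha)
```

The closed form can be checked against a brute-force numerical integral along the same line, which is a useful sanity test for any analytic transmittance formula.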

[CV-35] SGSST: Scaling Gaussian Splatting StyleTransfer

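[Quick read]: This paper addresses style transfer for full 3D environments, in particular high-resolution 3D scenes. The key to the solution is SGSST (Scaling Gaussian Splatting Style Transfer), an optimization-based method, together with a new multiscale loss called SOS (Simultaneously Optimized Scales). SOS is based on global neural statistics and makes style transfer possible on ultra-high-resolution 3D scenes, raising image resolution and improving visual quality, as confirmed by qualitative, quantitative, and perceptual comparisons.

Link: https://arxiv.org/abs/2412.03371
Authors: Bruno Galerne, Jianling Wang, Lara Raad, Jean-Michel Morel
Keywords: Applying style transfer, Scaling Gaussian Splatting, style transfer, Applying style, Gaussian Splatting Style
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: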

Abstract: Applying style transfer to a full 3D environment is a challenging task that has seen many developments since the advent of neural rendering. 3D Gaussian splatting (3DGS) has recently pushed many limits of neural rendering further in terms of training speed and reconstruction quality. This work introduces SGSST: Scaling Gaussian Splatting Style Transfer, an optimization-based method to apply style transfer to pretrained 3DGS scenes. We demonstrate that a new multiscale loss based on global neural statistics, which we name SOS for Simultaneously Optimized Scales, enables style transfer to ultra-high resolution 3D scenes. Not only does SGSST pioneer 3D scene style transfer at such high image resolutions, it also produces superior visual quality as assessed by thorough qualitative, quantitative and perceptual comparisons.

[CV-36] TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

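[Quick read]: This paper addresses the inefficient transfer of low-resolution (LR) image information in diffusion models for image super-resolution. The key to the solution is a timestep-aware diffusion model that adaptively fuses features from ControlNet and the pre-trained Stable Diffusion (SD) model. It strengthens the transmission of LR information in the early stages of denoising to guarantee image fidelity, and relies more on the generative ability of the SD model in the later stages to enhance detail. The paper also proposes a timestep-aware training strategy that applies different losses at different timesteps to different modules.

Link: https://arxiv.org/abs/2412.03355
Authors: Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang
Keywords: recently achieved outstanding, achieved outstanding results, recently achieved, achieved outstanding, outstanding results
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: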

Abstract: Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet. In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: this https URL

[CV-37] Intuitive Axial Augmentation Using Polar-Sine-Based Piecewise Distortion for Medical Slice-Wise Segmentation

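[Quick read]: This paper addresses the reliance of data-driven medical image analysis models on universal augmentations, which are effective but whose unclear mechanism hinders acceptance and trust in the medical community. The key to the solution is a medical-specific augmentation algorithm that is more elastic and closely matches the radiology scan procedure. It performs a piecewise affine transform in polar coordinates, with rays distorted sinusoidally as a function of radius, to simulate the uncertain postures of a person lying on the scanning table, generating visceral distributions without changing the basic relative positions in the axial plane. Two non-adaptive algorithms, Meta-based Scan Table Removal and Similarity-Guided Parameter Search, are introduced to make the augmentation more robust. Experiments show that the method improves the accuracy of several well-known segmentation frameworks without additional data samples.

Link: https://arxiv.org/abs/2412.03352
Authors: Yiqin Zhang, Qingkui Chen, Chen Huang, Zhengjie Zhang, Meiling Chen, Zhibing Fu
Keywords: image analysis rely, medical image analysis, data-driven models, analysis rely, rely on universal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: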

Abstract: Most data-driven models for medical image analysis rely on universal augmentations to improve performance. Experimental evidence has confirmed their effectiveness, but the unclear mechanism underlying them poses a barrier to the widespread acceptance and trust in such methods within the medical community. We revisit and acknowledge the unique characteristics of medical images apart from traditional digital images, and consequently propose a medical-specific augmentation algorithm that is more elastic and aligns well with the radiology scan procedure. The method performs a piecewise affine transform with sinusoidally distorted rays according to radius in polar coordinates, thus simulating uncertain postures of a human lying flat on the scanning table. Our method can generate human visceral distributions without affecting the fundamental relative positions in the axial plane. Two non-adaptive algorithms, namely Meta-based Scan Table Removal and Similarity-Guided Parameter Search, are introduced to bolster the robustness of our augmentation method. Experiments show our method improves accuracy across multiple well-known segmentation frameworks without requiring more data samples. Our preview code is available at: this https URL.

[CV-38] Fairer Analysis and Demographically Balanced Face Generation for Fairer Face Verification

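[Quick read]: This paper addresses fairness in face recognition and verification, in particular the bias that remains when generative AI is used to create fictitious identities. The key to the solution is a new controlled generation pipeline built on the existing DCFace SOTA framework. Using classical fairness metrics and an in-depth statistical analysis based on logit models and ANOVA, the paper shows that the pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.

Link: https://arxiv.org/abs/2412.03349
Authors: Alexandre Fournier-Montgieux, Michael Soumm, Adrian Popescu, Bertrand Luvison, Hervé Le Borgne
Keywords: computer vision tasks, deep representations, recognition and verification, computer vision, vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: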

Abstract:Face recognition and verification are two computer vision tasks whose performances have advanced with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive nature of face data and biases in real-world training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems remain. Using the existing DCFace SOTA framework, we introduce a new controlled generation pipeline that improves fairness. Through classical fairness metrics and a proposed in-depth statistical analysis based on logit models and ANOVA, we show that our generation pipeline improves fairness more than other bias mitigation approaches while slightly improving raw performance.

[CV-39] DIVE: Taming DINO for Subject-Driven Video Editing

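[Quick read]: This paper addresses the challenge of maintaining temporal consistency and motion alignment in video editing. The key to the solution is DINO-guided Video Editing (DIVE), a framework that uses the semantic features extracted by a pretrained DINOv2 model as implicit correspondences to guide editing. To ensure temporal motion consistency, DIVE aligns DINO features with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity.

Link: https://arxiv.org/abs/2412.03347
Authors: Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen
Keywords: gained substantial attention, recently gained substantial, substantial attention, success of diffusion, recently gained
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: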

Abstract: Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject’s identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: this https URL
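Since the abstract relies on Low-Rank Adaptations (LoRAs), a generic sketch of the low-rank update may help. It shows the usual form, a frozen weight plus a scaled product of two small matrices, and is not specific to DIVE; the helper names and the example dimensions are assumptions for illustration.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W, B, A, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen, only B and A are trained."""
    update = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, update)]


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]    # d_out x rank, rank = 1
A = [[3.0, 4.0]]      # rank x d_in
print(lora_weight(W, B, A))                        # [[4.0, 4.0], [6.0, 9.0]]
print(lora_param_count(768, 768, 4), 768 * 768)    # 6144 vs 589824
```

The parameter comparison shows why such adapters are a lightweight way to register a new subject identity without retraining the full model.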

[CV-40] UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection

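[Quick read]: This paper addresses the limited generality and scalability of visual anomaly detection (VAD) across domains. Existing methods are usually designed for a specific domain and follow a "one-category-one-model" paradigm, which generalizes poorly and prevents unified evaluation across domains. The key to the solution is UniVAD, a generalized few-shot VAD method. UniVAD is a training-free unified model that needs only a few normal samples as references at test time to detect anomalies in previously unseen objects, without training on the specific domain. Its core components are a Contextual Component Clustering (C^3) module based on clustering and vision foundation models, a Component-Aware Patch Matching (CAPM) module, and a Graph-Enhanced Component Modeling (GECM) module; they detect anomalies at different semantic levels, and their outputs are aggregated into the final result. Experiments show that UniVAD achieves state-of-the-art performance on few-shot anomaly detection across domains.

Link: https://arxiv.org/abs/2412.03342
Authors: Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Keywords: Visual Anomaly Detection, identify abnormal samples, Visual Anomaly, aims to identify, identify abnormal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL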

Abstract: Visual Anomaly Detection (VAD) aims to identify abnormal samples in images that deviate from normal patterns, covering multiple domains, including industrial, logical, and medical fields. Due to the domain gaps between these fields, existing VAD methods are typically tailored to each domain, with specialized detection techniques and model architectures that are difficult to generalize across different domains. Moreover, even within the same domain, current VAD approaches often follow a “one-category-one-model” paradigm, requiring large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, such as industrial, logical, and medical anomalies, with a training-free unified model. UniVAD only needs a few normal samples as references during testing to detect anomalies in previously unseen objects, without training on the specific domain. Specifically, UniVAD employs a Contextual Component Clustering (C^3) module based on clustering and vision foundation models to segment components within the image accurately, and leverages Component-Aware Patch Matching (CAPM) and Graph-Enhanced Component Modeling (GECM) modules to detect anomalies at different semantic levels, which are aggregated to produce the final detection result. We conduct experiments on nine datasets spanning industrial, logical, and medical fields, and the results demonstrate that UniVAD achieves state-of-the-art performance in few-shot anomaly detection tasks across multiple domains, outperforming domain-specific anomaly detection models. The code will be made publicly available.
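The general idea behind training-free patch matching can be sketched as a nearest-neighbor comparison against a few normal references. This is a simplified illustration that assumes patch features are plain vectors and ignores the component awareness and graph modeling described in the abstract; `patch_anomaly_scores` is a hypothetical name, not the CAPM module.

```python
import math


def patch_anomaly_scores(query_patches, reference_patches):
    """Score each query patch by its distance to the nearest patch from the normal references."""
    return [min(math.dist(q, r) for r in reference_patches) for q in query_patches]


# Hypothetical 2-D patch features from a few normal reference images.
references = [[0.0, 0.0], [1.0, 0.0]]
queries = [[0.0, 0.0], [1.0, 0.1], [5.0, 5.0]]
scores = patch_anomaly_scores(queries, references)
print(scores, max(scores))  # the image-level score can be taken as the maximum patch score
```

Because nothing is fitted, adding a new object category only requires supplying new reference patches at test time.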

[CV-41] A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs

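[Quick read]: This paper addresses the efficiency problem that large vision-language models (VLMs) face when processing large numbers of visual tokens. The key to the solution is a training-free method, Small VLM Guidance for accelerating Large VLMs (SGL). The method uses the global attention map aggregated from a small VLM to guide visual token pruning in the large VLM, and adds an early exiting mechanism that invokes the large VLM only when necessary, reaching a visual token pruning ratio of up to 91% while retaining competitive performance.

Link: https://arxiv.org/abs/2412.03324
Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You
Keywords: shown remarkable success, encounter significant efficiency, significant efficiency challenges, efficiency challenges due, Vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: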

Abstract: Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, the attention maps from all layers require a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce a training-free method, Small VLM Guidance for accelerating Large VLMs (SGL). Specifically, we employ the attention map aggregated from a small VLM to guide visual token pruning in a large VLM. Additionally, an early exiting mechanism is developed to fully use the small VLM’s predictions, dynamically invoking the larger VLM only when necessary, yielding a superior trade-off between accuracy and computation. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of SGL, achieving up to 91% pruning ratio for visual tokens while retaining competitive performance.
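The pruning step can be pictured as ranking visual tokens by an attention score aggregated over all layers and keeping only the top fraction. The sketch below assumes the aggregation is a simple mean over layers and that scores are already available per token; both are simplifications, and `aggregate_attention` and `keep_indices` are hypothetical names.

```python
def aggregate_attention(layers):
    """Average per-token attention scores over all layers (each layer is a list of scores)."""
    n = len(layers[0])
    return [sum(layer[i] for layer in layers) / len(layers) for i in range(n)]


def keep_indices(scores, retention_ratio):
    """Indices of the highest-scoring tokens, returned in their original order."""
    k = max(1, round(len(scores) * retention_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical scores from a small model: 2 layers, 4 visual tokens.
small_model_layers = [[0.1, 0.6, 0.2, 0.1], [0.3, 0.5, 0.1, 0.1]]
scores = aggregate_attention(small_model_layers)
print(keep_indices(scores, retention_ratio=0.5))  # [0, 1]
```

In the setting the abstract describes, the scores would come from the small VLM and the kept indices would select which visual tokens the large VLM processes.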

[CV-42] Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis

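[Quick read]: This paper addresses cross-view synthesis: generating ground-level images from satellite imagery (Sat2Grd) and satellite images from ground-level imagery (Grd2Sat). The key to the solution is to use diffusion models to handle the one-to-many nature of the task, where one input image may correspond to many plausible outputs. The paper proposes a Geometry-guided Cross-view Condition (GCC) strategy that establishes explicit geometric correspondences between satellite and ground-view features, resolving the geometric ambiguity introduced by camera pose and improving the quality and diversity of the synthesized images.

Link: https://arxiv.org/abs/2412.03315
Authors: Tao Jun Lin, Wenqing Wang, Yujiao Shi, Akhil Perincherry, Ankit Vora, Hongdong Li
Keywords: generating plausible ground-level, plausible ground-level images, vice versa, paper presents, aimed at generating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: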

Abstract:This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively. Unlike previous works that typically focus on one-to-one generation, producing a single output image from a single input image, our approach acknowledges the inherent one-to-many nature of the problem. This recognition stems from the challenges posed by differences in illumination, weather conditions, and occlusions between the two views. To effectively model this uncertainty, we leverage recent advancements in diffusion models. Specifically, we exploit random Gaussian noise to represent the diverse possibilities learnt from the target view data. We introduce a Geometry-guided Cross-view Condition (GCC) strategy to establish explicit geometric correspondences between satellite and street-view features. This enables us to resolve the geometry ambiguity introduced by camera pose between image pairs, boosting the performance of cross-view image synthesis. Through extensive quantitative and qualitative analyses on three benchmark cross-view datasets, we demonstrate the superiority of our proposed geometry-guided cross-view condition over baseline methods, including recent state-of-the-art approaches in cross-view image synthesis. Our method generates images of higher quality, fidelity, and diversity than other state-of-the-art approaches.

[CV-43] Equivariant Representation Learning for Augmentation-based Self-Supervised Learning via Image Reconstruction

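[Quick read]: This paper addresses the fact that augmentation-based self-supervised learning methods excel at learning invariant features but fall short on equivariant ones, which limits how well foundation models generalize to downstream tasks that require equivariance. The key to the solution is to add an image reconstruction task as an auxiliary component: a cross-attention mechanism blends the features learned from two augmented views and then reconstructs one of them, encouraging equivariant feature learning without additional parameters. The approach applies to various datasets and augmented-pair based learning methods. Evaluations on artificial (3DIEBench) and natural (ImageNet) datasets show significant improvements in equivariant feature learning over standard augmentation-based self-supervised methods and state-of-the-art approaches, especially with combined augmentations.

Link: https://arxiv.org/abs/2412.03314
Authors: Qin Wang, Kai Krajsek, Hanno Scharr
Keywords: shown remarkable success, Augmentation-based self-supervised learning, Augmentation-based self-supervised, shown remarkable, remarkable success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: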

Abstract:Augmentation-based self-supervised learning methods have shown remarkable success in self-supervised visual representation learning, excelling in learning invariant features but often neglecting equivariant ones. This limitation reduces the generalizability of foundation models, particularly for downstream tasks requiring equivariance. We propose integrating an image reconstruction task as an auxiliary component in augmentation-based self-supervised learning algorithms to facilitate equivariant feature learning without additional parameters. Our method implements a cross-attention mechanism to blend features learned from two augmented views, subsequently reconstructing one of them. This approach is adaptable to various datasets and augmented-pair based learning methods. We evaluate its effectiveness on learning equivariant features through multiple linear regression tasks and downstream applications on both artificial (3DIEBench) and natural (ImageNet) datasets. Results consistently demonstrate significant improvements over standard augmentation-based self-supervised learning methods and state-of-the-art approaches, particularly excelling in scenarios involving combined augmentations. Our method enhances the learning of both invariant and equivariant features, leading to more robust and generalizable visual representations for computer vision tasks.

[CV-44] Composed Image Retrieval for Training-Free Domain Conversion WACV2025

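[Quick read]: This paper addresses composed image retrieval in the context of domain conversion: retrieving images that show the content of a query image in the domain specified by a query text. The key to the solution is to rely on the descriptive power of a strong vision-language model without additional training. The query image is mapped to the text input space by textual inversion; unlike the common practice of inverting in the continuous space of text tokens, the inversion is done in the discrete word space through a nearest-neighbor search in a text vocabulary. The image is thereby softly mapped across the vocabulary and made more robust with retrieval-based augmentation. Database images are then retrieved with a weighted ensemble of text queries that combine the mapped words with the domain text. The method outperforms prior art by a large margin on standard and newly introduced benchmarks.

Link: https://arxiv.org/abs/2412.03297
Authors: Nikos Efthymiadis, Bill Psomas, Zakaria Laskar, Konstantinos Karantzalos, Yannis Avrithis, Ondřej Chum, Giorgos Tolias
Keywords: work addresses composed, addresses composed image, composed image retrieval, work addresses, addresses composed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2025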

Abstract: This work addresses composed image retrieval in the context of domain conversion, where the content of a query image is retrieved in the domain specified by the query text. We show that a strong vision-language model provides sufficient descriptive power without additional training. The query image is mapped to the text input space using textual inversion. Unlike common practice that inverts in the continuous space of text tokens, we use the discrete word space via a nearest-neighbor search in a text vocabulary. With this inversion, the image is softly mapped across the vocabulary and is made more robust using retrieval-based augmentation. Database images are retrieved by a weighted ensemble of text queries combining mapped words with the domain text. Our method outperforms prior art by a large margin on standard and newly introduced benchmarks. Code: this https URL
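The two retrieval steps in the abstract, mapping an image to nearby vocabulary words and scoring database images with a weighted ensemble of text queries, can be sketched with toy vectors. The cosine similarity, the softmax weighting with a `temperature`, and the function names are assumptions for illustration; in practice the embeddings would come from a vision-language model, which is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def invert_to_words(image_vec, vocab, k=2, temperature=0.1):
    """Top-k vocabulary words nearest to the image embedding, with softmax weights."""
    sims = sorted(((cosine(image_vec, vec), word) for word, vec in vocab.items()), reverse=True)[:k]
    exps = [math.exp(s / temperature) for s, _ in sims]
    return [(word, e / sum(exps)) for (_, word), e in zip(sims, exps)]


def ensemble_score(db_vec, weighted_queries):
    """Weighted sum of similarities between a database image and each text query embedding."""
    return sum(w * cosine(db_vec, q) for q, w in weighted_queries)


# Hypothetical 2-D embeddings for a tiny vocabulary and one query image.
vocab = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "wolf": [0.7, 0.7]}
words = invert_to_words([1.0, 0.05], vocab, k=2)
print(words)
```

Each mapped word would then be combined with the domain text into a text query, and `ensemble_score` would rank database images using the word weights.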

[CV-45] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression

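[Quick read]: This paper addresses how to combine autoregressive and diffusion models effectively for learning visuomotor policies. The key to the solution is a new framework, DiffusionVLA, which uses a next-token prediction objective so the model can reason over the user's query in the context of current observations, and then uses a diffusion model to generate robust action outputs. The paper also introduces a reasoning injection module that integrates reasoning phrases directly into policy learning, strengthening the model's self-reasoning. The framework is simple, flexible, and easy to deploy and upgrade, and its effectiveness is validated in extensive experiments on multiple real robots.

Link: https://arxiv.org/abs/2412.03293
Authors: Junjie Wen, Minjie Zhu, Yichen Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Chengmeng Li, Xiaoyu Liu, Yaxin Peng, Chaomin Shen, Feifei Feng
Keywords: learning visuomotor policy, seamlessly combines, combines the autoregression, model, learning visuomotor
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is available at: this http URL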

Abstract: In this paper, we present DiffusionVLA, a novel framework that seamlessly combines the autoregression model with the diffusion model for learning visuomotor policy. Central to our approach is a next-token prediction objective, enabling the model to reason effectively over the user’s query in the context of current observations. Subsequently, a diffusion model is attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, where DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable. It allows observers to understand the model’s thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retain conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference; our smallest DiffusionVLA-2B runs at 82 Hz on a single A6000 GPU and can train from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.

[CV-46] Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

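[Quick read]: This paper examines the security of semantic watermarks (such as Tree-Rings and Gaussian Shading) in generative models such as latent diffusion models (LDMs). It reveals a fundamental vulnerability: attackers can use unrelated models, even ones with different latent spaces and architectures (UNet vs DiT), to mount powerful forgery attacks. The core contribution is two watermark forgery attacks. The first imprints a target watermark into real images by manipulating the latent representation in an unrelated LDM, and can also be used for watermark removal. The second inverts a watermarked image and regenerates it with an arbitrary prompt, producing new images that carry the target watermark. Both attacks need only a single reference image with the target watermark. These findings call into question the applicability of semantic watermarks in practice, since attackers can easily forge or remove them.

Link: https://arxiv.org/abs/2412.03283
Authors: Andreas Müller,Denis Lukovnikov,Jonas Thietke,Asja Fischer,Erwin Quiring
Keywords (EN): Integrating watermarking, Gaussian Shading, simplifies detection, generated content, generation process
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 21 figures, 6 tables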

Abstract:Integrating watermarking into the generation process of latent diffusion models (LDMs) simplifies detection and attribution of generated content. Semantic watermarks, such as Tree-Rings and Gaussian Shading, represent a novel class of watermarking techniques that are easy to implement and highly robust against various perturbations. However, our work demonstrates a fundamental security vulnerability of semantic watermarks. We show that attackers can leverage unrelated models, even with different latent spaces and architectures (UNet vs DiT), to perform powerful and realistic forgery attacks. Specifically, we design two watermark forgery attacks. The first imprints a targeted watermark into real images by manipulating the latent representation of an arbitrary image in an unrelated LDM to get closer to the latent representation of a watermarked image. We also show that this technique can be used for watermark removal. The second attack generates new images with the target watermark by inverting a watermarked image and re-generating it with an arbitrary prompt. Both attacks just need a single reference image with the target watermark. Overall, our findings question the applicability of semantic watermarks by revealing that attackers can easily forge or remove these watermarks under realistic conditions.

[CV-47] RFSR: Improving ISR Diffusion Models via Reward Feedback Learning

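[Quick read]: This paper addresses the problem that generative diffusion models (DM) for image super-resolution (ISR), when optimized with the denoising loss alone, may not fully improve the perceptual and aesthetic quality of generated images. The key to the solution is a timestep-aware training strategy that uses reward feedback learning to fine-tune existing models. Specifically, in the initial denoising stages of ISR diffusion, low-frequency constraints are applied to maintain structural stability; in the later denoising stages, reward feedback learning is used to improve perceptual and aesthetic quality. In addition, Gram-KL regularization is introduced to alleviate the stylization that reward feedback learning can cause. The method can be integrated into any diffusion-based ISR model in a plug-and-play manner, and experiments show that fine-tuned ISR diffusion models improve significantly in perceptual and aesthetic quality.

Link: https://arxiv.org/abs/2412.03268
Authors: Xiaopeng Sun,Qinwei Lin,Yu Gao,Yujie Zhong,Chengjian Feng,Dengjie Li,Zheng Zhao,Jie Hu,Lin Ma
Keywords (EN): Generative diffusion models, Generative diffusion, reward feedback learning, extensively utilized, feedback learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: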

Abstract:Generative diffusion models (DM) have been extensively utilized in image super-resolution (ISR). Most of the existing methods adopt the denoising loss from DDPMs for model optimization. We posit that introducing reward feedback learning to finetune the existing models can further improve the quality of the generated images. In this paper, we propose a timestep-aware training strategy with reward feedback learning. Specifically, in the initial denoising stages of ISR diffusion, we apply low-frequency constraints to super-resolution (SR) images to maintain structural stability. In the later denoising stages, we use reward feedback learning to improve the perceptual and aesthetic quality of the SR images. In addition, we incorporate Gram-KL regularization to alleviate stylization caused by reward hacking. Our method can be integrated into any diffusion-based ISR model in a plug-and-play manner. Experiments show that ISR diffusion models, when fine-tuned with our method, significantly improve the perceptual and aesthetic quality of SR images, achieving excellent subjective results. Code: this https URL

[CV-48] NeRF and Gaussian Splatting SLAM in the Wild

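[Quick read]: This paper addresses the challenges of using visual Simultaneous Localization and Mapping (SLAM) systems in outdoor environments, in particular dynamic scenes, lighting variations, and seasonal changes. The key to the approach is to evaluate and compare traditional SLAM methods against deep learning-based methods (such as neural radiance field and Gaussian Splatting-based SLAM) in natural outdoor environments, focusing on camera tracking accuracy, robustness to environmental factors, and computational efficiency. The results show that neural SLAM methods are more robust under challenging conditions such as low light, but at a higher computational cost, whereas traditional methods perform best across seasons but are highly sensitive to lighting changes.

Link: https://arxiv.org/abs/2412.03263
Authors: Fabian Schmidt,Markus Enzweiler,Abhinav Valada
Keywords (EN): visual Simultaneous Localization, Localization and Mapping, Simultaneous Localization, requiring robust solutions, systems poses significant
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 4 tables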

Abstract:Navigating outdoor environments with visual Simultaneous Localization and Mapping (SLAM) systems poses significant challenges due to dynamic scenes, lighting variations, and seasonal changes, requiring robust solutions. While traditional SLAM methods struggle with adaptability, deep learning-based approaches and emerging neural radiance fields as well as Gaussian Splatting-based SLAM methods, offer promising alternatives. However, these methods have primarily been evaluated in controlled indoor environments with stable conditions, leaving a gap in understanding their performance in unstructured and variable outdoor settings. This study addresses this gap by evaluating these methods in natural outdoor environments, focusing on camera tracking accuracy, robustness to environmental factors, and computational efficiency, highlighting distinct trade-offs. Extensive evaluations demonstrate that neural SLAM methods achieve superior robustness, particularly under challenging conditions such as low light, but at a high computational cost. At the same time, traditional methods perform the best across seasons but are highly sensitive to variations in lighting conditions. The code of the benchmark is publicly available at this https URL.

[CV-49] GERD: Geometric event response data generation

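[Quick read]: This paper addresses the use of event camera data in research on geometric methods, in particular the lack of geometric and physical foundations comparable to those of conventional camera models. The key to the solution is a method for generating event data: a prototypical object is subjected to controlled transformations that change over time, producing carefully curated event videos. The method aims to simplify the study of geometric approaches in event-based vision by providing a controllable way to generate event camera data.

Link: https://arxiv.org/abs/2412.03259
Authors: Jens Egholm Pedersen,Dimitris Korakovounis,Jörg Conradt
Keywords (EN): higher dynamic range, higher dynamic, dynamic range, low-power consumption, sensors are appealing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: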

Abstract:Event-based vision sensors are appealing because of their time resolution, higher dynamic range, and low-power consumption. They also provide data that is fundamentally different from conventional frame-based cameras: events are sparse, discrete, and require integration in time. Unlike conventional models grounded in established geometric and physical principles, event-based models lack comparable foundations. We introduce a method to generate event-based data under controlled transformations. Specifically, we subject a prototypical object to transformations that change over time to produce carefully curated event videos. We hope this work simplifies studies for geometric approaches in event-based vision. GERD is available at this https URL
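
To make the idea of generating events from a controlled transformation concrete, here is a minimal sketch. It is not GERD's implementation or API: the helper names (`render_square`, `events_between`, `translating_square_events`) are hypothetical, the shape is a plain binary square, and events are emitted by thresholding the raw intensity difference between consecutive frames (real event cameras respond to changes in log intensity).

```python
def render_square(size, left, top, side):
    """Render a binary image (list of rows) containing one filled square."""
    return [[1.0 if left <= x < left + side and top <= y < top + side else 0.0
             for x in range(size)]
            for y in range(size)]


def events_between(prev, curr, threshold=0.5):
    """Return (x, y, polarity) for pixels whose intensity changed by more than threshold."""
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            diff = c - p
            if diff > threshold:
                events.append((x, y, +1))   # ON event: pixel got brighter
            elif diff < -threshold:
                events.append((x, y, -1))   # OFF event: pixel got darker
    return events


def translating_square_events(size=8, side=3, steps=4):
    """Shift a square one pixel to the right per step and collect (t, x, y, polarity)."""
    stream = []
    prev = render_square(size, 0, 2, side)
    for t in range(1, steps + 1):
        curr = render_square(size, t, 2, side)
        stream.extend((t, x, y, pol) for x, y, pol in events_between(prev, curr))
        prev = curr
    return stream
```

Because the transformation is known exactly, the resulting sparse event stream comes with ground truth for the motion, which is the property that makes such synthetic data useful for studying geometric methods.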

[CV-50] DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation

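[Quick read]: This paper addresses the inefficiency and condition conflicts that existing text-to-image diffusion models exhibit under multi-condition control. The key to the solution is a new framework, DynamicControl, which supports dynamic combinations of diverse control signals and allows adaptive selection of different numbers and types of conditions. Its core innovations are: 1) a double-cycle controller that generates an initial real score sorting, using pre-trained conditional generation models and discriminative models to evaluate the similarity between extracted and input conditions as well as the pixel-level similarity with the source image; 2) a condition evaluator built on a Multimodal Large Language Model (MLLM) that optimizes the ordering of conditions; 3) joint optimization of the MLLM and the diffusion model, using the MLLM's reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks; 4) a parallel multi-control adapter that learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images.

Link: https://arxiv.org/abs/2412.03255
Authors: Qingdong He,Jinlong Peng,Pengcheng Xu,Boyuan Jiang,Xiaobin Hu,Donghao Luo,Yong Liu,Yabiao Wang,Chengjie Wang,Xiangtai Li,Jiangning Zhang
Keywords (EN): current ControlNet-like models, dictate image attributes, conditions, current ControlNet-like, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: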

Abstract:To enhance the controllability of text-to-image diffusion models, current ControlNet-like models have explored various control signals to dictate image attributes. However, existing methods either handle conditions inefficiently or use a fixed number of conditions, which does not fully address the complexity of multiple conditions and their potential conflicts. This underscores the need for innovative approaches to manage multiple conditions effectively for more reliable and detailed image synthesis. To address this issue, we propose a novel framework, DynamicControl, which supports dynamic combinations of diverse control signals, allowing adaptive selection of different numbers and types of conditions. Our approach begins with a double-cycle controller that generates an initial real score sorting for all input conditions by leveraging pre-trained conditional generation models and discriminative models. This controller evaluates the similarity between extracted conditions and input conditions, as well as the pixel-level similarity with the source image. Then, we integrate a Multimodal Large Language Model (MLLM) to build an efficient condition evaluator. This evaluator optimizes the ordering of conditions based on the double-cycle controller’s score ranking. Our method jointly optimizes MLLMs and diffusion models, utilizing MLLMs’ reasoning capabilities to facilitate multi-condition text-to-image (T2I) tasks. The final sorted conditions are fed into a parallel multi-control adapter, which learns feature maps from dynamic visual conditions and integrates them to modulate ControlNet, thereby enhancing control over generated images. Through both quantitative and qualitative comparisons, DynamicControl demonstrates its superiority over existing methods in terms of controllability, generation quality and composability under various conditional controls.

[CV-51] Task-driven Image Fusion with Learnable Fusion Loss

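[Quick read]: This paper addresses the problem that current multi-modal image fusion methods for downstream tasks use predefined fusion objectives that may not match the task requirements, which limits the adaptability and flexibility of the model. The key to the solution is the Task-driven Image Fusion (TDFusion) framework, which introduces a learnable fusion loss that is guided by the task loss and supervised in a meta-learning manner. Specifically, the fusion loss contains learnable parameters modeled by a neural network called the loss generation module. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing the task loss, so that the fusion process better matches the task objectives. Because TDFusion's training relies solely on the loss of the downstream task, it can adapt to any specific task and can be applied to any fusion and task network architecture.

Link: https://arxiv.org/abs/2412.03240
Authors: Haowen Bai,Jiangshe Zhang,Zixiang Zhao,Yichen Wu,Lilun Deng,Yukun Cui,Tao Feng,Shuang Xu
Keywords (EN): Multi-modal image fusion, multiple sensor sources, achieving superior visual, superior visual quality, perceptual characteristics compared
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: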

Abstract:Multi-modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual characteristics compared to any single source, often enhancing downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the loss of downstream tasks in a meta-learning manner. The learning objective is to minimize the task loss of the fused images, once the fusion module has been optimized by the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion’s training relies solely on the loss of downstream tasks, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion’s performance in both fusion and task-related applications, including four public fusion datasets, semantic segmentation, and object detection. The code will be released.

[CV-52] MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers

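[Quick read]: This paper addresses high-quality material generation for virtual environment authoring and inverse rendering. The key to the solution is MaterialPicker, a multi-modal material generator built on a Diffusion Transformer (DiT) architecture that produces high-quality materials from text prompts and/or photographs. MaterialPicker generates a material from an image crop of a material sample and handles the distortion, oblique viewing angles, and partial occlusion that are common in photographs of natural scenes; users can also supply a text prompt as additional guidance. The authors fine-tune a pre-trained DiT-based video generator into a material generator, treating each material map as a frame in a video sequence. In both quantitative and qualitative evaluations, the method shows more diverse material generation and better distortion correction than previous work.

Link: https://arxiv.org/abs/2412.03225
Authors: Xiaohe Ma,Valentin Deschaintre,Miloš Hašan,Fujun Luan,Kun Zhou,Hongzhi Wu,Yiwei Hu
Keywords (EN): virtual environment authoring, inverse rendering, High-quality material generation, key for virtual, virtual environment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: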

Abstract:High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.

[CV-53] Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

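[Quick read]: This paper addresses the following problem: in self-supervised learning (SSL) of visual representations, models pre-trained with Masked Image Modeling (MIM) perform worse on high-level perception tasks than Joint-Embedding Architectures (JEA). By analyzing the information flow in Vision Transformers (ViT) trained with the two approaches, the paper shows that MIM models lack selectivity when aggregating image content, which lowers the quality of their representations. The key to the solution is to selectively aggregate relevant patch tokens, which significantly improves the quality of MIM representations without any fine-tuning. The finding highlights ineffective representation aggregation as an issue in MIM and suggests directions for future self-supervised learning research.

Link: https://arxiv.org/abs/2412.03215
Authors: Marcin Przewięźlikowski,Randall Balestriero,Wojciech Jasiński,Marek Śmieja,Bartosz Zieliński
Keywords (EN): Masked Image Modeling, Masked Image, Image Modeling, prominent SSL paradigm, popular method
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: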

Abstract: Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, for high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than the Joint-Embedding Architectures (JEA), another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) learned by both approaches. We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly whole image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information within their patch tokens, which is not effectively captured by the global [cls] token representations. Therefore, selective aggregation of relevant patch tokens, without any fine-tuning, results in consistently higher-quality MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM and propose directions to address it, contributing to future advances in Self-Supervised Learning.

[CV-54] Continual Low-Rank Scaled Dot-product Attention

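[Quick read]: This paper addresses the computational and memory constraints that Transformer models face when processing stream data. The key to the solution is a new formulation of Scaled Dot-product Attention based on the Nyström approximation that is suited to Continual Inference. In Online Audio Classification and Online Action Detection tasks, the method reduces the number of operations by up to three orders of magnitude while retaining predictive performance comparable to competing models.

Link: https://arxiv.org/abs/2412.03214
Authors: Ginés Carreto Picón,Illia Oleksiienko,Lukas Hedegaard,Arian Bakhtiarnia,Alexandros Iosifidis
Keywords (EN): Scaled Dot-product Attention, capture data relations, Continual Scaled Dot-product, Scaled Dot-product, Dot-product Attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures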

Abstract:Transformers are widely used for their ability to capture data relations in sequence processing, with great success for a wide range of static tasks. However, the computational and memory footprint of their main component, i.e., the Scaled Dot-product Attention, is commonly overlooked. This makes their adoption in applications involving stream data processing with constraints in response latency, computational and memory resources infeasible. Some works have proposed methods to lower the computational cost of transformers, i.e. low-rank approximations, sparsity in attention, and efficient formulations for Continual Inference. In this paper, we introduce a new formulation of the Scaled Dot-product Attention based on the Nyström approximation that is suitable for Continual Inference. In experiments on Online Audio Classification and Online Action Detection tasks, the proposed Continual Scaled Dot-product Attention can lower the number of operations by up to three orders of magnitude compared to the original Transformers while retaining the predictive performance of competing models.
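
For readers unfamiliar with the Nyström approximation of attention, the sketch below shows the standard batch (non-continual) form, in the style popularized by Nyströmformer. It is an illustration under stated assumptions, not the continual formulation proposed in the paper: landmarks are taken as segment means of the queries and keys, the function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def exact_attention(q, k, v):
    """Standard scaled dot-product attention, O(n^2) in sequence length n."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def nystrom_attention(q, k, v, num_landmarks):
    """Low-rank attention built from m landmarks; cost scales with n*m instead of n^2."""
    n, d = q.shape
    assert n % num_landmarks == 0, "sketch assumes n is divisible by the landmark count"
    seg = n // num_landmarks
    # Landmarks: mean of each contiguous segment of queries and keys (an assumption).
    q_land = q.reshape(num_landmarks, seg, d).mean(axis=1)
    k_land = k.reshape(num_landmarks, seg, d).mean(axis=1)
    scale = 1.0 / np.sqrt(d)
    f = softmax(q @ k_land.T * scale)        # (n, m): queries vs landmark keys
    a = softmax(q_land @ k_land.T * scale)   # (m, m): landmark queries vs landmark keys
    b = softmax(q_land @ k.T * scale)        # (m, n): landmark queries vs keys
    # softmax(QK^T) V is approximated by F A^+ (B V).
    return f @ np.linalg.pinv(a) @ (b @ v)
```

When the number of landmarks equals the sequence length, the three factors coincide with the full attention matrix and the approximation reproduces exact attention; smaller landmark counts trade accuracy for fewer operations.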

[CV-55] Semi-Supervised Transfer Boosting (SS-TrBoosting)

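[Quick read]: This paper addresses two main problems in semi-supervised domain adaptation (SSDA): first, it is difficult to find a feature space in which the source and target domains share the same conditional probability distribution; second, there is no flexible and effective strategy for extending existing unsupervised domain adaptation (UDA) methods to the SSDA setting. The key to the solution is a new fine-tuning framework, Semi-supervised Transfer Boosting (SS-TrBoosting). The framework takes a well-trained deep learning-based UDA or SSDA model as the initial model, generates additional base learners by boosting, and combines them into an ensemble. Specifically, half of the base learners are generated by supervised domain adaptation and the other half by semi-supervised learning. In addition, for more efficient data transmission and better data privacy protection, the paper proposes a source data generation method that extends SS-TrBoosting to semi-supervised source-free domain adaptation (SS-SFDA). Experiments show that SS-TrBoosting can be applied to a variety of existing UDA, SSDA, and SFDA methods to further improve their performance.

Link: https://arxiv.org/abs/2412.03212
Authors: Lingfei Deng,Changming Zhao,Zhenbang Du,Kun Xia,Dongrui Wu
Keywords (EN): domain adaptation, aims at training, training a high-performance, plenty of auxiliary, SSDA
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: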

Abstract:Semi-supervised domain adaptation (SSDA) aims at training a high-performance model for a target domain using few labeled target data, many unlabeled target data, and plenty of auxiliary data from a source domain. Previous works in SSDA mainly focused on learning transferable representations across domains. However, it is difficult to find a feature space where the source and target domains share the same conditional probability distribution. Additionally, there is no flexible and effective strategy extending existing unsupervised domain adaptation (UDA) approaches to SSDA settings. In order to solve the above two challenges, we propose a novel fine-tuning framework, semi-supervised transfer boosting (SS-TrBoosting). Given a well-trained deep learning-based UDA or SSDA model, we use it as the initial model, generate additional base learners by boosting, and then use all of them as an ensemble. More specifically, half of the base learners are generated by supervised domain adaptation, and half by semi-supervised learning. Furthermore, for more efficient data transmission and better data privacy protection, we propose a source data generation approach to extend SS-TrBoosting to semi-supervised source-free domain adaptation (SS-SFDA). Extensive experiments showed that SS-TrBoosting can be applied to a variety of existing UDA, SSDA and SFDA approaches to further improve their performance.

[CV-56] Parametric Enhancement of PerceptNet: A Human-Inspired Approach for Image Quality Assessment

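[Quick read]: This paper addresses the lack of biological plausibility in deep learning models of human vision. The key to the solution is to parametrize neural network layers so that their operations better match the biological properties of human vision, which reduces the number of trainable parameters and improves interpretability. Specifically, the paper optimizes only the parameters of functional forms found in human vision, instead of optimizing all convolutional tensor elements independently. In this way the models achieve results comparable to the state of the art with orders of magnitude fewer parameters, along with better training behavior and interpretability. The paper also examines a model fitted to human perception experimental data: despite its biologically plausible initialization, it converges to biologically incorrect results during training, which underscores the need for diverse evaluation methods to measure how human-like a model is.

Link: https://arxiv.org/abs/2412.03210
Authors: Jorge Vila-Tomás,Pablo Hernández-Cámara,Valero Laparra,Jesús Malo
Keywords (EN): learn human-like features, earlier levels, deep learning models, modeling human vision, deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: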

Abstract:While deep learning models can learn human-like features at earlier levels, which suggests their utility in modeling human vision, few attempts exist to incorporate these features by design. Current approaches mostly optimize all parameters blindly, only constraining minor architectural aspects. This paper demonstrates how parametrizing neural network layers enables more biologically-plausible operations while reducing trainable parameters and improving interpretability. We constrain operations to functional forms present in human vision, optimizing only these functions’ parameters rather than all convolutional tensor elements independently. We present two parametric model versions: one with hand-chosen biologically plausible parameters, and another fitted to human perception experimental data. We compare these with a non-parametric version. All models achieve comparable state-of-the-art results, with parametric versions showing orders of magnitude parameter reduction for minimal performance loss. The parametric models demonstrate improved interpretability and training behavior. Notably, the model fitted to human perception, despite biological initialization, converges to biologically incorrect results. This raises scientific questions and highlights the need for diverse evaluation methods to measure models’ humanness, rather than assuming task performance correlates with human-like behavior.

[CV-57] Fab-ME: A Vision State-Space and Attention-Enhanced Framework for Fabric Defect Detection

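[Quick read]: This paper addresses the problems of accuracy, real-time performance, and efficient global information extraction in fabric defect detection. The key to the solution is Fab-ME, an advanced framework based on YOLOv8s, with two new modules: the cross-stage partial bottleneck with two convolutions vision state-space module (C2F-VMamba) and the enhanced multi-scale channel attention module (EMCA). The C2F-VMamba module integrates visual state-space (VSS) blocks into the neck of the YOLOv8s feature fusion network, improving the capture of fine details and global context while maintaining high processing speed. The EMCA module is added to the final layer of the feature extraction network and significantly improves sensitivity to small targets. Experiments on the Tianchi fabric defect detection dataset show that Fab-ME improves mAP@0.5 by 3.3% over the original YOLOv8s, validating its efficient and precise defect detection.

Link: https://arxiv.org/abs/2412.03200
Authors: Shuai Wang,Huiyan Kong,Baotian Li,Fa Zheng
Keywords (EN): Effective defect detection, Effective defect, fabric defect detection, ensuring the quality, textile products
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures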

Abstract:Effective defect detection is critical for ensuring the quality, functionality, and economic value of textile products. However, existing methods face challenges in achieving high accuracy, real-time performance, and efficient global information extraction. To address these issues, we propose Fab-ME, an advanced framework based on YOLOv8s, specifically designed for the accurate detection of 20 fabric defect types. Our contributions include the introduction of the cross-stage partial bottleneck with two convolutions (C2F) vision state-space (C2F-VMamba) module, which integrates visual state-space (VSS) blocks into the YOLOv8s feature fusion network neck, enhancing the capture of intricate details and global context while maintaining high processing speeds. Additionally, we incorporate an enhanced multi-scale channel attention (EMCA) module into the final layer of the feature extraction network, significantly improving sensitivity to small targets. Experimental results on the Tianchi fabric defect detection dataset demonstrate that Fab-ME achieves a 3.3% improvement in mAP@0.5 compared to the original YOLOv8s, validating its effectiveness for precise and efficient fabric defect detection.

[CV-58] Biologically-inspired Semi-supervised Semantic Segmentation for Biomedical Imaging

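[Quick read]: This paper addresses the performance bottleneck caused by scarce labeled data when training downsampling-upsampling semantic segmentation architectures. The key to the solution is a novel two-stage semi-supervised learning approach. The first stage uses the bio-inspired Hebbian principle ("fire together, wire together") as a local learning rule to update the weights of convolutional and transpose-convolutional layers without supervision, discovering data features. The second stage fine-tunes the model with standard backpropagation on a small subset of labeled data. Experiments show that the method outperforms existing state-of-the-art (SOTA) approaches across different levels of label availability, and that initializing SOTA approaches with the unsupervised stage also improves their performance.

Link: https://arxiv.org/abs/2412.03192
Authors: Luca Ciampi,Gabriele Lagani,Giuseppe Amato,Fabrizio Falchi
Keywords (EN): semantic segmentation architectures, training downsampling-upsampling semantic, downsampling-upsampling semantic segmentation, two-stage semi-supervised learning, semi-supervised learning approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: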

Abstract:We propose a novel two-stage semi-supervised learning approach for training downsampling-upsampling semantic segmentation architectures. The first stage does not use backpropagation. Rather, it exploits the bio-inspired Hebbian principle “fire together, wire together” as a local learning rule for updating the weights of both convolutional and transpose-convolutional layers, allowing unsupervised discovery of data features. In the second stage, the model is fine-tuned with standard backpropagation on a small subset of labeled data. We evaluate our methodology through experiments conducted on several widely used biomedical datasets, deeming that this domain is paramount in computer vision and is notably impacted by data scarcity. Results show that our proposed method outperforms SOTA approaches across different levels of label availability. Furthermore, we show that using our unsupervised stage to initialize the SOTA approaches leads to performance improvements. The code to replicate our experiments can be found at: this https URL
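
The abstract does not state the exact Hebbian update used in the first stage, so the sketch below swaps in Oja's rule, a well-known stabilized Hebbian rule, for a single linear neuron. The function names are hypothetical and the paper applies its rule to convolutional and transpose-convolutional layers, which this toy example does not attempt.

```python
import random


def oja_update(w, x, lr):
    """One Hebbian step with Oja's normalization for a single linear neuron."""
    y = sum(wi * xi for wi, xi in zip(w, x))  # post-synaptic activity
    # lr * y * x is the plain Hebbian term ("fire together, wire together");
    # -lr * y * y * w is Oja's decay, which keeps the weight norm bounded.
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


def train_unsupervised(samples, dim, lr=0.05, seed=0):
    """Run the local rule over unlabeled samples; no backpropagation is involved."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    for x in samples:
        w = oja_update(w, x, lr)
    return w
```

Each update depends only on the neuron's own input and output, which is what makes the rule local; under Oja's rule the weight vector tends toward the dominant direction of variation in the data, a simple instance of unsupervised feature discovery.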

[CV-59] Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization WACV2025

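[Quick read]: This paper addresses two main problems of Multi-Task Learning (MTL) for dense prediction tasks in computer vision: 1) insufficient cross-task interaction, which leads to task-specific predictions with poor geometric and predictive coherence; 2) inadequate loss weighting strategies in existing methods, which fail to account for the inherent variability in task evolution during training. The key to the solution is: 1) combining state-of-the-art vision transformers with task-specific decoders, and introducing a trace-back method to improve the coherence of cross-task geometric and predictive features; 2) a dynamic task balancing strategy that projects task losses onto a common scale and prioritizes more challenging tasks during training. These contributions improve model performance and establish a new state of the art on two benchmark datasets.

Link: https://arxiv.org/abs/2412.03179
Authors: Maxime Fontana,Michael Spratling,Miaojing Shi
Keywords (EN): offering notable advantages, Multi-Task Learning, involves the concurrent, offering notable, notable advantages
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by WACV 2025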

Abstract: Multi-Task Learning (MTL) involves the concurrent training of multiple tasks, offering notable advantages for dense prediction tasks in computer vision. MTL not only reduces training and inference time as opposed to having multiple single-task models, but also enhances task accuracy through the interaction of multiple tasks. However, existing methods face limitations. They often rely on suboptimal cross-task interactions, resulting in task-specific predictions with poor geometric and predictive coherence. In addition, many approaches use inadequate loss weighting strategies, which do not address the inherent variability in task evolution during training. To overcome these challenges, we propose an advanced MTL model specifically designed for dense vision tasks. Our model leverages state-of-the-art vision transformers with task-specific decoders. To enhance cross-task coherence, we introduce a trace-back method that improves both cross-task geometric and predictive features. Furthermore, we present a novel dynamic task balancing approach that projects task losses onto a common scale and prioritizes more challenging tasks during training. Extensive experiments demonstrate the superiority of our method, establishing new state-of-the-art performance across two benchmark datasets. The code is available at: this https URL
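
The abstract describes the balancing scheme only at a high level, so the following is one possible reading rather than the authors' method. It assumes that "a common scale" can be approximated by dividing each task's current loss by its initial loss, and that "prioritizing more challenging tasks" can be approximated by a softmax over those ratios. The function name, the `temperature` parameter, and the task names in the usage note are hypothetical.

```python
import math


def balance_weights(current_losses, initial_losses, temperature=1.0):
    """Map per-task losses to weights that favor tasks making the least relative progress."""
    # Common scale: each loss relative to its starting value (1.0 means no progress yet).
    ratios = {task: current_losses[task] / initial_losses[task] for task in current_losses}
    # Softmax over ratios, shifted by the maximum for numerical stability.
    peak = max(ratios.values())
    exps = {task: math.exp((r - peak) / temperature) for task, r in ratios.items()}
    total = sum(exps.values())
    # Rescale so the weights sum to the number of tasks.
    n = len(ratios)
    return {task: n * e / total for task, e in exps.items()}
```

For example, with current losses `{"depth": 0.9, "seg": 2.0}` and initial losses `{"depth": 1.0, "seg": 4.0}`, the depth task has improved less in relative terms and therefore receives the larger weight, even though its raw loss is smaller.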

[CV-60] Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation

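[Quick read]: This paper addresses uncertainty quantification in text-to-image (T2I) generative models, in particular uncertainty with respect to the input prompt. The key to the solution is a new method, Prompt-based UNCertainty Estimation for T2I models (PUNC), which uses Large Vision-Language Models (LVLMs) to assess the semantic consistency between a generated image and the original prompt. PUNC captions the generated image and compares the caption with the original prompt, quantifying uncertainty in the more semantically meaningful text space, and it can separate aleatoric uncertainty from epistemic uncertainty. The method outperforms existing uncertainty estimation techniques across various experimental settings and opens up applications such as bias detection, copyright protection, and out-of-distribution (OOD) detection.

Link: https://arxiv.org/abs/2412.03178
Authors: Gianni Franchi,Dat Nguyen Trong,Nacim Belkhir,Guoxuan Xia,Andrea Pilzer
Keywords (EN): understanding model behavior, improving output reliability, output reliability, crucial for understanding, PUNC
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages and 22 figures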

Abstract: Uncertainty quantification in text-to-image (T2I) generative models is crucial for understanding model behavior and improving output reliability. In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. Alongside adapting existing approaches designed to measure uncertainty in the image space, we also introduce Prompt-based UNCertainty Estimation for T2I models (PUNC), a novel method leveraging Large Vision-Language Models (LVLMs) to better address uncertainties arising from the semantics of the prompt and generated images. PUNC utilizes an LVLM to caption a generated image, and then compares the caption with the original prompt in the more semantically meaningful text space. PUNC also enables the disentanglement of both aleatoric and epistemic uncertainties via precision and recall, which image-space approaches are unable to do. Extensive experiments demonstrate that PUNC outperforms state-of-the-art uncertainty estimation techniques across various settings. Uncertainty quantification in text-to-image generation models can be used in various applications, including bias detection, copyright protection, and OOD detection. We also introduce a comprehensive dataset of text prompts and generation pairs to foster further research in uncertainty quantification for generative models. Our findings illustrate that PUNC not only achieves competitive performance but also enables novel applications in evaluating and improving the trustworthiness of text-to-image models.
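
The caption-versus-prompt comparison can be illustrated with a deliberately crude stand-in. In the sketch below the caption is assumed to have been produced already by some LVLM, and precision and recall are computed over word sets; the abstract does not say which text-similarity measure PUNC actually uses, so the token-overlap scoring and the function names here are assumptions.

```python
def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")}


def prompt_caption_scores(prompt, caption):
    """Precision and recall of a caption (from the generated image) against the prompt."""
    prompt_set, caption_set = tokens(prompt), tokens(caption)
    overlap = len(prompt_set & caption_set)
    # Precision: share of caption content that the prompt actually asked for.
    precision = overlap / len(caption_set) if caption_set else 0.0
    # Recall: share of prompt content that shows up in the caption.
    recall = overlap / len(prompt_set) if prompt_set else 0.0
    return precision, recall
```

With the prompt "a red car on a beach" and the caption "a red car", every caption word is covered by the prompt (precision 1.0), while only three of the five distinct prompt words are recovered (recall 0.6). Keeping the two scores separate is what lets a method of this kind distinguish different sources of mismatch.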

[CV-61] PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

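[Quick read]: This paper addresses the low quality of images produced by finetuning-free personalized image generation and their inconsistency with the reference images. The key to the solution is PatchDPO, which adds an extra training stage: a pre-trained vision model, together with a self-supervised training method, estimates the quality of each patch in a generated image, and a weighted training strategy then trains the model so that high-quality patches are rewarded and low-quality patches are penalized. The method significantly improves the performance of multiple pre-trained personalized generation models and achieves state-of-the-art performance on both single-object and multi-object personalized image generation.

Link: https://arxiv.org/abs/2412.03177
Authors: Qihan Huang,Long Chan,Jinlong Liu,Wanggui He,Hao Jiang,Mingli Song,Jie Song
Keywords (EN): attracting wide research, wide research interest, research interest owing, personalized image generation, synthesize customized images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: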

Abstract:Finetuning-free personalized image generation can synthesize customized images without test-time finetuning, attracting wide research interest owing to its high efficiency. Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images during test-time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. However, traditional DPO only determines the overall superiority or inferiority of two samples, which is not suitable for personalized image generation because the generated images are commonly inconsistent with the reference images only in some local image patches. To tackle this problem, this work proposes PatchDPO that estimates the quality of image patches within each generated image and accordingly trains the model. To this end, PatchDPO first leverages the pre-trained vision model with a proposed self-supervised training method to estimate the patch quality. Next, PatchDPO adopts a weighted training approach to train the model with the estimated patch quality, which rewards the image patches with high quality while penalizing the image patches with low quality. Experiment results demonstrate that PatchDPO significantly improves the performance of multiple pre-trained personalized generation models, and achieves state-of-the-art performance on both single-object and multi-object personalized image generation. Our code is available at this https URL.

[CV-62] IRisPath: Enhancing Off-Road Navigation with Robust IR-RGB Fusion for Improved Day and Night Traversability

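Quick read: This paper addresses the poor performance of traditional on-road autonomous driving methods on dynamic terrain in applications such as agriculture, construction, search and rescue, and defence. The key to the solution is a multimodal fusion network, FuseIsPath, which uses long-wave infrared (LWIR) and RGB images to improve robustness under changing weather and lighting conditions. The paper also develops a new targetless extrinsic calibration method for precisely aligning LWIR, LiDAR, and RGB cameras, with a translation accuracy of 1.7 cm and a rotation accuracy of 0.827 degrees.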

Link: https://arxiv.org/abs/2412.03173
Authors: Saksham Sharma, Akshit Raizada, Suresh Sundaram
Keywords: Autonomous off-road navigation, applications in agriculture, search and rescue, rescue and defence, required for applications
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Autonomous off-road navigation is required for applications in agriculture, construction, search and rescue, and defence. Traditional on-road autonomous methods struggle with dynamic terrains, leading to poor vehicle control off-road. Recent deep-learning models have used perception sensors along with kinesthetic feedback for navigation on such terrains. However, this approach suffers from out-of-domain uncertainty: factors such as changes in weather and time of day impact the performance of the model. We propose a multimodal fusion network, FuseIsPath, capable of using LWIR and RGB images to provide robustness against dynamic weather and light conditions. To aid further work in this domain, we also open-source a day-night dataset with LWIR and RGB images along with pseudo-labels for traversability. To co-register the two images, we developed a novel method for targetless extrinsic calibration of LWIR, LiDAR and RGB cameras with a translation accuracy of 1.7 cm and a rotation accuracy of 0.827 degrees.

[CV-63] Are Explanations Helpful? A Comparative Analysis of Explainability Methods in Skin Lesion Classifiers

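Quick read: This paper addresses the lack of decision transparency in deep learning models for skin lesion diagnosis. The key to the solution is to identify and evaluate methods for explaining these models' decisions, specifically pixel-attribution techniques (Grad-CAM, Score-CAM, LIME, SHAP) and high-level concept techniques (ACE, ICE, CME). By analyzing how these methods perform on skin-lesion models, the paper finds that although they can reveal some model biases, the comprehensiveness of the explanations must improve further before the models are transparent and reliable enough for clinical use.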

Link: https://arxiv.org/abs/2412.03166
Authors: Rosa Y. G. Paccotacya-Yanque, Alceu Bissoto, Sandra Avila
Keywords: computer vision tasks, shown outstanding results, Learning has shown, vision tasks, Deep Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages. Paper accepted at 20th International Symposium on Medical Information Processing and Analysis (SIPAIM)

Abstract:Deep Learning has shown outstanding results in computer vision tasks; healthcare is no exception. However, there is no straightforward way to expose the decision-making process of DL models. Good accuracy is not enough for skin cancer predictions. Understanding the model’s behavior is crucial for clinical application and reliable outcomes. In this work, we identify desiderata for explanations in skin-lesion models. We analyzed seven methods, four based on pixel-attribution (Grad-CAM, Score-CAM, LIME, SHAP) and three on high-level concepts (ACE, ICE, CME), for a deep neural network trained on the International Skin Imaging Collaboration Archive. Our findings indicate that while these techniques reveal biases, there is room for improving the comprehensiveness of explanations to achieve transparency in skin-lesion models.

[CV-64] Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis

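Quick read: This paper addresses a limitation in exemplar-based semantic image synthesis: conventional structure-guidance models such as ControlNet cannot take an exemplar image directly as input and rely solely on text prompts to control appearance. The key to the solution is the Appearance Matching Adapter (AM-Adapter), a learnable framework that improves cross-image matching by incorporating semantic information from segmentation maps into the augmented self-attention mechanism. To disentangle generation from matching, training proceeds in stages: the structure-guidance and generation networks are trained first, then frozen while the AM-Adapter is trained on its own. At inference time, an automated exemplar retrieval method efficiently selects exemplar image-segmentation pairs. Despite its limited number of learnable parameters, the method achieves state-of-the-art semantic alignment and local appearance fidelity.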

Link: https://arxiv.org/abs/2412.03150
Authors: Siyoon Jin, Jisu Nam, Jiyoung Kim, Dahyun Chung, Yeong-Seok Kim, Joonhyung Park, Heonjeong Chu, Seungryong Kim
Keywords: Exemplar-based semantic image, image synthesis aims, generate images aligned, Exemplar-based semantic, synthesis aims
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Exemplar-based semantic image synthesis aims to generate images aligned with given semantic content while preserving the appearance of an exemplar image. Conventional structure-guidance models, such as ControlNet, are limited in that they cannot directly utilize exemplar images as input, relying instead solely on text prompts to control appearance. Recent tuning-free approaches address this limitation by transferring local appearance from the exemplar image to the synthesized image through implicit cross-image matching in the augmented self-attention mechanism of pre-trained diffusion models. However, these methods face challenges when applied to content-rich scenes with significant geometric deformations, such as driving scenes. In this paper, we propose the Appearance Matching Adapter (AM-Adapter), a learnable framework that enhances cross-image matching within augmented self-attention by incorporating semantic information from segmentation maps. To effectively disentangle generation and matching processes, we adopt a stage-wise training approach. Initially, we train the structure-guidance and generation networks, followed by training the AM-Adapter while keeping the other networks frozen. During inference, we introduce an automated exemplar retrieval method to efficiently select exemplar image-segmentation pairs. Despite utilizing a limited number of learnable parameters, our method achieves state-of-the-art performance, excelling in both semantic alignment preservation and local appearance fidelity. Extensive ablation studies further validate our design choices. Code and pre-trained weights will be publicly available: this https URL

[CV-65] Splats in Splats: Embedding Invisible 3D Watermark within Gaussian Splatting

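Quick read: This paper addresses copyright protection for 3D Gaussian Splatting (3DGS) assets while preserving the usability of those assets. The key to the solution is WaterGS, the first watermarking framework that embeds 3D content in 3DGS itself without modifying any attribute of the vanilla 3DGS. Building on an in-depth study of spherical harmonics (SH), the authors design an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. A convolutional autoencoder then maps the opacity of the original Gaussian primitives to the opacity of the hidden Gaussian primitives. Experiments show that WaterGS significantly outperforms existing 3D steganography techniques in scene fidelity and rendering speed while ensuring security, robustness, and user experience.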

Link: https://arxiv.org/abs/2412.03121
Authors: Yijia Guo, Wenkai Huang, Yang Li, Gaolei Li, Hang Zhang, Liwen Hu, Jianhua Li, Tiejun Huang, Lei Ma
Keywords: explicit scene representations, demonstrated impressive, performance with explicit, Gaussian splatting, reconstruction performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:3D Gaussian splatting (3DGS) has demonstrated impressive 3D reconstruction performance with explicit scene representations. Given the widespread application of 3DGS in 3D reconstruction and generation tasks, there is an urgent need to protect the copyright of 3DGS assets. However, existing copyright protection techniques for 3DGS overlook the usability of 3D assets, posing challenges for practical deployment. Here we describe WaterGS, the first 3DGS watermarking framework that embeds 3D content in 3DGS itself without modifying any attributes of the vanilla 3DGS. To achieve this, we take a deep insight into spherical harmonics (SH) and devise an importance-graded SH coefficient encryption strategy to embed the hidden SH coefficients. Furthermore, we employ a convolutional autoencoder to establish a mapping between the original Gaussian primitives’ opacity and the hidden Gaussian primitives’ opacity. Extensive experiments indicate that WaterGS significantly outperforms existing 3D steganography techniques, with 5.31% higher scene fidelity and 3X faster rendering speed, while ensuring security, robustness, and user experience. Codes and data will be released at this https URL.

[CV-66] ObjectFinder: Open-Vocabulary Assistive System for Interactive Object Search by Blind People

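Quick read: This paper addresses the challenges blind people face when searching for objects in daily life. The key to the solution is ObjectFinder, an open-vocabulary interactive object-search prototype that combines object detection, scene description, and navigation so that blind users can independently detect and navigate to the objects they want. The team used co-design and need-finding interviews to understand the practical difficulties of object search, then ran user tests in a laboratory setting that simulated living and working environments. Compared with the existing BeMyEyes and Lookout systems, ObjectFinder did better at supporting user independence and mental mapping, particularly when deployed on more efficient hardware. Its core strength is active target definition, which increases users' autonomy and awareness of their surroundings.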

Link: https://arxiv.org/abs/2412.03118
Authors: Ruiping Liu, Jiaming Zhang, Angela Schön, Karin Müller, Junwei Zheng, Kailun Yang, Kathrin Gerling, Rainer Stiefelhagen
Keywords: Assistive technology, daily lives, people when searching, Assistive, blind people
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Assistive technology can be leveraged by blind people when searching for objects in their daily lives. We created ObjectFinder, an open-vocabulary interactive object-search prototype, which combines object detection with scene description and navigation. It enables blind persons to detect and navigate to objects of their choice. Our approach used co-design for the development of the prototype. We further conducted need-finding interviews to better understand challenges in object search, followed by a study with the ObjectFinder prototype in a laboratory setting simulating a living room and an office, with eight blind users. Additionally, we compared the prototype with BeMyEyes and Lookout for object search. We found that most participants felt more independent with ObjectFinder and preferred it over the baselines when deployed on more efficient hardware, as it enhances mental mapping and allows for active target definition. Moreover, we identified factors for future directions for the development of object-search systems.

[CV-67] Few-Shot Learning with Adaptive Weight Masking in Conditional GANs

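Quick read: This paper addresses overfitting and poor generalization caused by data scarcity in few-shot learning. The key to the solution is a Residual Weight Masking Conditional Generative Adversarial Network (RWM-CGAN), which integrates residual units in the generator to increase network depth and sample quality, and applies a weight mask regularization technique in the discriminator to improve feature learning from small-sample categories. By providing a controlled and clear augmentation of the sample space, the method targets the robustness and generalization problems of few-shot learning.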

Link: https://arxiv.org/abs/2412.03105
Authors: Jiacheng Hu, Zhen Qi, Jianjun Wei, Jiajing Chen, Runyuan Bao, Xinyu Qiu
Keywords: Masking Conditional Generative, Conditional Generative Adversarial, Generative Adversarial Network, Weight Masking Conditional, few-shot learning scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning has revolutionized various fields, yet its efficacy is hindered by overfitting and the requirement of extensive annotated data, particularly in few-shot learning scenarios where limited samples are available. This paper introduces a novel approach to few-shot learning by employing a Residual Weight Masking Conditional Generative Adversarial Network (RWM-CGAN) for data augmentation. The proposed model integrates residual units within the generator to enhance network depth and sample quality, coupled with a weight mask regularization technique in the discriminator to improve feature learning from small-sample categories. This method addresses the core issues of robustness and generalization in few-shot learning by providing a controlled and clear augmentation of the sample space. Extensive experiments demonstrate that RWM-CGAN not only expands the sample space effectively but also enriches the diversity and quality of generated samples, leading to significant improvements in detection and classification accuracy on public datasets. The paper contributes to the advancement of few-shot learning by offering a practical solution to the challenges posed by data scarcity and the need for rapid generalization to new tasks or categories.

[CV-68] MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction

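Quick read: This paper addresses reconstruction of a 3D clothed human body from a monocular image. Because single-view input is inherently ambiguous, existing methods rely on pre-trained SMPL(-X) estimation models or generative models, but these capture only the general body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. The paper proposes a multi-level geometry learning framework built on three core components: a skeleton-level enhancement module, a joint-level augmentation module, and a wrinkle-level refinement module. Concretely, it integrates projected 3D Fourier features into a Gaussian reconstruction model, introduces perturbations to improve joint depth estimation during training, and refines coarse wrinkles with a process resembling the denoising process of a diffusion model. Experiments on two out-of-distribution test sets show that the method outperforms existing state-of-the-art methods.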

Link: https://arxiv.org/abs/2412.03103
Authors: Gangjian Zhang, Nanjie Yao, Shunsi Zhang, Hanfeng Zhao, Guoliang Pang, Jian Shu, Hao Wang
Keywords: clothed human body, monocular image, paper investigates, investigates the research, research task
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine coarse human wrinkles through a process resembling the denoising process of a diffusion model. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.

[CV-69] Lightweight Multiplane Images Network for Real-Time Stereoscopic Conversion from Planar Video

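Quick read: With stereoscopic display technology developing rapidly, this paper addresses the shortage of high-quality stereoscopic image and video resources, and in particular how to balance reconstruction performance and inference efficiency in stereoscopic conversion. The key to the solution is a real-time planar-video stereoscopic conversion network based on multiplane images (MPI), consisting of a detail branch that generates the MPI and a depth-semantic branch that perceives depth information. Unlike models that depend on explicit depth map inputs, the method uses a lightweight depth-semantic branch to extract depth-aware features implicitly. To optimize this lightweight branch, it adopts a "heavy training, light inference" strategy with a coarse-to-fine auxiliary branch used only during training. The method also simplifies MPI rendering to further accelerate inference. Experiments show performance comparable to some state-of-the-art (SOTA) models and real-time inference at 2K resolution; compared with the SOTA TMPI algorithm, it achieves over 40× inference acceleration at similar subjective quality.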

Link: https://arxiv.org/abs/2412.03102
Authors: Shanding Diao, Yang Zhao, Yuan Chen, Zhao Zhang, Wei Jia, Ronggang Wang
Keywords: virtual reality devices, stereoscopic display technologies, stereoscopic conversion, high-quality stereoscopic image, display technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures

Abstract: With the rapid development of stereoscopic display technologies, especially glasses-free 3D screens, and virtual reality devices, stereoscopic conversion has become an important task to address the lack of high-quality stereoscopic image and video resources. Current stereoscopic conversion algorithms typically struggle to balance reconstruction performance and inference efficiency. This paper proposes a planar video real-time stereoscopic conversion network based on multi-plane images (MPI), which consists of a detail branch for generating MPI and a depth-semantic branch for perceiving depth information. Unlike models that depend on explicit depth map inputs, the proposed method employs a lightweight depth-semantic branch to extract depth-aware features implicitly. To optimize the lightweight branch, a heavy training but light inference strategy is adopted, which involves designing a coarse-to-fine auxiliary branch that is only used during the training stage. In addition, the proposed method simplifies the MPI rendering process for stereoscopic conversion scenarios to further accelerate the inference. Experimental results demonstrate that the proposed method can achieve comparable performance to some state-of-the-art (SOTA) models and support real-time inference at 2K resolution. Compared to the SOTA TMPI algorithm, the proposed method obtains similar subjective quality while achieving over 40× inference acceleration.
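
For readers unfamiliar with multiplane images, the rendering step the abstract refers to is usually front-to-back "over" compositing of the planes. The sketch below shows that standard operation for a single pixel; it is not the paper's simplified renderer, and the function name `composite_mpi` is made up for illustration.

```python
def composite_mpi(colors, alphas):
    """Over-composite MPI planes for one pixel, ordered nearest to farthest.

    colors[i] : (r, g, b) of plane i
    alphas[i] : opacity of plane i, in [0, 1]
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer planes
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for c in range(3):
            out[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return tuple(out)
```

A half-transparent red plane in front of an opaque green plane yields an even mix of the two, while an opaque front plane hides everything behind it.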

[CV-70] Expanding Event Modality Applications through a Robust CLIP-Based Encoder

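Quick read: This paper addresses how to use the capabilities of existing image models such as CLIP to process event-based data when large-scale event datasets are lacking. The key to the solution is a powerful encoder that transfers CLIP's image embedding capability to event data, aligning event embeddings with image embeddings to support zero-shot learning and text alignment while avoiding catastrophic forgetting. The encoder performs strongly in object recognition, is competitive on zero-shot and few-shot learning tasks, and generalizes to events extracted from video data without additional training, demonstrating broad applicability and potential for cross-modal interaction.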

Link: https://arxiv.org/abs/2412.03093
Authors: Sungheon Jeong, Hanning Chen, Sanggeon Yun, Suhyeon Cho, Wenjun Huang, Xiangjian Liu, Mohsen Imani
Keywords: diverse domains, transfers CLIP, paper introduces, introduces a powerful, expanding its applicability
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This paper introduces a powerful encoder that transfers CLIP's capabilities to event-based data, enhancing its utility and expanding its applicability across diverse domains. While large-scale datasets have significantly advanced image-based models, the scarcity of comprehensive event datasets has limited performance potential in the event modality. To address this challenge, we adapt CLIP's architecture to align event embeddings with image embeddings, supporting zero-shot learning and preserving text alignment while mitigating catastrophic forgetting. Our encoder achieves strong performance in object recognition, with competitive results in zero-shot and few-shot learning tasks. Notably, it generalizes effectively to events extracted from video data without requiring additional training, highlighting its versatility. Additionally, we integrate this encoder within a cross-modality framework that facilitates interaction across five modalities (Image, Event, Text, Sound, and Depth), expanding the possibilities for cross-modal applications. Overall, this work underscores the transformative potential of a robust event encoder, broadening the scope and utility of event-based data across various fields.
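
Once event embeddings share a space with CLIP's text embeddings, zero-shot recognition reduces to a nearest-neighbor lookup by cosine similarity. The sketch below illustrates that general CLIP-style procedure with toy vectors; the function names and the embeddings are placeholders, and no real encoder is involved.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(event_embedding, text_embeddings):
    """Return the label whose text embedding best matches the event embedding.

    text_embeddings maps a label string to its (placeholder) text embedding.
    """
    return max(
        text_embeddings,
        key=lambda label: cosine(event_embedding, text_embeddings[label]),
    )
```

With `{"car": [1.0, 0.0], "dog": [0.0, 1.0]}` as toy text embeddings, an event embedding of `[0.9, 0.1]` is assigned the label `"car"`.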

[CV-71] Mimir: Improving Video Diffusion Models for Precise Text Understanding

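Quick read: This paper addresses the limited text comprehension of existing video diffusion models in text-to-video (T2V) generation. The key to the solution is Mimir, an end-to-end training framework with a carefully designed token fuser that harmonizes the outputs of text encoders and large language models (LLMs), bridging the feature distribution gap between the two text modeling paradigms. This design lets the T2V model fully leverage learned video priors while exploiting the text understanding and generation strengths of LLMs, improving both video quality and text comprehension, especially for short captions and shifting motions.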

Link: https://arxiv.org/abs/2412.03085
Authors: Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang
Keywords: key control signal, narrative nature, video generation due, key control, control signal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: this https URL

[CV-72] Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

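Quick read: This paper addresses the inability of monocular depth estimation methods to produce depth that is consistent across video frames. The key to the solution is a new method, Align3R, which uses the DUSt3R model to align monocular depth maps from different timesteps. It first fine-tunes DUSt3R to accept additional monocular depth inputs for dynamic scenes, then runs an optimization that reconstructs depth maps and camera poses together. Experiments show that Align3R outperforms baseline methods at depth and camera pose estimation for monocular video and produces temporally consistent depth maps.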

Link: https://arxiv.org/abs/2412.03079
Authors: Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu
Keywords: enable high-quality depth, methods enable high-quality, depth, high-quality depth estimation, estimation methods enable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract: Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with performance superior to baseline methods.

[CV-73] RoDyGS: Robust Dynamic Gaussian Splatting for Casual Videos

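Quick read: This paper addresses the optimization of dynamic neural fields from casual videos, which lack direct 3D information such as camera trajectories or scene geometry. The key to the solution is RoDyGS, an optimization pipeline for dynamic Gaussian Splatting from casual videos. RoDyGS learns scene motion and underlying geometry by separating dynamic and static primitives, and adds motion and geometric regularization terms so that the learned motion and geometry are physically plausible. The paper also introduces the Kubric-MRig benchmark, which provides extensive camera and object motion along with simultaneous multi-view captures, features missing from previous benchmarks. Experiments show that the method significantly outperforms previous pose-free dynamic neural field methods and achieves rendering quality competitive with existing pose-free static neural field methods.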

Link: https://arxiv.org/abs/2412.03077
Authors: Yoonwoo Jeong, Junmyeong Lee, Hoseung Choi, Minsu Cho
Keywords: reducing computational costs, Dynamic view synthesis, achieving high-fidelity rendering, view synthesis, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Dynamic view synthesis (DVS) has advanced remarkably in recent years, achieving high-fidelity rendering while reducing computational costs. Despite the progress, optimizing dynamic neural fields from casual videos remains challenging, as these videos do not provide direct 3D information, such as camera trajectories or the underlying scene geometry. In this work, we present RoDyGS, an optimization pipeline for dynamic Gaussian Splatting from casual videos. It effectively learns motion and underlying geometry of scenes by separating dynamic and static primitives, and ensures that the learned motion and geometry are physically plausible by incorporating motion and geometric regularization terms. We also introduce a comprehensive benchmark, Kubric-MRig, that provides extensive camera and object motion along with simultaneous multi-view captures, features that are absent in previous benchmarks. Experimental results demonstrate that the proposed method significantly outperforms previous pose-free dynamic neural fields and achieves competitive rendering quality compared to existing pose-free static neural fields. The code and data are publicly available at this https URL.

[CV-74] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

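Quick read: This paper addresses the long-standing gap between multimodal understanding and generation: how to achieve both high-level semantic understanding and fine-grained image generation without sacrificing performance. The key to the solution is a dual-codebook architecture that decouples semantic and pixel-level feature learning while keeping the two aligned through a shared mapping mechanism. This design gives the model direct access to both the high-level semantic representations needed for understanding and the fine-grained visual features needed for generation, yielding strong performance on multimodal tasks.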

Link: https://arxiv.org/abs/2412.03069
Authors: Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu
Keywords: reconstruction-targeted Vector Quantization, unified image tokenizer, Vector Quantization, tokenizer that bridges, bridges the long-standing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow’s superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.
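
One way to picture "shared indices" across two codebooks is a joint lookup: a single index is chosen so that the same position retrieves both a semantic code and a pixel-level code. The sketch below is an assumption-laden illustration, not the authors' implementation; in particular, choosing the index by a weighted sum of squared distances, the `w_pix` weight, and all names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shared_index(sem_feat, pix_feat, sem_codebook, pix_codebook, w_pix=1.0):
    """Pick one index that jointly matches both codebooks.

    sem_codebook[i] and pix_codebook[i] are paired entries, so the returned
    index retrieves a semantic code and a pixel-level code at once.
    """
    costs = [
        sq_dist(sem_feat, s) + w_pix * sq_dist(pix_feat, p)
        for s, p in zip(sem_codebook, pix_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)
```

Because the entries are paired, a downstream understanding model can read `sem_codebook[i]` while a generator reads `pix_codebook[i]` from the same token index `i`.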

[CV-75] Lightweight Stochastic Video Prediction via Hybrid Warping

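Quick read: This paper addresses prediction accuracy for dynamic regions in video prediction, particularly in critical applications such as autonomous driving, remote working, and telemedicine. The key to the solution is a novel stochastic long-term video prediction model that focuses on dynamic regions through a hybrid warping strategy. By integrating frames generated with forward and backward warping, the model compensates for the weaknesses of each technique, improving the accuracy and realism of moving regions, and it handles motion uncertainty by making stochastic predictions. For real-time prediction, the paper also introduces a MobileNet-based lightweight architecture. The model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.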

Link: https://arxiv.org/abs/2412.03061
Authors: Kazuki Kotoyori, Shota Hirose, Heming Sun, Jiro Katto
Keywords: deep neural networks, Accurate video prediction, remote working, Accurate video, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE VCIP 2024

Abstract:Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warpings, our approach effectively compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions in videos while also addressing uncertainty by making stochastic predictions that account for various motions. Furthermore, considering real-time predictions, we introduce a MobileNet-based lightweight architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.
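
The difference between the two warping directions is easiest to see in one dimension with integer flow. The sketch below is a toy illustration only: the abstract does not say how SVPHW combines the two warped frames, so the hole-filling rule in `hybrid_warp` and all function names are assumptions.

```python
def backward_warp(src, flow):
    """out[x] samples src at x + flow[x]; every output pixel receives a value."""
    n = len(src)
    return [src[min(max(x + flow[x], 0), n - 1)] for x in range(n)]


def forward_warp(src, flow):
    """src[x] is pushed to x + flow[x]; pixels nothing lands on stay None."""
    n = len(src)
    out = [None] * n
    for x in range(n):
        target = x + flow[x]
        if 0 <= target < n:
            out[target] = src[x]
    return out


def hybrid_warp(src, fwd_flow, bwd_flow):
    """Assumed blend: keep forward-warped pixels, fill holes from backward warping."""
    fwd = forward_warp(src, fwd_flow)
    bwd = backward_warp(src, bwd_flow)
    return [f if f is not None else b for f, b in zip(fwd, bwd)]
```

Shifting `[10, 20, 30, 40]` right by one pixel leaves a hole at position 0 under forward warping, which the backward-warped frame fills in.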

[CV-76] CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

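Quick read: This paper addresses a limitation of unsupervised 3D representation learning: high GPU memory consumption forces the image and point cloud modalities to be pre-trained separately, so the interaction between them is neglected. The key to the solution is CLAP (Curvature sampLing and swApping Prototype assignment prediction), which makes three contributions: 1) Curvature Sampling selects the more informative points/pixels for pre-training, overcoming the GPU memory problem; 2) learnable prototypes represent parts of the scene in a common feature space, and swapping prototype assignment prediction learns the interaction between the two modalities; 3) an Expectation-Maximization training scheme and a Gram Matrix Regularization Loss optimize the learnable prototypes and avoid collapse. Experiments on NuScenes show that CLAP achieves 300% more performance gain than the previous SOTA 3D pre-training method based on differentiable rendering.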

Link: https://arxiv.org/abs/2412.03059
Authors: Runjian Chen, Hang Zhang, Avinash Ravichandran, Wenqi Shao, Alex Wong, Ping Luo
Keywords: representation learning, GPU memory consumption, promising to reduce, reduce the labeling, labeling burden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Unsupervised 3D representation learning via masked-and-reconstruction with differentiable rendering is promising to reduce the labeling burden for fusion 3D perception. However, previous literature conducts pre-training for different modalities separately because of the high GPU memory consumption. Consequently, the interaction between the two modalities (images and point clouds) is neglected during pre-training. In this paper, we explore joint unsupervised pre-training for fusion 3D perception via differentiable rendering and propose CLAP, short for Curvature sampLing and swApping Prototype assignment prediction. The contributions are three-fold. 1) To overcome the GPU memory consumption problem, we propose Curvature Sampling to sample the more informative points/pixels for pre-training. 2) We propose to use learnable prototypes to represent parts of the scenes in a common feature space and bring the idea of swapping prototype assignment prediction to learn the interaction between the two modalities. 3) To further optimize learnable prototypes, we propose an Expectation-Maximization training scheme to maximize the similarity between embeddings and prototypes, followed by a Gram Matrix Regularization Loss to avoid collapse. Experimental results on NuScenes show that CLAP achieves 300% more performance gain as compared to the previous SOTA 3D pre-training method via differentiable rendering. Codes and models will be released.

[CV-77] Revisiting Energy-Based Model for Out-of-Distribution Detection

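Quick read: This paper addresses the robustness and generalizability of deep learning models in out-of-distribution (OOD) detection. Existing OOD detection methods usually depend on crafted outlier datasets or data augmentations, but the frequent mismatch between such crafted data and real OOD data limits robustness and generalization. The proposed solution is Outlier Exposure by Simple Transformations (OEST), a framework that improves OOD detection with "peripheral-distribution" (PD) data. The key is to replace hand-crafted outliers with PD data generated by simple data transformations, and to study PD data through energy-based models (EBMs), establishing an energy barrier that separates in-distribution (ID) from OOD samples. The paper further proposes an energy-barrier loss, motivated by the energy barrier concept, to replace the classical energy-bounded loss. The resulting paradigm, OEST*, achieves a more effective and theoretically grounded separation of ID and OOD samples.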

Link: https://arxiv.org/abs/2412.03058
Authors: Yifan Wu, Xichen Ye, Songmin Dai, Dengye Pan, Xiaoqiang Li, Weizhong Zhang, Yifan Chen
Keywords: robustifying deep learning, OOD detection, deep learning models, OOD, trained distribution
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Abstract: Out-of-distribution (OOD) detection is an essential approach to robustifying deep learning models, enabling them to identify inputs that fall outside of their trained distribution. Existing OOD detection methods usually depend on crafted data, such as specific outlier datasets or elaborate data augmentations. While this is reasonable, the frequent mismatch between crafted data and OOD data limits model robustness and generalizability. In response to this issue, we introduce Outlier Exposure by Simple Transformations (OEST), a framework that enhances OOD detection by leveraging “peripheral-distribution” (PD) data. Specifically, PD data are samples generated through simple data transformations, thus providing an efficient alternative to manually curated outliers. We adopt energy-based models (EBMs) to study PD data. We recognize the “energy barrier” in OOD detection, which characterizes the energy difference between in-distribution (ID) and OOD samples and eases detection. PD data are introduced to establish the energy barrier during training. Furthermore, this energy barrier concept motivates a theoretically grounded energy-barrier loss to replace the classical energy-bounded loss, leading to an improved paradigm, OEST*, which achieves a more effective and theoretically sound separation between ID and OOD samples. We perform empirical validation of our proposal, and extensive experiments across various benchmarks demonstrate that OEST* achieves better or similar accuracy compared with state-of-the-art methods.
MSC classes: 68T05, 68T45; ACM classes: I.2.10; I.5.1
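
As background for the energy-based view, the score commonly used in energy-based OOD detection is the free energy of a classifier's logits, E(x) = -T · log Σ_k exp(f_k(x) / T), with higher energy suggesting an OOD input. The sketch below computes that standard score. It does not reproduce the paper's energy-barrier loss, and the threshold is an arbitrary illustrative parameter.

```python
import math


def energy(logits, temperature=1.0):
    """Free energy of a logit vector: -T * log(sum_k exp(logit_k / T)).

    Lower energy means the input looks more in-distribution.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum


def is_ood(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its energy exceeds the chosen threshold."""
    return energy(logits, temperature) > threshold
```

A confident prediction such as `[10.0, 0.0]` has much lower energy than a flat one such as `[0.0, 0.0]`, so a threshold between the two separates them.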
zh
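
The abstract does not spell out the energy-barrier loss, so the sketch below only illustrates the commonly used energy score for energy-based OOD detection, E(x) = -T * log(sum_k exp(f_k(x) / T)), and the energy gap between a confident and an uninformative prediction. The logits, the threshold, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math

def energy_score(logits, temperature=1.0):
    """Free energy E(x) = -T * log(sum_k exp(f_k(x) / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp

def is_ood(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its energy exceeds a chosen threshold."""
    return energy_score(logits, temperature) > threshold

# Hypothetical logits: a peaked prediction (typical of ID inputs) and a flat one.
confident_logits = [10.0, 0.0, 0.0]
flat_logits = [0.0, 0.0, 0.0]

# The gap between the two energies is what a detection threshold has to separate.
energy_gap = energy_score(flat_logits) - energy_score(confident_logits)
```

A larger gap between ID and OOD energies makes the threshold easier to place, which is the intuition behind training to establish a barrier.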

[CV-78] Point-GN: A Non-Parametric Network Using Gaussian Positional Encoding for Point Cloud Classification WACV

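[Quick read]: This paper addresses 3D point cloud classification, in particular efficiency and accuracy in resource-constrained, real-time settings. The key to the solution is Point-GN, a novel non-parametric network that uses non-learnable components, namely Farthest Point Sampling (FPS), k-Nearest Neighbors (k-NN), and Gaussian Positional Encoding (GPE), to extract local and global geometric features. This design removes the need for additional training and significantly reduces computational complexity while keeping classification performance high. Point-GN reaches classification accuracies of 85.29% on ModelNet40 and 85.89% on ScanObjectNN, outperforming existing non-parametric methods and matching fully trained models without any learnable parameters.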

Link: https://arxiv.org/abs/2412.03056
Authors: Marzieh Mohammadi,Amir Salarpour
Keywords: Gaussian Positional Encoding, Farthest Point Sampling, paper introduces Point-GN, efficient and accurate, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This paper has been accepted for presentation at the IEEE Winter Conference on Applications of Computer Vision (WACV) 2025

Abstract:This paper introduces Point-GN, a novel non-parametric network for efficient and accurate 3D point cloud classification. Unlike conventional deep learning models that rely on a large number of trainable parameters, Point-GN leverages non-learnable components, specifically Farthest Point Sampling (FPS), k-Nearest Neighbors (k-NN), and Gaussian Positional Encoding (GPE), to extract both local and global geometric features. This design eliminates the need for additional training while maintaining high performance, making Point-GN particularly suited for real-time, resource-constrained applications. We evaluate Point-GN on two benchmark datasets, ModelNet40 and ScanObjectNN, achieving classification accuracies of 85.29% and 85.89%, respectively, while significantly reducing computational complexity. Point-GN outperforms existing non-parametric methods and matches the performance of fully trained models, all with zero learnable parameters. Our results demonstrate that Point-GN is a promising solution for 3D point cloud classification in practical, real-time environments.
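
The three non-learnable components named in the abstract can be sketched in a few lines. FPS and k-NN below follow their standard definitions; the Gaussian encoding (radial responses around fixed centers with a shared sigma) is only an assumed form, since the abstract does not give the exact formula, and the point cloud, centers, and sigma are made up for illustration.

```python
import math

def farthest_point_sampling(points, n_samples, start=0):
    """Greedily pick points that maximize distance to the already chosen set."""
    chosen = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen

def k_nearest_neighbors(points, center, k):
    """Indices of the k points closest to the given center point."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

def gaussian_encoding(offset, centers, sigma):
    """Assumed form: Gaussian response of each coordinate around fixed centers."""
    return [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for x in offset for c in centers]

# Toy point cloud (hypothetical data).
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
anchors = farthest_point_sampling(cloud, n_samples=2)
neighbors = k_nearest_neighbors(cloud, cloud[anchors[0]], k=2)
features = gaussian_encoding((0.5, 0.0, 0.0), centers=[-1.0, 0.0, 1.0], sigma=0.5)
```

None of these steps has trainable weights, which is what allows a pipeline of this kind to skip training entirely.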

[CV-79] TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

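[Quick read]: This paper addresses the time- and labor-intensive labeling of LiDAR point clouds and proposes TREND (Temporal REndering with Neural fielD), an unsupervised 3D representation learning method. The key idea is to exploit temporal LiDAR sequences and learn 3D representations by forecasting future observations. Unlike conventional contrastive learning or masked autoencoding approaches, TREND generates 3D embeddings across time with a recurrent embedding scheme, represents the 3D scene with a temporal neural field, and computes the loss through differentiable rendering. According to the authors, it is the first work to bring temporal forecasting to unsupervised 3D representation learning. Experiments show that TREND clearly outperforms existing state-of-the-art unsupervised pre-training methods on downstream 3D object detection, indicating that temporal forecasting improves LiDAR perception.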

Link: https://arxiv.org/abs/2412.03054
Authors: Runjian Chen,Hyoungseob Park,Bo Zhang,Wenqi Shao,Ping Luo,Alex Wong
Keywords: Labeling LiDAR point, labeling burden, LiDAR point clouds, spurs recent unsupervised, Labeling LiDAR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Labeling LiDAR point clouds is notoriously time-and-energy-consuming, which spurs recent unsupervised 3D representation learning methods to alleviate the labeling burden in LiDAR perception via pretrained weights. Almost all existing work focuses on a single frame of LiDAR point cloud and neglects the temporal LiDAR sequence, which naturally accounts for object motion (and their semantics). Instead, we propose TREND, namely Temporal REndering with Neural fielD, to learn 3D representation via forecasting the future observation in an unsupervised manner. Unlike existing work that follows conventional contrastive learning or masked auto encoding paradigms, TREND integrates forecasting for 3D pre-training through a Recurrent Embedding scheme to generate 3D embedding across time and a Temporal Neural Field to represent the 3D scene, through which we compute the loss using differentiable rendering. To the best of our knowledge, TREND is the first work on temporal forecasting for unsupervised 3D representation learning. We evaluate TREND on downstream 3D object detection tasks on popular datasets, including NuScenes, Once and Waymo. Experiment results show that TREND brings up to 90% more improvement as compared to previous SOTA unsupervised 3D pre-training methods and generally improves different downstream models across datasets, demonstrating that indeed temporal forecasting brings improvement for LiDAR perception. Codes and models will be released.

[CV-80] Point-GR: Graph Residual Point Cloud Network for 3D Object Classification and Segmentation ICPR2024

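[Quick read]: This paper addresses 3D shape analysis in point cloud data, in particular how to represent 3D information effectively and extract meaningful features for classification. The key is Point-GR, a deep learning architecture designed to lift unordered raw point clouds into a higher-dimensional space while preserving local geometric features. Point-GR introduces residual-based learning into the network to mitigate the point permutation issue and uses significantly fewer parameters than baseline graph-based networks for classification and part segmentation, where its results are competitive. On the S3DIS benchmark it reaches a scene segmentation mean IoU of 73.47%.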

Link: https://arxiv.org/abs/2412.03052
Authors: Md Meraz,Md Afzal Ansari,Mohammed Javed,Pavan Chakraborty
Keywords: gathered significant attention, point cloud data, recent years, shape analysis, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: ICPR 2024 G2SP-CV Workshop, Dec 1-5, 2024 Kolkata, India

Abstract:In recent years, the challenge of 3D shape analysis within point cloud data has gathered significant attention in computer vision. Addressing the complexities of effective 3D information representation and meaningful feature extraction for classification tasks remains crucial. This paper presents Point-GR, a novel deep learning architecture designed explicitly to transform unordered raw point clouds into higher dimensions while preserving local geometric features. It introduces residual-based learning within the network to mitigate the point permutation issues in point cloud data. The proposed Point-GR network significantly reduced the number of network parameters in Classification and Part-Segmentation compared to baseline graph-based networks. Notably, the Point-GR model achieves a state-of-the-art scene segmentation mean IoU of 73.47% on the S3DIS benchmark dataset, showcasing its effectiveness. Furthermore, the model shows competitive results in Classification and Part-Segmentation tasks.
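
For readers unfamiliar with the metric quoted above, here is a minimal sketch of mean IoU over per-point labels. The label lists are invented, and benchmark protocols such as the one used for S3DIS typically accumulate intersections and unions over a whole test set rather than a single sample, so treat this as the definition only.

```python
def mean_iou(predicted, target, num_classes):
    """Average per-class intersection-over-union, skipping classes absent from both."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Hypothetical per-point labels for a six-point scene with three classes.
target = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
score = mean_iou(predicted, target, num_classes=3)
```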

[CV-81] Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

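[Quick read]: This paper addresses two main problems in video anomaly detection: (1) existing reconstruction-based methods have limited robustness in open-set scenarios, and (2) they overemphasize detailed motion reconstruction while having limited capacity for it. The key is a frequency-guided diffusion model whose robustness is strengthened by perturbation training. A trainable generator first produces perturbed samples that are used to train the diffusion model, which improves its adaptability to open-set scenarios. Perturbed samples are also introduced at inference, where they affect the reconstruction of normal and abnormal motion differently and so make the two easier to separate. In addition, a masking method based on the 2D discrete cosine transform separates high-frequency from low-frequency information, so that the diffusion model, guided by the high-frequency information of the observed motion, can concentrate on generating low-frequency information and reconstruct the motion accurately. Experiments on five video anomaly detection datasets demonstrate the effectiveness of the method.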

Link: https://arxiv.org/abs/2412.03044
Authors: Xiaofeng Tan,Hongsong Wang,Xin Geng
Keywords: challenging open-set task, proxy task, computer vision, perturbation training, essential yet challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video anomaly detection is an essential yet challenging open-set task in computer vision, often addressed by leveraging reconstruction as a proxy task. However, existing reconstruction-based methods encounter challenges in two main aspects: (1) limited model robustness for open-set scenarios, and (2) an overemphasis on, but restricted capacity for, detailed motion reconstruction. To this end, we propose a novel frequency-guided diffusion model with perturbation training, which enhances the model robustness by perturbation training and emphasizes the principal motion components guided by motion frequencies. Specifically, we first use a trainable generator to produce perturbative samples for perturbation training of the diffusion model. During the perturbation training phase, the model robustness is enhanced and the domain of the reconstructed model is broadened by training against this generator. Subsequently, perturbative samples are introduced for inference, which impacts the reconstruction of normal and abnormal motions differentially, thereby enhancing their separability. Considering that motion details originate from high-frequency information, we propose a masking method based on 2D discrete cosine transform to separate high-frequency information and low-frequency information. Guided by the high-frequency information from observed motion, the diffusion model can focus on generating low-frequency information, and thus reconstructing the motion accurately. Experimental results on five video anomaly detection datasets, including human-related and open-set benchmarks, demonstrate the effectiveness of the proposed method. Our code is available at this https URL.
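
The DCT-based frequency split can be illustrated with a small, dependency-free sketch. It applies an orthonormal 2D DCT-II to a frames-by-coordinates motion matrix and separates coefficients with a binary mask. The mask rule (`u + w < cutoff`), the toy motion matrix, and all function names are assumptions made for illustration; the paper's actual mask design is not described in the abstract.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are frequencies)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(signal):
    """2D DCT: C_T @ signal @ C_J^T for a T x J matrix."""
    c_t, c_j = dct_matrix(len(signal)), dct_matrix(len(signal[0]))
    return matmul(matmul(c_t, signal), transpose(c_j))

def idct2(coeffs):
    """Inverse of dct2; valid because the basis matrices are orthonormal."""
    c_t, c_j = dct_matrix(len(coeffs)), dct_matrix(len(coeffs[0]))
    return matmul(matmul(transpose(c_t), coeffs), c_j)

def split_frequencies(signal, cutoff):
    """Split a motion matrix into low- and high-frequency parts with a binary mask."""
    coeffs = dct2(signal)
    low = [[v if u + w < cutoff else 0.0 for w, v in enumerate(row)]
           for u, row in enumerate(coeffs)]
    high = [[v - l for v, l in zip(row, lrow)] for row, lrow in zip(coeffs, low)]
    return idct2(low), idct2(high)

# Toy motion: 4 frames x 3 coordinates, with a small frame-to-frame jitter.
motion = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [0.0, 1.0, 2.0], [0.5, 1.5, 2.5]]
low_part, high_part = split_frequencies(motion, cutoff=2)
```

Because the transform is linear, the two parts always sum back to the original motion, so a model can be conditioned on one part and asked to generate the other.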

[CV-82] ASIGN: An Anatomy-aware Spatial Imputation Graphic Network for 3D Spatial Transcriptomics

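[Quick read]: This paper addresses the high cost and missing data that hamper 3D spatial transcriptomics (ST) analysis. The key is the Anatomy-aware Spatial Imputation Graph Network (ASIGN), which combines 3D whole slide imaging (WSI) with a single 2D ST slide to model 3D ST data precisely and affordably. ASIGN extends existing 2D spatial relationships into 3D through cross-layer overlap and similarity-based expansion, and uses a multi-level spatial attention graph network to integrate features from the different data sources. Experiments show that ASIGN achieves state-of-the-art performance in both 2D and 3D scenarios.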

Link: https://arxiv.org/abs/2412.03026
Authors: Junchao Zhu,Ruining Deng,Tianyuan Yao,Juming Xiong,Chongyu Qu,Junlin Guo,Siqi Lu,Mengmeng Yin,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo
Keywords: enables medical computer, medical computer vision, molecular profiles underlying, profiles underlying morphological, computer vision scientists
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatial transcriptomics (ST) is an emerging technology that enables medical computer vision scientists to automatically interpret the molecular profiles underlying morphological features. Currently, however, most deep learning-based ST analyses are limited to two-dimensional (2D) sections, which can introduce diagnostic errors due to the heterogeneity of pathological tissues across 3D sections. Expanding ST to three-dimensional (3D) volumes is challenging due to the prohibitive costs; a 2D ST acquisition already costs over 50 times more than whole slide imaging (WSI), and a full 3D volume with 10 sections can be an order of magnitude more expensive. To reduce costs, scientists have attempted to predict ST data directly from WSI without performing actual ST acquisition. However, these methods typically yield unsatisfying results. To address this, we introduce a novel problem setting: 3D ST imputation using 3D WSI histology sections combined with a single 2D ST slide. To do so, we present the Anatomy-aware Spatial Imputation Graph Network (ASIGN) for more precise, yet affordable, 3D ST modeling. The ASIGN architecture extends existing 2D spatial relationships into 3D by leveraging cross-layer overlap and similarity-based expansion. Moreover, a multi-level spatial attention graph network integrates features comprehensively across different data sources. We evaluated ASIGN on three public spatial transcriptomics datasets, with experimental results demonstrating that ASIGN achieves state-of-the-art performance on both 2D and 3D scenarios. Code is available at this https URL.

[CV-83] PEMF-VVTO: Point-Enhanced Video Virtual Try-on via Mask-free Paradigm

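[Quick read]: This paper addresses a weakness of existing mask-based video virtual try-on methods: under complex scene changes and posture movements, overly large and incoherent masks destroy the spatial-temporal information of the original video and reduce the fidelity and coherence of the try-on result. The key is a point-enhanced mask-free video virtual try-on framework (PEMF-VVTO). A pre-trained mask-based try-on model is first used to build large-scale paired training data (pseudo-person samples), so that the model can perceive the original spatial-temporal information while transferring garments accurately. Then, based on pre-acquired sparse frame-cloth and frame-frame point alignments, the authors design point-enhanced spatial attention (PSA) and point-enhanced temporal attention (PTA) to further improve the try-on accuracy and video coherence of the mask-free model. PSA explicitly guides garment transfer to the desired locations through sparse semantic alignments, while PTA applies temporal attention over sparse point correspondences to make the generated video smoother.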

Link: https://arxiv.org/abs/2412.03021
Authors: Tianyu Chang,Xiaohao Chen,Zhichao Wei,Xuanpu Zhang,Qing-Guo Chen,Weihua Luo,Xun Yang
Keywords: semantically aligned try-on, aligned try-on area, Video Virtual Try-on, Virtual Try-on aims, source person video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video Virtual Try-on aims to fluently transfer the garment image to a semantically aligned try-on area in the source person video. Previous methods leveraged the inpainting mask to remove the original garment in the source video, thus achieving accurate garment transfer on simple model videos. However, when these methods are applied to realistic video data with more complex scene changes and posture movements, the overly large and incoherent agnostic masks will destroy the essential spatial-temporal information of the original video, thereby inhibiting the fidelity and coherence of the try-on video. To alleviate this problem, we propose a novel point-enhanced mask-free video virtual try-on framework (PEMF-VVTO). Specifically, we first leverage the pre-trained mask-based try-on model to construct large-scale paired training data (pseudo-person samples). Training on these mask-free data enables our model to perceive the original spatial-temporal information while realizing accurate garment transfer. Then, based on the pre-acquired sparse frame-cloth and frame-frame point alignments, we design the point-enhanced spatial attention (PSA) and point-enhanced temporal attention (PTA) to further improve the try-on accuracy and video coherence of the mask-free model. Concretely, PSA explicitly guides the garment transfer to desirable locations through the sparse semantic alignments of video frames and cloth. PTA exploits the temporal attention on sparse point correspondences to enhance the smoothness of generated videos. Extensive qualitative and quantitative experiments clearly illustrate that our PEMF-VVTO can generate more natural and coherent try-on videos than existing state-of-the-art methods.

[CV-84] Unsupervised Network for Single Image Raindrop Removal

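[Quick read]: This paper addresses image quality degradation caused by raindrops, which seriously reduces the performance of vision systems. Most existing raindrop removal algorithms are supervised and need paired images with and without raindrops, which are hard to obtain in practice. The paper proposes an unsupervised deep neural network that only needs two unpaired image sets, one with raindrops and one without. The key is a cycle network architecture that performs layer separation, decomposing a rainy image into a raindrop layer, a transparency mask, and a clean background layer. The clean background layer is the raindrop removal result, and the transparency mask indicates where the raindrops are. The model also uses a feedback mechanism that refines low-level representations with high-level information: the output of the previous iteration is fed into the next iteration together with the rainy input image, so raindrops are removed gradually. Experiments show strong results on both quantitative metrics and visual quality.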

Link: https://arxiv.org/abs/2412.03019
Authors: Huijiao Wang,Shenghao Zhao,Lei Yu,Xulei Yang
Keywords: quality degradation caused, vision systems, degradation caused, important but challenging, challenging problems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 9 figures

Abstract:Image quality degradation caused by raindrops is one of the most important but challenging problems that reduce the performance of vision systems. Most existing raindrop removal algorithms are based on a supervised learning method using pairwise images, which are hard to obtain in real-world applications. This study proposes a deep neural network for raindrop removal based on unsupervised learning, which only requires two unpaired image sets with and without raindrops. Our proposed model performs layer separation based on cycle network architecture, which aims to separate a rainy image into a raindrop layer, a transparency mask, and a clean background layer. The clean background layer is the target raindrop removal result, while the transparency mask indicates the spatial locations of the raindrops. In addition, the proposed model applies a feedback mechanism to benefit layer separation by refining low-level representation with high-level information, i.e., the output of the previous iteration is used as input for the next iteration, together with the input image with raindrops. As a result, raindrops could be gradually removed through this feedback manner. Extensive experiments on raindrop benchmark datasets demonstrate the effectiveness of the proposed method on quantitative metrics and visual quality.

[CV-85] Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

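[Quick read]: This paper addresses the difficulty that existing diffusion prior-based methods have in balancing pixel-level fidelity and perceptual quality in image super-resolution (SR), and in adapting to different user preferences at inference time. The key is the Pixel-level and Semantic-level Adjustable SR model (PiSA-SR), which learns two Low-Rank Adaptation (LoRA) modules on top of a pre-trained Stable Diffusion (SD) model. PiSA-SR formulates SR as learning the residual between the low-quality input and the high-quality output and decouples the learning objective into two LoRA weight spaces: one trained with an \ell_2 loss for pixel-level regression, the other trained with LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. Two adjustable guidance scales introduced at inference control the strength of pixel-level fidelity and semantic-level detail, so PiSA-SR can deliver SR results that match user preference without re-training.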

Link: https://arxiv.org/abs/2412.03017
Authors: Lingchen Sun,Rongyuan Wu,Zhiyuan Ma,Shuaizheng Liu,Qiaosi Yi,Lei Zhang
Keywords: shown impressive results, real-world image super-resolution, Diffusion prior-based methods, image super-resolution, shown impressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion prior-based methods have shown impressive results in real-world image super-resolution (SR). However, most existing methods entangle pixel-level and semantic-level SR objectives in the training process, struggling to balance pixel-wise fidelity and perceptual quality. Meanwhile, users have varying preferences on SR results, thus it is demanded to develop an adjustable SR model that can be tailored to different fidelity-perception preferences during inference without re-training. We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. We first formulate the SD-based SR problem as learning the residual between the low-quality input and the high-quality output, then show that the learning objective can be decoupled into two distinct LoRA weight spaces: one is characterized by the \ell_2 loss for pixel-level regression, and another is characterized by the LPIPS and classifier score distillation losses to extract semantic information from pre-trained classification and SD models. In its default setting, PiSA-SR can be performed in a single diffusion step, achieving leading real-world SR results in both quality and efficiency. By introducing two adjustable guidance scales on the two LoRA modules to control the strengths of pixel-wise fidelity and semantic-level details during inference, PiSA-SR can offer flexible SR results according to user preference without re-training. Codes and models can be found at this https URL.
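
To make the idea of two separately scaled LoRA modules concrete, the sketch below adds two low-rank updates B @ A to a frozen weight matrix, each with its own scale. This is a generic LoRA illustration under assumed names and toy matrices. It is not PiSA-SR's formulation, whose guidance scales may act on model outputs rather than on weights.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(b_matrix, a_matrix):
    """Low-rank weight update B @ A, with B of shape (out, r) and A of shape (r, in)."""
    return matmul(b_matrix, a_matrix)

def combine(base, pixel_lora, semantic_lora, pixel_scale, semantic_scale):
    """Add two scaled low-rank updates to a frozen base weight matrix."""
    d_pix = lora_delta(*pixel_lora)
    d_sem = lora_delta(*semantic_lora)
    return [[w + pixel_scale * p + semantic_scale * s for w, p, s in zip(wr, pr, sr)]
            for wr, pr, sr in zip(base, d_pix, d_sem)]

# Hypothetical 2 x 2 base weight and two rank-1 adapters given as (B, A) pairs.
base = [[1.0, 0.0], [0.0, 1.0]]
pixel_lora = ([[1.0], [0.0]], [[0.5, 0.5]])
semantic_lora = ([[0.0], [1.0]], [[0.2, -0.2]])

# Same adapters, different preferences: no re-training, only different scales.
fidelity_first = combine(base, pixel_lora, semantic_lora, pixel_scale=1.0, semantic_scale=0.0)
detail_first = combine(base, pixel_lora, semantic_lora, pixel_scale=1.0, semantic_scale=1.5)
```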

[CV-86] Benchmarking Attention Mechanisms and Consistency Regularization Semi-Supervised Learning for Post-Flood Building Damage Assessment in Satellite Images

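[Quick read]: This paper addresses the fact that neural network designs for post-flood building damage assessment (DA) do not account for how DA differs from change detection (CD). The key points are: 1) because building change features in DA satellite images are subtler, the paper introduces Simple Prior Attention UNet (SPAUNet) to strengthen the model's ability to recognize subtle changes; 2) because DA datasets suffer from data scarcity and label imbalance, it adopts semi-supervised learning (SSL) and builds four combinations of image-level label category reference distributions for consistency training. In supervised experiments SPAUNet reaches a recall of 79.10% and an F1 score of 71.32%, outperforming CD methods. The SSL experiments show that image-level consistency regularization helps the model, and that forming the reference distribution from pseudo-labels works best, which points to the potential of using the category distribution of large amounts of unlabeled data for SSL.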

Link: https://arxiv.org/abs/2412.03015
Authors: Jiaxi Yu,Tomohiro Fukuda,Nobuyoshi Yabuki
Keywords: post-disaster reconstruction planning, building damage assessment, Post-flood building damage, reconstruction planning, critical for rapid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Post-flood building damage assessment is critical for rapid response and post-disaster reconstruction planning. Current research fails to consider the distinct requirements of disaster assessment (DA) from change detection (CD) in neural network design. This paper focuses on two key differences: 1) building change features in DA satellite images are more subtle than in CD; 2) DA datasets face more severe data scarcity and label imbalance. To address these issues, in terms of model architecture, the research explores the benchmark performance of attention mechanisms in post-flood DA tasks and introduces Simple Prior Attention UNet (SPAUNet) to enhance the model’s ability to recognize subtle changes; in terms of semi-supervised learning (SSL) strategies, the paper constructs four different combinations of image-level label category reference distributions for consistent training. Experimental results on flood events of xBD dataset show that SPAUNet performs exceptionally well in supervised learning experiments, achieving a recall of 79.10% and an F1 score of 71.32% for damaged classification, outperforming CD methods. The results indicate the necessity of DA task-oriented model design. SSL experiments demonstrate the positive impact of image-level consistency regularization on the model. Using pseudo-labels to form the reference distribution for consistency training yields the best results, proving the potential of using the category distribution of a large amount of unlabeled data for SSL. This paper clarifies the differences between DA and CD tasks. It preliminarily explores model design strategies utilizing prior attention mechanisms and image-level consistency regularization, establishing new post-flood DA task benchmark methods.
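
The abstract reports recall and F1 but not precision. Since F1 = 2PR / (P + R), the precision implied by those two numbers can be worked out, as in the short sketch below. The resulting value (roughly 0.65) is derived here from the reported figures and is not a number stated by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2 * recall - f1)

# Recall and F1 as reported in the abstract for the damaged class.
recall, f1 = 0.7910, 0.7132

# Derived value, not reported in the abstract.
precision = implied_precision(f1, recall)
```

Recall sitting well above the implied precision suggests the model leans toward flagging buildings as damaged, which is often the safer error in a disaster response setting.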

[CV-87] Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations

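[Quick read]: This paper addresses the generation of multi-view human images from a single view, in particular how to produce realistic body shapes and fine facial details when 3D human datasets are scarce. The key is a framework that takes body and facial representations from a single-view model pretrained on a large-scale human dataset and extends them to a multi-view diffusion model. Concretely, the 2D knowledge of the single-view model is transferred to the multi-view diffusion model, and transferred multimodal facial features are integrated to improve detail restoration. Experiments show that the approach outperforms current state-of-the-art methods on multi-view human synthesis.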

Link: https://arxiv.org/abs/2412.03011
Authors: Yu Feng,Shunsi Zhang,Jian Shu,Hanfeng Zhao,Guoliang Pang,Chi Zhang,Hao Wang
Keywords: Generating multi-view human, Generating multi-view, significant challenge, complex and significant, human
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or capture fine-grained facial details accurately. To address these issues, we propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model’s detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.

[CV-88] AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

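[Quick read]: This paper addresses the robustness of Vision Language Models (VLMs) in dynamic real-world scenarios. The key is AdvDreamer, a framework that systematically evaluates VLM robustness to real-world 3D variations by generating physically reproducible adversarial 3D transformation (Adv-3DT) samples. AdvDreamer combines advanced generative techniques with two innovations: an Inverse Semantic Probability Objective, which runs adversarial optimization in the vision-text alignment space so that the method generalizes, and a Naturalness Reward Model, which provides regularization feedback during adversarial optimization so that generated samples do not drift away from the natural image distribution. The paper also builds MM3DTBench, a dataset for benchmarking VLM robustness under 3D variations.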

Link: https://arxiv.org/abs/2412.03002
Authors: Shouwei Ruan,Hanqin Liu,Yao Huang,Xiaoqi Wang,Caixin Kang,Hang Su,Yinpeng Dong,Xingxing Wei
Keywords: Vision Language Models, Vision Language, remarkable generalization capabilities, remains largely unexplored, exhibited remarkable generalization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures

Abstract:Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs’ robustness to real-world 3D variations, we propose AdvDreamer, the first framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single-view images. AdvDreamer integrates advanced generative techniques with two key innovations and aims to characterize the worst-case distributions of 3D variations from natural images. To ensure adversarial effectiveness and method generality, we introduce an Inverse Semantic Probability Objective that executes adversarial optimization on fundamental vision-text alignment spaces, which can be generalizable across different VLM architectures and downstream tasks. To mitigate the distribution discrepancy between generated and real-world samples while maintaining physical reproducibility, we design a Naturalness Reward Model that provides regularization feedback during adversarial optimization, preventing convergence towards hallucinated and unnatural elements. Leveraging AdvDreamer, we establish MM3DTBench, the first VQA dataset for benchmarking VLMs’ 3D variations robustness. Extensive evaluations on representative VLMs with diverse architectures highlight that 3D variations in the real world may pose severe threats to model performance across various tasks.

[CV-89] QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

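[Quick read]: This paper addresses efficient data processing and robust registration for large-scale point clouds, especially under significant viewpoint changes and occlusions. The key is QuadricsReg, a registration method that represents scenes with concise quadric primitives and uses their geometric characteristics to establish correspondences for 6-DoF transformation estimation. By capturing the primary geometric characteristics of a scene, the quadric representation handles the complexity of large-scale point clouds efficiently. Intrinsic properties of the quadrics, such as type and scale, initialize the correspondences. A multi-level compatibility graph set is then built, and correspondences are found as the maximum clique under geometric consistency. Finally, the 6-DoF transformation is estimated from the quadric correspondences and refined in a factor graph using a quadric degeneracy-aware distance, which gives high accuracy and robustness to degenerate structures.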

Link: https://arxiv.org/abs/2412.02998
Authors: Ji Wu,Huai Yu,Shu Han,Xi-Meng Cai,Ming-Feng Wang,Wen Yang,Gui-Song Xia
Keywords: point cloud registration, processing vast amounts, significant viewpoint variations, large-scale point cloud, efficiently processing vast
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 25 pages, 17 figures

Abstract:In the realm of large-scale point cloud registration, designing a compact symbolic representation is crucial for efficiently processing vast amounts of data, ensuring registration robustness against significant viewpoint variations and occlusions. This paper introduces a novel point cloud registration method, i.e., QuadricsReg, which leverages concise quadrics primitives to represent scenes and utilizes their geometric characteristics to establish correspondences for 6-DoF transformation estimation. As a symbolic feature, the quadric representation fully captures the primary geometric characteristics of scenes, which can efficiently handle the complexity of large-scale point clouds. The intrinsic characteristics of quadrics, such as types and scales, are employed to initialize correspondences. Then we build a multi-level compatibility graph set to find the correspondences using the maximum clique on the geometric consistency between quadrics. Finally, we estimate the 6-DoF transformation using the quadric correspondences, which is further optimized based on the quadric degeneracy-aware distance in a factor graph, ensuring high registration accuracy and robustness against degenerate structures. We test on 5 public datasets and the self-collected heterogeneous dataset across different LiDAR sensors and robot platforms. The exceptional registration success rates and minimal registration errors demonstrate the effectiveness of QuadricsReg in large-scale point cloud registration scenarios. Furthermore, the real-world registration testing on our self-collected heterogeneous dataset shows the robustness and generalization ability of QuadricsReg on different LiDAR sensors and robot platforms. The codes and demos will be released at this https URL.
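
The "compatibility graph plus maximum clique" step is a common pattern in registration and can be sketched independently of quadrics. The toy version below works on point correspondences instead of quadric primitives: two correspondences are compatible when the distance between them is preserved across the two clouds, and the largest mutually compatible set is kept as inliers. The data, tolerance, and function names are illustrative assumptions; the paper's multi-level graphs and quadric-specific consistency checks are not reproduced here.

```python
import math
from itertools import combinations

def compatibility_graph(source, target, tolerance):
    """Connect correspondences i, j whose pairwise distances agree in both clouds."""
    n = len(source)
    edges = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        gap = abs(math.dist(source[i], source[j]) - math.dist(target[i], target[j]))
        if gap < tolerance:
            edges[i].add(j)
            edges[j].add(i)
    return edges

def maximum_clique(edges):
    """Bron-Kerbosch search without pivoting; adequate for small graphs."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = clique
            return
        for v in list(candidates):
            expand(clique + [v], candidates & edges[v], excluded & edges[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(edges), set())
    return sorted(best)

# Correspondence i pairs source[i] with target[i]. The first three are related
# by a pure translation; the fourth is a deliberately wrong match.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
target = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (5.0, 2.0, 0.0), (9.0, 9.0, 9.0)]
inliers = maximum_clique(compatibility_graph(source, target, tolerance=0.1))
```

Only the surviving correspondences would then be passed to the transformation estimator, which is why outlier matches have little influence on the final pose.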

[CV-90] CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D Design Datasets

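[Quick read]: This paper addresses the time-consuming and difficult task of designing and creating 3D objects from scratch. The key is CLAS, a machine learning enhanced framework whose four steps (capture, label, associate, search) enable fully automatic retrieval of 3D objects that match user specifications. CLAS builds on existing 3D object datasets and gives designers an efficient way to benefit from 3D data that would otherwise go unused. It may also be used to produce high-quality 3D object synthesis datasets for training and evaluating 3D generative models. As a proof of concept, the authors present a CLAS-powered search system with a web user interface (UI) for retrieving 6,778 chair objects from the ShapeNet dataset. In a closed-set retrieval setting it achieves a mean reciprocal rank (MRR) of 0.58, a top-1 accuracy of 42.27%, and a top-10 accuracy of 89.64%.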

Link: https://arxiv.org/abs/2412.02996
Authors: XiuYu Zhang,Xiaolei Ye,Jui-Che Chang,Yue Fang
Keywords: objects, Three-dimensional, CLAS, wide applications, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:Three-dimensional (3D) objects have wide applications. Despite the growing interest in 3D modeling in academia and industries, designing and/or creating 3D objects from scratch remains time-consuming and challenging. With the development of generative artificial intelligence (AI), designers discover a new way to create images for ideation. However, generative AIs are less useful in creating 3D objects with satisfying qualities. To allow 3D designers to access a wide range of 3D objects for creative activities based on their specific demands, we propose a machine learning (ML) enhanced framework CLAS, named after the four steps of capture, label, associate, and search, to enable fully automatic retrieval of 3D objects based on user specifications leveraging the existing datasets of 3D objects. CLAS provides an effective and efficient method for any person or organization to benefit from their existing but not utilized 3D datasets. In addition, CLAS may also be used to produce high-quality 3D object synthesis datasets for training and evaluating 3D generative models. As a proof of concept, we created and showcased a search system with a web user interface (UI) for retrieving 6,778 3D objects of chairs in the ShapeNet dataset powered by CLAS. In a closed-set retrieval setting, our retrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy of 42.27%, and top 10 accuracy of 89.64%.
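
The retrieval metrics quoted above are easy to state in code. In this sketch each entry of `ranks` is the 1-based position at which the correct object appeared for one query. The rank values are invented for illustration and are unrelated to the paper's results.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct item over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_k_accuracy(ranks, k):
    """Fraction of queries whose correct item appears within the first k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct object for five queries.
ranks = [1, 2, 4, 1, 12]
mrr = mean_reciprocal_rank(ranks)
top1 = top_k_accuracy(ranks, k=1)
top10 = top_k_accuracy(ranks, k=10)
```

MRR rewards placing the correct item near the top, so a system can have a modest top-1 accuracy and still score well if most correct items land in the first few positions.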

[CV-91] EchoONE: Segmenting Multiple echocardiography Planes in One Model

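[Quick read]: This paper addresses a problem in echocardiography: because heart structures differ markedly between planes, an AI model has to be developed separately for each plane. This is referred to as the multi-plane segmentation (MPS) problem. The key is EchoONE, which combines a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN branch with a local feature fusion and adaption (LFFA) module for adapting SAM. According to the authors, this is the first time the MPS problem for echocardiography data has been solved in a single model, with state-of-the-art performance on multiple internal and external datasets.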

Link: https://arxiv.org/abs/2412.02993
Authors: Jiongtong Hu,Wei Zhuo,Jun Cheng,Yingying Liu,Wufeng Xue,Dong Ni
Keywords: required in screening, diagnosis and treatment, cardiac disease, clinical practice, treatment of cardiac
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In clinical practice of echocardiography examinations, multiple planes containing the heart structures of different views are usually required in screening, diagnosis and treatment of cardiac disease. AI models for echocardiography have to be tailored for each specific plane due to the dramatic structure differences, thus resulting in repeated development and extra complexity. An effective solution for such a multi-plane segmentation (MPS) problem is highly demanded for medical images, yet has not been well investigated. In this paper, we propose a novel solution, EchoONE, for this problem with a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN-branch with a simple yet effective local feature fusion and adaption (LFFA) module for SAM adapting. We extensively evaluated our method on multiple internal and external echocardiography datasets, and achieved consistently state-of-the-art performance for multi-source datasets with different heart planes. This is the first time that the MPS problem is solved in one model for echocardiography data. The code will be available at this https URL.

[CV-92] Is Foreground Prototype Sufficient? Few-Shot Medical Image Segmentation with Background-Fused Prototype

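[Quick read]: This paper addresses Few-shot Semantic Segmentation (FSS) in medical images, where background and foreground share many visual features and conventional methods struggle to characterize the background. The key is a new pluggable Background-fused prototype (Bro) approach. Bro describes the background in detail through two core designs. First, Feature Similarity Calibration (FeaC) reduces noise in the support image with feature cross-attention against the query image. Second, Hierarchical Channel Adversarial Attention (HiCA) fuses the background from coarse to fine using a channel group-based attention mechanism with an adversarial Mean-Offset structure. These designs let Bro represent the background of medical images more completely and bring significant gains when paired with existing state-of-the-art methods.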

Link: https://arxiv.org/abs/2412.02983
Authors: Song Tang,Chunxiao Zu,Wenxin Su,Yuan Dong,Mao Ye,Yan Gan,Xiatian Zhu
Keywords: Few-shot Semantic Segmentation, Few-shot Semantic, Semantic Segmentation, single labeled training, labeled training sample
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-shot Semantic Segmentation (FSS) aims to adapt a pre-trained model to new classes with as few as a single labeled training sample per class. The existing prototypical work used in natural image scenarios focuses in a biased way on capturing foreground’s discrimination while employing a simplistic representation for background, grounded on the inherent observation separation between foreground and background. However, this paradigm is not applicable to medical images where the foreground and background share numerous visual features, necessitating a more detailed description for background. In this paper, we present a new pluggable Background-fused prototype (Bro) approach for FSS in medical images. Instead of finding a commonality of background subjects in the support image, Bro incorporates this background with two pivot designs. Specifically, Feature Similarity Calibration (FeaC) initially reduces noise in the support image by employing feature cross-attention with the query image. Subsequently, Hierarchical Channel Adversarial Attention (HiCA) merges the background into comprehensive prototypes. We achieve this by a channel groups-based attention mechanism, where an adversarial Mean-Offset structure encourages a coarse-to-fine fusion. Extensive experiments show that previous state-of-the-art methods, when paired with Bro, experience significant performance improvements. This demonstrates a more integrated way to represent backgrounds specifically for medical images.
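For context, the prototype baseline that the abstract contrasts itself with can be sketched roughly as follows. This is not Bro, FeaC, or HiCA; it is a minimal illustration of the generic idea of masked average pooling plus nearest-prototype labelling, and every function name and the toy data are assumptions made for this example.

```python
import math

def masked_average(features, mask, target):
    """Average the feature vectors of all pixels whose mask value equals target.

    Assumes at least one pixel carries the target label.
    """
    selected = [f for f, m in zip(features, mask) if m == target]
    dim = len(features[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def segment(query_features, fg_proto, bg_proto):
    """Label each query pixel 1 (foreground) or 0 (background) by the closer prototype."""
    return [1 if cosine(q, fg_proto) > cosine(q, bg_proto) else 0
            for q in query_features]

# Toy support image: four "pixels" with 2-D features and a binary mask (hypothetical data).
support_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
support_mask = [1, 1, 0, 0]
fg_proto = masked_average(support_features, support_mask, 1)
bg_proto = masked_average(support_features, support_mask, 0)

query_features = [[0.8, 0.1], [0.1, 0.9]]
prediction = segment(query_features, fg_proto, bg_proto)
```

A single averaged background prototype like `bg_proto` is the kind of "simplistic representation for background" the abstract argues is insufficient for medical images.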

[CV-93] Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch

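[Quick read]: This paper addresses multi-organ, multi-class cell semantic segmentation, particularly when cell sizes and shapes differ only subtly. The key to the solution is MONCH (Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH), which takes vision-language input and extracts features efficiently with a single-branch architecture. Specifically, MONCH uses a hierarchical feature extraction mechanism that extracts high-frequency, convolutional, and topological features from coarse to fine, so that cells of different shapes can be segmented. A progressive prompt decoder then harmonizes the multimodal information, integrating features from fine to coarse granularity to better capture context. Experiments on the PanNuke dataset, which has significant class imbalance and subtle variations in cell size and shape, show that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models.

Link: https://arxiv.org/abs/2412.02978
Authors: Qing Zhang, Hang Guo, Siyuan Yang, Qingli Li, Yan Wang
Keywords: Pathological cell semantic, Pathological cell, computational pathology, essential for applications, effective treatment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: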

Abstract:Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications like cancer diagnosis and effective treatment. Given that multiple cell types exist across various organs, with subtle differences in cell size and shape, multi-organ, multi-class cell segmentation is particularly challenging. Most existing methods employ multi-branch frameworks to enhance feature extraction, but often result in complex architectures. Moreover, reliance on visual information limits performance in multi-class analysis due to intricate textural details. To address these challenges, we propose a Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision-language input. Specifically, we design a hierarchical feature extraction mechanism to provide coarse-to-fine-grained features for segmenting cells of various shapes, including high-frequency, convolutional, and topological features. Inspired by the synergy of textual and multi-grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle cell size and shape variations, demonstrate that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models. Codes and implementations will be made publicly available.

[CV-94] Stain-aware Domain Alignment for Imbalance Blood Cell Classification

[Quick read]: This paper addresses domain shift and data imbalance in blood cell identification, two problems that are common in real-world blood cell image datasets and that reduce identification accuracy. The key to the solution is SADA, a new blood cell classification method based on stain-aware domain alignment. SADA mines domain-invariant features through: 1) a stain-based augmentation approach; 2) a local alignment constraint; 3) a domain-invariant supervised contrastive learning strategy. In addition, SADA decouples training into two stages, domain-invariant feature learning and classification training, which alleviates data imbalance. Experiments on four public blood cell datasets and one private dataset show that SADA sets a new state of the art, outperforming existing methods by a large margin.

Link: https://arxiv.org/abs/2412.02976
Authors: Yongcheng Li, Lingcong Cai, Ying Lu, Xianghua Fu, Xiao Han, Ma Li, Wenxing Lai, Xiangzhong Zhang, Xiaomao Fan
Keywords: Blood cell identification, Blood cell, blood-related diseases, cell identification, critical for hematological
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Blood cell identification is critical for hematological analysis as it aids physicians in diagnosing various blood-related diseases. In real-world scenarios, blood cell image datasets often present the issues of domain shift and data imbalance, posing challenges for accurate blood cell identification. To address these issues, we propose a novel blood cell classification method termed SADA via stain-aware domain alignment. The primary objective of this work is to mine domain-invariant features in the presence of domain shifts and data imbalances. To accomplish this objective, we propose a stain-based augmentation approach and a local alignment constraint to learn domain-invariant features. Furthermore, we propose a domain-invariant supervised contrastive learning strategy to capture discriminative features. We decouple the training process into two stages of domain-invariant feature learning and classification training, alleviating the problem of data imbalance. Experimental results on four public blood cell datasets and a private real dataset collected from the Third Affiliated Hospital of Sun Yat-sen University demonstrate that SADA can achieve a new state-of-the-art baseline, which is superior to the existing cutting-edge methods by a large margin. The source code is available at this https URL.

[CV-95] MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

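[Quick read]: This paper addresses inaccuracies in medical reports, specifically radiology reports, whether written by humans or generated by machine learning algorithms. The key to the solution is an image-conditioned autocorrection framework that identifies and corrects errors in a report. The authors first deliberately introduce a diverse range of errors into reports, then use a two-stage framework to locate those errors and correct them, simulating an autocorrection process. The method targets the factual errors and incorrect conclusions of existing automated medical reporting systems, with the aim of making reports more reliable and trustworthy in critical healthcare applications. Experiments indicate that the method has potential for correcting medical reporting errors.

Link: https://arxiv.org/abs/2412.02971
Authors: Arnold Caleb Asiimwe, Dídac Surís, Pranav Rajpurkar, Carl Vondrick
Keywords: machine learning algorithms, learning algorithms, generated by humans, humans or machine, machine learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: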

Abstract:In medical reporting, the accuracy of radiological reports, whether generated by humans or machine learning algorithms, is critical. We tackle a new task in this paper: image-conditioned autocorrection of inaccuracies within these reports. Using the MIMIC-CXR dataset, we first intentionally introduce a diverse range of errors into reports. Subsequently, we propose a two-stage framework capable of pinpointing these errors and then making corrections, simulating an autocorrection process. This method aims to address the shortcomings of existing automated medical reporting systems, like factual errors and incorrect conclusions, enhancing report reliability in vital healthcare applications. Importantly, our approach could serve as a guardrail, ensuring the accuracy and trustworthiness of automated report generation. Experiments on established datasets and state of the art report generation models validate this method’s potential in correcting medical reporting errors.

[CV-96] Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

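[Quick read]: This paper addresses the slow inference of diffusion models when generating high-resolution images. The key to the solution is Partially Conditioned Patch Parallelism (PCPP), which conditions each patch on only parts of its neighboring patches at every diffusion step, reducing both computation and inter-device communication. Compared with DistriFusion, PCPP cuts communication cost by about 70% and achieves a 2.36 to 8.02× inference speed-up, depending on the device configuration and generation resolution, at the cost of a possible decrease in image quality.

Link: https://arxiv.org/abs/2412.02962
Authors: XiuYu Zhang, Zening Luo, Michelle E. Lu
Keywords: exhibited exciting capabilities, Diffusion models, Conditioned Patch Parallelism, video creation, exhibited exciting
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: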

Abstract:Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases. The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations and thus have become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require small latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around 70% compared to DistriFusion (the state of the art implementation of PP) and achieves 2.36∼8.02× inference speed-up using 4∼8 GPUs compared to 2.32∼6.71× achieved by DistriFusion depending on the computing device configuration and resolution of generation at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.
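To make the difference between conditioning on the whole previous-step image and conditioning on only part of the neighbouring patches concrete, here is a rough bookkeeping sketch. It does not reproduce PCPP or DistriFusion: the row-wise split, the `halo` parameter, and all function names are assumptions for illustration, and the resulting number is a property of this toy setup, not of the paper.

```python
def patch_ranges(height, num_devices):
    """Split rows [0, height) into contiguous, near-equal patches, one per device."""
    base, extra = divmod(height, num_devices)
    ranges, start = [], 0
    for d in range(num_devices):
        end = start + base + (1 if d < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def context_range(own, height, halo):
    """Rows a device conditions on: its own patch plus `halo` rows of each neighbour."""
    start, end = own
    return (max(0, start - halo), min(height, end + halo))

def rows_received(own, context):
    """Rows that must come from other devices for one diffusion step."""
    return (context[1] - context[0]) - (own[1] - own[0])

# Hypothetical configuration: 128 latent rows, 4 devices, 8-row halo.
height, num_devices, halo = 128, 4, 8
owned = patch_ranges(height, num_devices)

# Partial conditioning: only the halo rows are fetched from neighbours.
partial = [rows_received(o, context_range(o, height, halo)) for o in owned]
# Full conditioning: every row owned by another device is fetched.
full = [height - (o[1] - o[0]) for o in owned]

saving = 1 - sum(partial) / sum(full)
```

In this toy configuration the edge devices fetch 8 rows and the inner devices 16, against 96 rows each under full conditioning. How much context can be dropped before image quality suffers is the trade-off the abstract describes.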

[CV-97] Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution

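[Quick read]: This paper addresses two main problems in real-world image super-resolution (Real-ISR): existing methods sometimes fail to recognize salient objects, which leads to inaccurate semantic restoration in those regions; and the same region may respond strongly to more than one text prompt, which leads to semantic ambiguity. The key to the solution is to introduce semantic segmentation as an additional control condition for diffusion-based image super-resolution. By assigning a class label to every pixel, semantic segmentation gives a more comprehensive perception of salient objects, and by explicitly allocating objects to their spatial regions it mitigates the risk of semantic ambiguity. Concretely, the paper proposes SegSR, a dual-diffusion framework whose Dual-Modality Bridge module enables updated information to flow between the super-resolution and segmentation diffusion models, so that the two benefit each other during the reverse diffusion process.

Link: https://arxiv.org/abs/2412.02960
Authors: Jiahua Xiao, Jiawei Zhang, Dongqing Zou, Xiaodan Zhang, Jimmy Ren, Xing Wei
Keywords: Real-world image super-resolution, image super-resolution, Real-world image, leveraging large-scale, achieved a remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: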

Abstract:Real-world image super-resolution (Real-ISR) has achieved a remarkable leap by leveraging large-scale text-to-image models, enabling realistic image restoration from given recognition textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may have a strong response to more than one prompt and it will lead to semantic ambiguity for image super-resolution. To alleviate the above two issues, in this paper, we propose to consider semantic segmentation as an additional control condition into diffusion-based image super-resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning class labels to each pixel. It also mitigates the risks of semantic ambiguities by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super-resolution and segmentation can benefit each other, we propose SegSR which introduces a dual-diffusion framework to facilitate interaction between the image super-resolution and segmentation diffusion models. Specifically, we develop a Dual-Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.

[CV-98] An indoor DSO-based ceiling-vision odometry system for indoor industrial environments

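[Quick read]: This paper addresses the reliability and robustness of localization for autonomous mobile robots in indoor industrial environments, in particular the limitations of traditional Visual Odometry (VO) when the scene contains dynamic objects. The key to the solution is Ceiling-DSO, a ceiling-vision system based on Direct Sparse Odometry (DSO). Ceiling-DSO uses an upward-facing camera to track the robot's motion relative to the ceiling, which is static and consistent, so the estimate is not disturbed by dynamic objects in the scene. The system makes no assumptions about specific shapes or landmarks on the ceiling, so it applies to various ceiling types. The authors tune the DSO parameters for online pose estimation and evaluate on a self-collected real-world dataset, reporting acceptable error rates compared to ground truth.

Link: https://arxiv.org/abs/2412.02950
Authors: Abdelhak Bougouffa, Emmanuel Seignez, Samir Bouaziz, Florian Gardes
Keywords: Autonomous Mobile Robots, Mobile Robots operating, Autonomous Mobile, indoor industrial environments, industrial environments require
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: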

Abstract:Autonomous Mobile Robots operating in indoor industrial environments require a localization system that is reliable and robust. While Visual Odometry (VO) can offer a reasonable estimation of the robot’s state, traditional VO methods encounter challenges when confronted with dynamic objects in the scene. Alternatively, an upward-facing camera can be utilized to track the robot’s movement relative to the ceiling, which represents a static and consistent space. We introduce in this paper Ceiling-DSO, a ceiling-vision system based on Direct Sparse Odometry (DSO). Unlike other ceiling-vision systems, Ceiling-DSO takes advantage of the versatile formulation of DSO, avoiding assumptions about observable shapes or landmarks on the ceiling. This approach ensures the method’s applicability to various ceiling types. Since no publicly available dataset for ceiling-vision exists, we created a custom dataset in a real-world scenario and employed it to evaluate our approach. By adjusting DSO parameters, we identified the optimal fit for online pose estimation, resulting in acceptable error rates compared to ground truth. We provide in this paper a qualitative and quantitative analysis of the obtained results.

[CV-99] Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis WACV2025

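[Quick read]: This paper addresses hallucination in large vision-language models (LVLMs) in real-world use, where the model generates visual elements that do not exist and thereby erodes user trust. The key to the solution is to identify and intervene on the hidden factors that induce hallucination, such as objects, contexts, and semantic foreground-background structures. The paper proposes a new causal approach, a hallucination probing system, which analyzes the causality between images, text prompts, and network saliency and systematically explores interventions that block these hidden factors. Experiments show that a simple technique based on this analysis can significantly reduce hallucinations, and the analysis suggests that editing network internals could minimize hallucinated outputs.

Link: https://arxiv.org/abs/2412.02946
Authors: Po-Hsuan Huang, Jeng-Lin Li, Chin-Po Chen, Ming-Ching Chang, Wei-Chao Chen
Keywords: large vision-language models, alongside natural language, inputs alongside natural, Recent advancements, comprehend visual inputs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by WACV2025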

Abstract:Recent advancements in large vision-language models (LVLM) have significantly enhanced their ability to comprehend visual inputs alongside natural language. However, a major challenge in their real-world application is hallucination, where LVLMs generate non-existent visual elements, eroding user trust. The underlying mechanism driving this multimodal hallucination is poorly understood. Minimal research has illuminated whether contexts such as sky, tree, or grass field involve the LVLM in hallucinating a frisbee. We hypothesize that hidden factors, such as objects, contexts, and semantic foreground-background structures, induce hallucination. This study proposes a novel causal approach: a hallucination probing system to identify these hidden factors. By analyzing the causality between images, text prompts, and network saliency, we systematically explore interventions to block these factors. Our experimental findings show that a straightforward technique based on our analysis can significantly reduce hallucinations. Additionally, our analyses indicate the potential to edit network internals to minimize hallucinated outputs.

[CV-100] Video LLMs for Temporal Reasoning in Long Videos

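[Quick read]: This paper addresses effective temporal reasoning and fine-grained understanding in long videos. The key to the solution is TemporalVLM, a video large language model whose visual encoder maps a long video into time-aware features containing both local and global cues. The input video is first divided into short-term clips, which are encoded jointly with their timestamps into time-sensitive local features; the local features are then passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. The resulting time-aware, multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. The paper also introduces IndustryASM, a large-scale long-video dataset, for evaluating TemporalVLM. Experiments show that TemporalVLM outperforms previous methods on temporal reasoning and fine-grained understanding tasks.

Link: https://arxiv.org/abs/2412.02930
Authors: Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran
Keywords: large language model, language model capable, video large language, paper introduces TemporalVLM, paper introduces
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: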

Abstract:This paper introduces TemporalVLM, a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos. At the core, our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory module for global feature aggregation. The extracted time-aware and multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, which consists of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments on datasets of long videos, including TimeIT and IndustryASM, show that TemporalVLM achieves superior performance than previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
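The first step described above, dividing a long video into short-term clips that carry their timestamps, might look roughly like the sketch below. It covers only the bookkeeping; the clip length, the function name, and the tuple layout are assumptions for illustration, and the paper's encoder and BiLSTM aggregation are not shown.

```python
def split_into_clips(num_frames, fps, clip_seconds):
    """Return (start_frame, end_frame, start_time, end_time) for each short-term clip.

    end_frame is exclusive; the last clip may be shorter than clip_seconds.
    """
    frames_per_clip = max(1, int(round(fps * clip_seconds)))
    clips = []
    for start in range(0, num_frames, frames_per_clip):
        end = min(start + frames_per_clip, num_frames)
        clips.append((start, end, start / fps, end / fps))
    return clips

# Hypothetical 10-second video at 25 fps, cut into 4-second clips.
clips = split_into_clips(num_frames=250, fps=25, clip_seconds=4)
```

Each tuple pairs a frame range with its start and end time in seconds, which is the kind of timestamp information a time-aware encoder could consume alongside the frames.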

[CV-101] Panoptic Diffusion Models: co-generation of images and segmentation maps

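[Quick read]: This paper addresses the inability of existing diffusion models to generate an image and its panoptic segmentation map simultaneously from a text prompt. The key to the solution is the Panoptic Diffusion Model (PDM), the first model designed to generate images and panoptic segmentation maps concurrently. PDM constructs detailed segmentation layouts that provide built-in guidance throughout generation, which ensures that the categories mentioned in the text prompt are included and enriches the diversity of segments in the background. The model is validated on two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. PDM also incorporates a fast diffusion solver to reduce the number of sampling steps, and when ground-truth maps are available it can act as a text-guided image-to-image generation model. Finally, the paper proposes a new metric for the quality of generated maps and shows that PDM achieves state-of-the-art results in image generation with implicit scene control.

Link: https://arxiv.org/abs/2412.02929
Authors: Yinghan Long, Kaushik Roy
Keywords: demonstrated impressive capabilities, diffusion models, demonstrated impressive, impressive capabilities, PDM
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: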

Abstract:Recently, diffusion models have demonstrated impressive capabilities in text-guided and image-conditioned image generation. However, existing diffusion models cannot simultaneously generate a segmentation map of objects and a corresponding image from the prompt. Previous attempts either generate segmentation maps based on the images or provide maps as input conditions to control image generation, limiting their functionality to given inputs. Incorporating an inherent understanding of the scene layouts can improve the creativity and realism of diffusion models. To address this limitation, we present Panoptic Diffusion Model (PDM), the first model designed to generate both images and panoptic segmentation maps concurrently. PDM bridges the gap between image and text by constructing segmentation layouts that provide detailed, built-in guidance throughout the generation process. This ensures the inclusion of categories mentioned in text prompts and enriches the diversity of segments within the background. We demonstrate the effectiveness of PDM across two architectures: a unified diffusion transformer and a two-stream transformer with a pretrained backbone. To facilitate co-generation with fewer sampling steps, we incorporate a fast diffusion solver into PDM. Additionally, when ground-truth maps are available, PDM can function as a text-guided image-to-image generation model. Finally, we propose a novel metric for evaluating the quality of generated maps and show that PDM achieves state-of-the-art results in image generation with implicit scene control.

[CV-102] ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

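[Quick read]: This paper addresses how to combine 3D shape guidance with text descriptions in image synthesis. The key to the solution is ShapeWords, which embeds target 3D shape information in specialized tokens alongside the input text, blending 3D shape awareness with textual context to guide synthesis. Unlike conventional shape guidance methods that rely on depth maps from fixed viewpoints, ShapeWords generates diverse yet consistent images that reflect both the geometry of the target shape and the text description, making the images more text-compliant and aesthetically plausible while maintaining 3D shape awareness.

Link: https://arxiv.org/abs/2412.02912
Authors: Dmitry Petrov, Pradyumn Goyal, Divyansh Shivashok, Yuanming Tao, Melinos Averkiou, Evangelos Kalogerakis
Keywords: synthesizing images based, approach for synthesizing, text prompts, shape, synthesizing images
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project webpage: this https URL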

Abstract:We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape’s geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant, aesthetically plausible, while also maintaining 3D shape awareness.

[CV-103] EgoCast: Forecasting Egocentric Human Pose in the Wild

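[Quick read]: This paper addresses accurate estimation and forecasting of human body pose for Augmented Reality. The key to the solution is EgoCast, a bimodal method for 3D human pose forecasting that uses egocentric videos and proprioceptive data. EgoCast introduces a current-frame estimation module that generates pseudo-groundtruth poses at inference time, removing the need for the past groundtruth poses that current methods typically require during forecasting. The experiments on the Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation, and on the Ego-Exo4D Body Pose 2024 Challenge the method significantly outperforms state-of-the-art approaches.

Link: https://arxiv.org/abs/2412.02903
Authors: Maria Escobar, Juanita Puentes, Cristhian Forigua, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, Pablo Arbelaez
Keywords: Augmented Reality, Accurately estimating, immersion in Augmented, human pose forecasting, human pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: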

Abstract:Accurately estimating and forecasting human body pose is important for enhancing the user’s sense of immersion in Augmented Reality. Addressing this need, our paper introduces EgoCast, a bimodal method for 3D human pose forecasting using egocentric videos and proprioceptive data. We study the task of human pose forecasting in a realistic setting, extending the boundaries of temporal forecasting in dynamic scenes and building on the current framework for current pose estimation in the wild. We introduce a current-frame estimation module that generates pseudo-groundtruth poses for inference, eliminating the need for past groundtruth poses typically required by current methods during forecasting. Our experimental results on the recent Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation. On the Ego-Exo4D Body Pose 2024 Challenge, our method significantly outperforms the state-of-the-art approaches, laying the groundwork for future research in human pose estimation and forecasting in unscripted activities with egocentric inputs.

[CV-104] GUESS: Generative Uncertainty Ensemble for Self Supervision

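[Quick read]: This paper addresses the inefficiency and potential harm of forcing invariance to data augmentations in self-supervised learning (SSL) frameworks. The key to the solution is to incorporate uncertainty representation into both the loss function and the architecture design, so that invariance is enforced in a more data-dependent way. Specifically, the paper proposes GUESS, a pseudo-whitening framework composed of controlled uncertainty injection, a new architecture, and a new loss function; it combines a generative-discriminative loss with an ensemble that is fed slightly different distorted versions of the samples, in order to learn more robust and effective representations. The experiments and ablations establish GUESS as a new baseline.

Link: https://arxiv.org/abs/2412.02896
Authors: Salman Mohamadi, Gianfranco Doretto, Donald A. Adjeroh
Keywords: loss function, Self-supervised learning, SSL loss function, consist of pretext, learn useful general
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: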

Abstract:Self-supervised learning (SSL) frameworks consist of pretext task, and loss function aiming to learn useful general features from unlabeled data. The basic idea of most SSL baselines revolves around enforcing the invariance to a variety of data augmentations via the loss function. However, one main issue is that, inattentive or deterministic enforcement of the invariance to any kind of data augmentation is generally not only inefficient, but also potentially detrimental to performance on the downstream tasks. In this work, we investigate the issue from the viewpoint of uncertainty in invariance representation. Uncertainty representation is fairly under-explored in the design of SSL architectures as well as loss functions. We incorporate uncertainty representation in both loss function as well as architecture design aiming for more data-dependent invariance enforcement. The former is represented in the form of data-derived uncertainty in SSL loss function resulting in a generative-discriminative loss function. The latter is achieved by feeding slightly different distorted versions of samples to the ensemble aiming for learning better and more robust representation. Specifically, building upon the recent methods that use hard and soft whitening (a.k.a redundancy reduction), we introduce a new approach GUESS, a pseudo-whitening framework, composed of controlled uncertainty injection, a new architecture, and a new loss function. We include detailed results and ablation analysis establishing GUESS as a new baseline.

[CV-105] EvRT-DETR: The Surprising Effectiveness of DETR-based Detection for Event Cameras

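[Quick read]: This paper addresses object detection for event-based cameras (EBCs). Compared with traditional cameras, event cameras, which capture per-pixel brightness-change events, offer advantages in power efficiency, temporal resolution, and dynamic range. However, the sparse and asynchronous nature of their data makes it challenging to develop image analysis methods for them.

The key to the solution is to combine the Real-Time DEtection TRansformer (RT-DETR), a state-of-the-art natural image detector, with a simple image-like representation of the EBC data, achieving strong performance without complex data representations or specialized architectures. Specifically, the paper shows that an RT-DETR model properly trained on EBC data performs comparably to the most advanced EBC object detection methods. It then proposes a low-rank adaptation (LoRA)-inspired way to augment RT-DETR so that it handles the temporal dynamics of the data. The resulting EvRT-DETR model surpasses the current state-of-the-art results on the standard benchmark datasets Gen1 and Gen4 while using only standard modules from natural image and video analysis. This indicates that effective EBC object detection can be achieved by carefully adapting mainstream object detection architectures, without specialized architectural engineering.

Link: https://arxiv.org/abs/2412.02890
Authors: Dmitrii Torbunov, Yihui Ren, Animesh Ghose, Odera Dim, Yonggang Cui
Keywords: EBC object detection, Event-based cameras, EBC object, high dynamic range, EBC data achieves
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: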

Abstract:Event-based cameras (EBCs) have emerged as a bio-inspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, the development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for the EBC cameras. The current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. Here, we demonstrate that the combination of a Real-Time DEtection TRansformer, or RT-DETR, a state-of-the-art natural image detector, with a simple image-like representation of the EBC data achieves remarkable performance, surpassing current state-of-the-art results. Specifically, we show that a properly trained RT-DETR model on the EBC data achieves performance comparable to the most advanced EBC object detection methods. Next, we propose a low-rank adaptation (LoRA)-inspired way to augment the RT-DETR model to handle temporal dynamics of the data. The designed EvRT-DETR model outperforms the current, most advanced results on standard benchmark datasets Gen1 (mAP +2.3) and Gen4 (mAP +1.4) while only using standard modules from natural image and video analysis. These results demonstrate that effective EBC object detection can be achieved through careful adaptation of mainstream object detection architectures without requiring specialized architectural engineering. The code is available at: this https URL
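As background for the "LoRA-inspired" wording, the generic low-rank adaptation idea is to keep a weight matrix W frozen and learn a low-rank update, giving W + (alpha / r) * B A. The sketch below shows only that generic formula on plain Python lists; it is not the EvRT-DETR temporal module, and the function names and toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # same shape as W
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Frozen 2x3 weight with a rank-1 update: B is 2x1, A is 1x3 (toy values).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, 1.0]]
adapted = lora_weight(w, a, b, alpha=2.0)
```

Only A and B would be trained, so the update adds r * (d_in + d_out) parameters rather than d_in * d_out. In common LoRA setups B starts at zero so that the adapted weight initially equals W.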

[CV-106] Patchfinder: Leveraging Visual Language Models for Accurate Information Retrieval using Model Uncertainty WACV

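[Quick read]: This paper addresses efficient information extraction from large volumes of noisy scanned documents. The key to the solution is PatchFinder, an algorithm built on Vision Language Models (VLMs). It defines a confidence score, Patch Confidence, based on the Maximum Softmax Probability of the VLM's output, to measure the model's confidence in its predictions. PatchFinder uses this score to determine a suitable patch size, partitions the input document into overlapping patches of that size, and generates confidence-based predictions of the target information. Experiments show that PatchFinder, using Phi-3v (a 4.2-billion-parameter vision language model), reaches 94% accuracy on a dataset of 190 noisy scanned documents, 18.5 percentage points higher than ChatGPT-4o.

Link: https://arxiv.org/abs/2412.02886
Authors: Roman Colman, Minh Vu, Manish Bhattarai, Martin Ma, Hari Viswanathan, Daniel O’Malley, Javier E. Santos
Keywords: record vast amounts, vision language models, language models, vision language, corporations and governments
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025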

Abstract:For decades, corporations and governments have relied on scanned documents to record vast amounts of information. However, extracting this information is a slow and tedious process due to the overwhelming amount of documents. The rise of vision language models presents a way to efficiently and accurately extract the information out of these documents. The current automated workflow often requires a two-step approach involving the extraction of information using optical character recognition software, and subsequent usage of large language models for processing this information. Unfortunately, these methods encounter significant challenges when dealing with noisy scanned documents. The high information density of such documents often necessitates using computationally expensive language models to effectively reduce noise. In this study, we propose PatchFinder, an algorithm that builds upon Vision Language Models (VLMs) to address the information extraction task. First, we devise a confidence-based score, called Patch Confidence, based on the Maximum Softmax Probability of the VLMs’ output to measure the model’s confidence in its predictions. Then, PatchFinder utilizes that score to determine a suitable patch size, partition the input document into overlapping patches of that size, and generate confidence-based predictions for the target information. Our experimental results show that PatchFinder can leverage Phi-3v, a 4.2 billion parameter vision language model, to achieve an accuracy of 94% on our dataset of 190 noisy scanned documents, surpassing the performance of ChatGPT-4o by 18.5 percentage points. 
ACM classes: F.2.2, I.2.7
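The two ingredients named in the abstract, a confidence score derived from the Maximum Softmax Probability and a partition into overlapping patches, can be sketched as follows. This is a minimal illustration, not the PatchFinder implementation: averaging the per-token maximum probability, the 1-D windowing along the page, and all function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def patch_confidence(token_logits):
    """Mean over output tokens of the maximum softmax probability.

    The mean is an assumed aggregation; the abstract only says the score is
    based on the Maximum Softmax Probability.
    """
    return sum(max(softmax(l)) for l in token_logits) / len(token_logits)

def overlapping_patches(length, patch_size, overlap):
    """Return (start, end) windows of patch_size that cover [0, length)."""
    stride = patch_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than patch_size")
    starts = list(range(0, max(length - patch_size, 0) + 1, stride))
    if starts[-1] + patch_size < length:
        starts.append(length - patch_size)  # extra window so the tail is covered
    return [(s, min(s + patch_size, length)) for s in starts]

def best_prediction(candidates):
    """Pick the prediction text from the (text, confidence) pair with highest confidence."""
    return max(candidates, key=lambda c: c[1])[0]

# Hypothetical page 1000 pixels tall, 400-pixel patches overlapping by 100 pixels.
windows = overlapping_patches(1000, 400, 100)
```

In use, a model would be run once per window, `patch_confidence` would score each output, and `best_prediction` would keep the most confident answer. How PatchFinder actually chooses the patch size from the score is described in the paper and is not reproduced here.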

[CV-107] Pairwise Spatiotemporal Partial Trajectory Matching for Co-movement Analysis

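[Quick read]: This paper addresses the limited interpretability of traditional methods for spatiotemporal pairwise movement analysis and their difficulty in capturing partial matches. The key to the solution is a novel pairwise spatiotemporal partial trajectory matching method that transforms tabular spatiotemporal data into interpretable trajectory images based on specified time windows, which allows partial trajectory analysis. The method consists of trajectory localization, a spatial overlap check, and pairwise matching with a Siamese Neural Network. Evaluated on a co-walking classification task, it identifies co-behavior effectively and surpasses established methods, with an F1-score of up to 0.73. The paper also explores the method's utility for pair routine pattern analysis in real-world scenarios, providing insights into the frequency, timing, and duration of shared behaviors.

Link: https://arxiv.org/abs/2412.02879
Authors: Maria Cardei, Sabit Ahmed, Gretchen Chapman, Afsaneh Doryab
Keywords: specific time frames, movement analysis involves, analysis involves identifying, involves identifying shared, identifying shared geographic-based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In submission. 17 pages, 5 figures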

Abstract:Spatiotemporal pairwise movement analysis involves identifying shared geographic-based behaviors between individuals within specific time frames. Traditionally, this task relies on sequence modeling and behavior analysis techniques applied to tabular or video-based data, but these methods often lack interpretability and struggle to capture partial matching. In this paper, we propose a novel method for pairwise spatiotemporal partial trajectory matching that transforms tabular spatiotemporal data into interpretable trajectory images based on specified time windows, allowing for partial trajectory analysis. This approach includes localization of trajectories, checking for spatial overlap, and pairwise matching using a Siamese Neural Network. We evaluate our method on a co-walking classification task, demonstrating its effectiveness in a novel co-behavior identification application. Our model surpasses established methods, achieving an F1-score up to 0.73. Additionally, we explore the method’s utility for pair routine pattern analysis in real-world scenarios, providing insights into the frequency, timing, and duration of shared behaviors. This approach offers a powerful, interpretable framework for spatiotemporal behavior analysis, with potential applications in social behavior research, urban planning, and healthcare.
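The first two stages named in the abstract, turning tabular records into a trajectory image for a time window and checking for spatial overlap, can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the binary grid, the `(t, x, y)` record layout, and the shared-cell overlap rule are assumptions, and the Siamese Neural Network matching stage is omitted.

```python
def rasterize(points, t_start, t_end, bounds, size=8):
    """Draw the points that fall in [t_start, t_end) onto a size x size binary grid.

    points: list of (t, x, y) records; bounds: (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = bounds
    grid = [[0] * size for _ in range(size)]
    for t, x, y in points:
        inside_window = t_start <= t < t_end
        inside_bounds = xmin <= x <= xmax and ymin <= y <= ymax
        if inside_window and inside_bounds:
            col = min(int((x - xmin) / (xmax - xmin) * size), size - 1)
            row = min(int((y - ymin) / (ymax - ymin) * size), size - 1)
            grid[row][col] = 1
    return grid


def spatially_overlap(grid_a, grid_b):
    """True if the two trajectory images share at least one occupied cell."""
    return any(
        cell_a and cell_b
        for row_a, row_b in zip(grid_a, grid_b)
        for cell_a, cell_b in zip(row_a, row_b)
    )
```

Pairs that fail the overlap check can be discarded cheaply, so only candidate pairs would reach the learned matcher.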

[CV-108] MAGMA: Manifold Regularization for MAEs WACV2025

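[Quick Read]: This paper addresses the lack of regularization of visual features in Transformer-based Masked Autoencoders (MAEs), which can hurt their performance. The key to the solution is MAGMA, a novel batch-wide layer-wise regularization loss applied to the representations of different Transformer layers. Plugging in this loss significantly improves the performance of MAE-based models, and the paper also demonstrates its benefit when optimizing other generic self-supervised learning methods (such as VICReg and SimCLR), broadening the impact of the approach.

Link: https://arxiv.org/abs/2412.02871
Authors: Alin Dondera,Anuj Singh,Hadi Jamali-Rad
Keywords: Masked Autoencoders, self-supervised learning, generating positive, contrastive frameworks, important divide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in WACV 2025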

Abstract:Masked Autoencoders (MAEs) are an important divide in self-supervised learning (SSL) due to their independence from augmentation techniques for generating positive (and/or negative) pairs as in contrastive frameworks. Their masking and reconstruction strategy also nicely aligns with SSL approaches in natural language processing. Most MAEs are built upon Transformer-based architectures where visual features are not regularized as opposed to their convolutional neural network (CNN) based counterparts, which can potentially hinder their performance. To address this, we introduce MAGMA, a novel batch-wide layer-wise regularization loss applied to representations of different Transformer layers. We demonstrate that by plugging in the proposed regularization loss, one can significantly improve the performance of MAE-based models. We further demonstrate the impact of the proposed loss on optimizing other generic SSL approaches (such as VICReg and SimCLR), broadening the impact of the proposed approach. Our code base can be found at this https URL.

[CV-109] Memory-efficient Continual Learning with Neural Collapse Contrastive WACV2025

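[Quick Read]: This paper addresses catastrophic forgetting in continual learning (CL), in particular the representation overlap that arises in contrastive methods because the "soft relationships" between samples shift as the data distribution changes. The key to the solution is a new representation learning loss, Focal Neural Collapse Contrastive (FNC2), which effectively balances "hard relationships" and "soft relationships" and thereby reduces representation overlap across tasks. The paper also introduces the Hardness-Softness Distillation (HSD) loss to progressively preserve the knowledge gained from these relationships across tasks. The method substantially reduces reliance on memory and, even without memory, rivals rehearsal-based methods, offering a compelling solution for data privacy concerns.

Link: https://arxiv.org/abs/2412.02865
Authors: Trung-Anh Dang,Vincent Nguyen,Ngoc-Son Vu,Christel Vrain
Keywords: enhancing knowledge transfer, improved representation quality, Neural Collapse Contrastive, significantly improved representation, significantly improved
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at WACV 2025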

Abstract:Contrastive learning has significantly improved representation quality, enhancing knowledge transfer across tasks in continual learning (CL). However, catastrophic forgetting remains a key challenge, as contrastive based methods primarily focus on “soft relationships” or “softness” between samples, which shift with changing data distributions and lead to representation overlap across tasks. Recently, the newly identified Neural Collapse phenomenon has shown promise in CL by focusing on “hard relationships” or “hardness” between samples and fixed prototypes. However, this approach overlooks “softness”, crucial for capturing intra-class variability, and this rigid focus can also pull old class representations toward current ones, increasing forgetting. Building on these insights, we propose Focal Neural Collapse Contrastive (FNC2), a novel representation learning loss that effectively balances both soft and hard relationships. Additionally, we introduce the Hardness-Softness Distillation (HSD) loss to progressively preserve the knowledge gained from these relationships across tasks. Our method outperforms state-of-the-art approaches, particularly in minimizing memory reliance. Remarkably, even without the use of memory, our approach rivals rehearsal-based methods, offering a compelling solution for data privacy concerns.

[CV-110] Is Large-Scale Pretraining the Secret to Good Domain Generalization?

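[Quick Read]: This paper asks whether, in Multi-Source Domain Generalization (DG), combining features from pretrained models with new features learned from source data genuinely improves generalization, or whether benchmark gains simply reflect stronger pretraining. It argues that having perceptually similar data in pretraining is not enough to guarantee generalization; what matters is how well those data were learned. To capture this, the paper introduces the Alignment Hypothesis: final DG performance depends strongly on how well image embeddings align with class label text embeddings. The experiments show that existing DG methods struggle on Out-of-pretraining (OOP) data while recent methods excel on In-pretraining (IP) data, highlighting the need for DG methods that generalize beyond pretraining alignment.

Link: https://arxiv.org/abs/2412.02856
Authors: Piotr Teterwak,Kuniaki Saito,Theodoros Tsiligkaridis,Bryan A. Plummer,Kate Saenko
Keywords: Multi-Source Domain Generalization, unseen target domains, Domain Generalization, multiple source domains, Multi-Source Domain
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: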

Abstract:Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.
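One way to picture the Alignment Hypothesis is to score each evaluation sample by the cosine similarity between its image embedding and the text embedding of its class label, then split the data on that score. The sketch below does exactly that. The threshold rule and the `(sample_id, image_embedding, text_embedding)` layout are hypothetical; the abstract does not state how the IP and OOP splits are actually constructed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def split_by_alignment(samples, threshold):
    """Split samples into (in_pretraining, out_of_pretraining) id lists.

    samples: iterable of (sample_id, image_embedding, label_text_embedding).
    A sample counts as well aligned when its cosine score reaches the threshold
    (an assumed rule for this sketch).
    """
    in_pretraining, out_of_pretraining = [], []
    for sample_id, image_emb, text_emb in samples:
        if cosine(image_emb, text_emb) >= threshold:
            in_pretraining.append(sample_id)
        else:
            out_of_pretraining.append(sample_id)
    return in_pretraining, out_of_pretraining
```

Reporting accuracy separately on the two lists would then show whether a method only does well where pretraining alignment is already high.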

[CV-111] Optimized CNNs for Rapid 3D Point Cloud Object Recognition

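[Quick Read]: This paper addresses efficient object detection in 3D point clouds. The key to the solution is a feature-centric voting mechanism used to construct convolutional layers that exploit the typical sparsity of the input data. The paper proposes combining sparse convolutional layers with L1 regularization to increase sparsity in intermediate layers and thereby handle large-scale 3D data effectively. Experiments show that the Vote3Deep models perform strongly on the MVTec 3D-AD benchmark, surpassing the previous state of the art in both laser-only and combined laser-vision approaches while maintaining competitive processing speeds suitable for real-time applications.

Link: https://arxiv.org/abs/2412.02855
Authors: Tianyi Lyu,Dian Gu,Peiyuan Chen,Yaoting Jiang,Zhenhong Zhang,Huadong Pang,Li Zhou,Yiping Dong
Keywords: efficiently detecting objects, point clouds, convolutional neural networks, study introduces, efficiently detecting
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages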

Abstract:This study introduces a method for efficiently detecting objects within 3D point clouds using convolutional neural networks (CNNs). Our approach adopts a unique feature-centric voting mechanism to construct convolutional layers that capitalize on the typical sparsity observed in input data. We explore the trade-off between accuracy and speed across diverse network architectures and advocate for integrating an $\mathcal{L}_1$ penalty on filter activations to augment sparsity within intermediate layers. This research pioneers the proposal of sparse convolutional layers combined with $\mathcal{L}_1$ regularization to effectively handle large-scale 3D data processing. Our method’s efficacy is demonstrated on the MVTec 3D-AD object detection benchmark. The Vote3Deep models, with just three layers, outperform the previous state-of-the-art in both laser-only approaches and combined laser-vision methods. Additionally, they maintain competitive processing speeds. This underscores our approach’s capability to substantially enhance detection performance while ensuring computational efficiency suitable for real-time applications.
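An L1 penalty on activations adds the sum of absolute activation values to the training loss, which pushes many intermediate activations toward zero. The minimal sketch below shows the arithmetic only. The function names and the default weight are assumptions, and the sparse convolution and voting mechanism from the paper are not modeled.

```python
def l1_activation_penalty(activations, weight):
    """Weighted sum of absolute activation values for one layer."""
    return weight * sum(abs(a) for a in activations)


def sparsity(activations, tol=0.0):
    """Fraction of activations whose magnitude is at most tol."""
    return sum(1 for a in activations if abs(a) <= tol) / len(activations)


def total_loss(task_loss, layer_activations, weight=1e-3):
    """Task loss plus the L1 penalty summed over all intermediate layers."""
    penalty = sum(l1_activation_penalty(acts, weight) for acts in layer_activations)
    return task_loss + penalty
```

A larger `weight` trades detection accuracy for sparser intermediate layers, which is the accuracy versus speed trade-off the abstract refers to.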

[CV-112] Effortless Efficiency: Low-Cost Pruning of Diffusion Models

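[Quick Read]: This paper addresses the growing computational complexity and memory demands that larger model sizes impose on the practical deployment of diffusion models. The key to the solution is a model-agnostic structural pruning framework that requires no retraining and learns a differentiable mask to sparsify the model. To preserve the quality of the final denoised latent after pruning, the authors design an end-to-end pruning objective that spans the entire diffusion process. Because end-to-end pruning is memory-intensive, they further propose time step gradient checkpointing, which significantly reduces memory usage during optimization and makes end-to-end pruning feasible within a limited memory budget. Experiments show that the method can prune up to 20% of the parameters without noticeably degrading performance and without retraining the model.

Link: https://arxiv.org/abs/2412.02852
Authors: Yang Zhang,Er Jin,Yanfei Dong,Ashkan Khakzar,Philip Torr,Johannes Stegmaier,Kenji Kawaguchi
Keywords: achieved impressive advancements, Diffusion models, vision tasks, Diffusion, achieved impressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL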

Abstract:Diffusion models have achieved impressive advancements in various vision tasks. However, these gains often rely on increasing model size, which escalates computational complexity and memory demands, complicating deployment, raising inference costs, and causing environmental impact. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to retain the model performance. Retraining a modern large diffusion model is extremely costly and resource-intensive, which limits the practicality of these methods. In this work, we achieve low-cost diffusion pruning without retraining by proposing a model-agnostic structural pruning framework for diffusion models that learns a differentiable mask to sparsify the model. To ensure effective pruning that preserves the quality of the final denoised latent, we design a novel end-to-end pruning objective that spans the entire diffusion process. As end-to-end pruning is memory-intensive, we further propose time step gradient checkpointing, a technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on state-of-the-art U-Net diffusion models SDXL and diffusion transformers (FLUX) demonstrate that our method can effectively prune up to 20% parameters with minimal perceptible performance degradation, and notably, without the need for model retraining. We also showcase that our method can still prune on top of time step distilled diffusion models.

[CV-113] Enhancing Robustness of CLIP to Common Corruptions through Bimodal Test-Time Adaptation

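[Quick Read]: This paper addresses the lack of robustness of open-vocabulary classification models such as CLIP to common image corruptions. The key to the solution is a bimodal Test-Time Adaptation (TTA) method called \framework. The method not only adapts the visual encoder for better image feature extraction but also strengthens the alignment between image and text features by promoting a stronger association between the image class prototypes (computed using pseudo-labels) and the corresponding text features, thereby improving CLIP's robustness under image corruption.

Link: https://arxiv.org/abs/2412.02837
Authors: Sarthak Kumar Maharana,Baoming Zhang,Leonid Karlinsky,Rogerio Feris,Yunhui Guo
Keywords: Language Image Pretraining, Contrastive Language Image, Contrastive Language, remains poorly understood, open-vocabulary classification models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review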

Abstract:Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at increasing severity levels during test-time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose \framework, a bimodal TTA method specially designed to improve CLIP’s robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for better image feature extraction but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in TTA for CLIP, specifically for domains involving image corruption. Particularly, with a ViT-B/16 vision backbone, we obtain mean accuracy improvements of 9.7%, 5.94%, and 5.12% for CIFAR-10C, CIFAR-100C, and ImageNet-C, respectively.
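The abstract describes image class prototypes computed from pseudo-labels and an objective that ties each prototype to its class text feature. A rough sketch of that idea follows. The mean-feature prototype and the `1 - cosine` loss are assumptions chosen for illustration, since the abstract does not give the exact formulation.

```python
import math


def class_prototypes(image_features, pseudo_labels):
    """Average the image features that share a pseudo-label."""
    sums, counts = {}, {}
    for feat, label in zip(image_features, pseudo_labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def prototype_text_alignment_loss(prototypes, text_features):
    """Mean (1 - cosine) between each prototype and its class text feature.

    text_features is indexed by class label; the loss form is an assumption.
    """
    gaps = [1.0 - cosine(proto, text_features[label]) for label, proto in prototypes.items()]
    return sum(gaps) / len(gaps)
```

In a test-time adaptation loop, the pseudo-labels would come from the model's own zero-shot predictions on the unlabeled corrupted batch, and minimizing this loss would pull each prototype toward its text feature.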

[CV-114] FLAME 3 Dataset: Unleashing the Power of Radiometric Thermal UAV Imagery for Wildfire Management

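[Quick Read]: This paper addresses the lack of available data that limits the use of radiometric thermal imaging sensors on unmanned aerial vehicles (UAVs) for aerial wildfire management. The key to the solution is a set of methods for collecting and processing synchronized visual spectrum and radiometric thermal imagery, together with the FLAME 3 dataset, the first comprehensive collection of side-by-side visual spectrum and radiometric thermal imagery of wildland fires. The dataset includes radiometric thermal Tag Image File Format (TIFF) files and nadir thermal plots, and the accompanying processing pipeline simplifies and partially automates each step from data collection to neural network input. FLAME 3 aims to spur a new generation of machine learning models that use radiometric thermal imagery, simplifying tasks such as aerial wildfire detection, segmentation, and assessment.

Link: https://arxiv.org/abs/2412.02831
Authors: Bryce Hopkins,Leo ONeill,Michael Marinaccio,Eric Rowell,Russell Parsons,Sarah Flanary,Irtija Nazim,Carl Seielstad,Fatemeh Afghah
Keywords: offers significant potential, unmanned aerial vehicles, advancing AI-driven aerial, thermal imaging sensors, offers significant
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 Figures, 8 Tables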

Abstract:The increasing accessibility of radiometric thermal imaging sensors for unmanned aerial vehicles (UAVs) offers significant potential for advancing AI-driven aerial wildfire management. Radiometric imaging provides per-pixel temperature estimates, a valuable improvement over non-radiometric data that requires irradiance measurements to be converted into visible images using RGB color palettes. Despite its benefits, this technology has been underutilized largely due to a lack of available data for researchers. This study addresses this gap by introducing methods for collecting and processing synchronized visual spectrum and radiometric thermal imagery using UAVs at prescribed fires. The included imagery processing pipeline drastically simplifies and partially automates each step from data collection to neural network input. Further, we present the FLAME 3 dataset, the first comprehensive collection of side-by-side visual spectrum and radiometric thermal imagery of wildland fires. Building on our previous FLAME 1 and FLAME 2 datasets, FLAME 3 includes radiometric thermal Tag Image File Format (TIFFs) and nadir thermal plots, providing a new data type and collection method. This dataset aims to spur a new generation of machine learning models utilizing radiometric thermal imagery, potentially trivializing tasks such as aerial wildfire detection, segmentation, and assessment. A single-burn subset of FLAME 3 for computer vision applications is available on Kaggle with the full 6 burn set available to readers upon request.

[CV-115] Many-MobileNet: Multi-Model Augmentation for Robust Retinal Disease Classification

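[Quick Read]: This paper addresses the poor generalization caused by data scarcity and overfitting in retinal disease classification. The key to the solution is Many-MobileNet, an efficient model fusion strategy that trains multiple lightweight CNN models with distinct data augmentation strategies and different model complexities and then fuses them, achieving robust generalization in data-scarce domains while balancing computational efficiency with feature extraction capability.

Link: https://arxiv.org/abs/2412.02825
Authors: Hao Wang,Wenhui Zhu,Xuanzhao Dong,Yanxi Chen,Xin Li,Peijie Qiu,Xiwen Chen,Vamsi Krishna Vasa,Yujian Xiong,Oana M. Dumitrascu,Abolfazl Razi,Yalin Wang
Keywords: lightweight CNN architecture, retinal disease classification, CNN architecture, lightweight CNN, propose Many-MobileNet
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: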

Abstract:In this work, we propose Many-MobileNet, an efficient model fusion strategy for retinal disease classification using lightweight CNN architecture. Our method addresses key challenges such as overfitting and limited dataset variability by training multiple models with distinct data augmentation strategies and different model complexities. Through this fusion technique, we achieved robust generalization in data-scarce domains while balancing computational efficiency with feature extraction capabilities.
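The abstract does not say how the individually trained models are fused. As one common possibility, the sketch below averages the class probabilities produced by several models, optionally with per-model weights. The averaging rule and both function names are assumptions, not the Many-MobileNet fusion strategy itself.

```python
def fuse_probabilities(per_model_probs, weights=None):
    """Weighted average of class-probability vectors, one vector per model."""
    n_models = len(per_model_probs)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # plain averaging by default
    n_classes = len(per_model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_model_probs))
        for c in range(n_classes)
    ]


def predict(per_model_probs, weights=None):
    """Index of the class with the highest fused probability."""
    fused = fuse_probabilities(per_model_probs, weights)
    return max(range(len(fused)), key=fused.__getitem__)
```

Because each model sees different augmentations, their errors tend to be less correlated, which is why even a simple average can generalize better than any single member.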

[CV-116] Temporally Consistent Dynamic Scene Graphs: An End-to-End Approach for Action Tracklet Generation

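[Quick Read]: This paper addresses the problem of capturing dynamic interactions in video understanding, in particular the challenge of maintaining temporal consistency across video sequences. The key to the solution is the Temporally Consistent Dynamic Scene Graphs (TCDSG) framework, which detects, tracks, and links subject-object relationships across time to generate action tracklets, that is, temporally consistent sequences of entities and their interactions. Its core technique is a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, that ensures temporal coherence and robust tracking over extended sequences. The method improves temporal recall@k by over 60% on the Action Genome, OpenPVSG, and MEVA datasets and is the first to augment MEVA with persistent object ID annotations for comprehensive tracklet generation. By integrating spatial and temporal dynamics, the work sets a new standard for multi-frame video analysis and opens new avenues for high-impact applications such as surveillance and autonomous navigation.

Link: https://arxiv.org/abs/2412.02808
Authors: Raphael Ruschel,Md Awsafur Rahman,Hardik Prajapati,Suya You,B. S. Manjuanth
Keywords: Understanding video content, advancing real-world applications, Understanding video, activity recognition, Dynamic Scene Graphs
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: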

Abstract:Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets, temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.

[CV-117] STORM: Strategic Orchestration of Modalities for Rare Event Classification

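[Quick Read]: This paper addresses how to select the most effective modalities in the biomedical domain to improve the classification performance of artificial intelligence (AI) methods. Traditional approaches rely on manual trial and error and lack a systematic framework for identifying the most relevant modalities. The key to the solution is STORM, an entropy-based algorithm that systematically evaluates the information content of individual modalities and their combinations, identifying the discriminative features essential for rare class classification tasks. A case study on seizure onset zone detection demonstrates the algorithm's effectiveness in improving classification performance.

Link: https://arxiv.org/abs/2412.02805
Authors: Payal Kamboj,Ayan Banerjee,Sandeep K.S. Gupta
Keywords: expert insights, artificial intelligence, insights are crucial, modalities, informative modalities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE Asilomar Conference on Signals, Systems, and Computers, 2024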

Abstract:In domains such as biomedicine, expert insights are crucial for selecting the most informative modalities for artificial intelligence (AI) methodologies. However, using all available modalities poses challenges, particularly in determining the impact of each modality on performance and optimizing their combinations for accurate classification. Traditional approaches resort to manual trial and error methods, lacking systematic frameworks for discerning the most relevant modalities. Moreover, although multi-modal learning enables the integration of information from diverse sources, utilizing all available modalities is often impractical and unnecessary. To address this, we introduce an entropy-based algorithm, STORM, to solve the modality selection problem for rare events. This algorithm systematically evaluates the information content of individual modalities and their combinations, identifying the most discriminative features essential for rare class classification tasks. Through a seizure onset zone detection case study, we demonstrate the efficacy of our algorithm in enhancing classification performance. By selecting a useful subset of modalities, our approach paves the way for more efficient AI-driven biomedical analyses, thereby advancing disease diagnosis in clinical settings.

[CV-118] Gaussian Splatting Under Attack: Investigating Adversarial Noise in 3D Objects NEURIPS2024

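[Quick Read]: This paper addresses the vulnerability of 3D models to adversarial attacks, in particular radiance field models reconstructed with 3D Gaussian Splatting. The key to the solution is an adversarial attack called the Masked Iterative Fast Gradient Sign Method (M-IFGSM), which applies perturbations to masked regions of the object of interest and generates nearly imperceptible adversarial noise that markedly reduces the accuracy and confidence of the CLIP vision-language model in zero-shot object detection. Experiments on eight objects from the CO3D dataset show that the method effectively degrades detection performance, underscoring the risks adversarial attacks pose to 3D models in critical applications such as autonomous driving, robotics, and surveillance.

Link: https://arxiv.org/abs/2412.02803
Authors: Abdurrahman Zeybey,Mehmet Ergezer,Tommy Nguyen
Keywords: Gaussian Splatting, enabling high-quality view, high-quality view synthesis, Splatting has advanced, Iterative Fast Gradient
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted to Safe Generative AI Workshop @ NeurIPS 2024: this https URL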

Abstract:3D Gaussian Splatting has advanced radiance field reconstruction, enabling high-quality view synthesis and fast rendering in 3D modeling. While adversarial attacks on object detection models are well-studied for 2D images, their impact on 3D models remains underexplored. This work introduces the Masked Iterative Fast Gradient Sign Method (M-IFGSM), designed to generate adversarial noise targeting the CLIP vision-language model. M-IFGSM specifically alters the object of interest by focusing perturbations on masked regions, degrading the performance of CLIP’s zero-shot object detection capability when applied to 3D models. Using eight objects from the Common Objects 3D (CO3D) dataset, we demonstrate that our method effectively reduces the accuracy and confidence of the model, with adversarial noise being nearly imperceptible to human observers. The top-1 accuracy in original model renders drops from 95.4% to 12.5% for train images and from 91.2% to 35.4% for test images, with confidence levels reflecting this shift from true classification to misclassification, underscoring the risks of adversarial attacks on 3D models in applications such as autonomous driving, robotics, and surveillance. The significance of this research lies in its potential to expose vulnerabilities in modern 3D vision models, including radiance fields, prompting the development of more robust defenses and security measures in critical real-world applications.
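A masked iterative fast gradient sign attack repeatedly steps each masked pixel in the direction that increases the loss, while keeping the total change within a small budget. The toy sketch below shows that update rule on a flat list of pixel values. The `grad_fn` callback, the `eps` and `alpha` parameters, and the clipping to `[0, 1]` are assumptions for illustration; the paper's attack targets CLIP on renders of 3D Gaussian Splatting models, which is not reproduced here.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def masked_ifgsm(x, mask, grad_fn, eps, alpha, steps):
    """Iterative FGSM restricted to masked pixels.

    x: original pixel values in [0, 1]; mask: 1 where perturbation is allowed;
    grad_fn: returns the loss gradient with respect to the current input.
    """
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        for i in range(len(x_adv)):
            if not mask[i]:
                continue  # pixels outside the object mask stay untouched
            stepped = x_adv[i] + alpha * sign(grad[i])
            # Project back into the eps-ball around the original pixel.
            stepped = max(x[i] - eps, min(x[i] + eps, stepped))
            # Keep the result a valid pixel intensity.
            x_adv[i] = max(0.0, min(1.0, stepped))
    return x_adv
```

For a quick check, a linear score `w . x` with loss `-w . x` has the constant gradient `-w`, so `grad_fn = lambda x_adv: [-wi for wi in w]` drives each masked pixel against its weight until the `eps` budget is reached.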

[CV-119] Grayscale to Hyperspectral at Any Resolution Using a Phase-Only Lens

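[Quick Read]: This paper addresses the reconstruction of a high-resolution H×W×31 hyperspectral image from a grayscale snapshot measurement captured with a single diffractive optic and a filterless panchromatic photosensor. The key to the solution is a conditional denoising diffusion model that maps small grayscale measurement patches to hyperspectral patches, with global physics-based guidance synchronizing the predictions across patches. The model can be trained on small hyperspectral datasets and then used to reconstruct hyperspectral images of arbitrary size, and by drawing multiple samples with different seeds it also produces useful uncertainty maps. The approach achieves state-of-the-art performance on previous snapshot hyperspectral benchmarks and lays the foundation for a new class of high-resolution, compact, and light-efficient hyperspectral imagers.

Link: https://arxiv.org/abs/2412.02798
Authors: Dean Hazineh,Federico Capasso,Todd Zickler
Keywords: filterless panchromatic photosensor, single diffractive optic, times, panchromatic photosensor, single diffractive
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
Comments: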

Abstract:We consider the problem of reconstructing a $H \times W \times 31$ hyperspectral image from a $H \times W$ grayscale snapshot measurement that is captured using a single diffractive optic and a filterless panchromatic photosensor. This problem is severely ill-posed, and we present the first model that is able to produce high-quality results. We train a conditional denoising diffusion model that maps a small grayscale measurement patch to a hyperspectral patch. We then deploy the model to many patches in parallel, using global physics-based guidance to synchronize the patch predictions. Our model can be trained using small hyperspectral datasets and then deployed to reconstruct hyperspectral images of arbitrary size. Also, by drawing multiple samples with different seeds, our model produces useful uncertainty maps. We show that our model achieves state-of-the-art performance on previous snapshot hyperspectral benchmarks where reconstruction is better conditioned. Our work lays the foundation for a new class of high-resolution hyperspectral imagers that are compact and light-efficient.
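Drawing several reconstructions with different seeds makes it possible to summarize, per pixel, how much the samples disagree. The sketch below computes a per-pixel mean and standard deviation over a stack of single-channel samples. Using the standard deviation as the uncertainty measure is an assumption; the abstract does not say which statistic the authors use.

```python
import math


def mean_and_std_maps(samples):
    """Per-pixel mean and population standard deviation across samples.

    samples: list of 2D grids (lists of rows) with identical shapes, one grid
    per random seed.
    """
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    mean = [
        [sum(s[r][c] for s in samples) / n for c in range(cols)]
        for r in range(rows)
    ]
    std = [
        [
            math.sqrt(sum((s[r][c] - mean[r][c]) ** 2 for s in samples) / n)
            for c in range(cols)
        ]
        for r in range(rows)
    ]
    return mean, std
```

Pixels where the standard deviation is high are those where the ill-posed inverse problem admits several plausible spectra, so they are the natural places to flag as uncertain.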

[CV-120] Hijacking Vision-and-Language Navigation Agents with Adversarial Environmental Attacks WACV2025

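[Quick Read]: This paper studies how local modifications to the appearance of the environment can hijack the behavior of assistive embodied agents performing tasks in open-world environments. The key to the solution is a whitebox adversarial attack on the Vision-and-Language Navigation (VLN) task. The attack optimizes the appearance of a 3D attack object so that pretrained VLN agents that observe it perform the behavior the attacker wants, even when that behavior contradicts the user's instructions. Experiments show that the attack can make agents ignore their instructions and execute alternative actions, including for instructions and agent paths not considered when optimizing the attack, significantly reducing the agents' ability to follow user instructions.

Link: https://arxiv.org/abs/2412.02795
Authors: Zijiao Yang,Xiangxi Shi,Eric Slyman,Stefan Lee
Keywords: Assistive embodied agents, Assistive embodied, impact labor tasks, significantly impact labor, in-home care
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by WACV 2025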

Abstract:Assistive embodied agents that can be instructed in natural language to perform tasks in open-world environments have the potential to significantly impact labor tasks like manufacturing or in-home care – benefiting the lives of those who come to depend on them. In this work, we consider how this benefit might be hijacked by local modifications in the appearance of the agent’s operating environment. Specifically, we take the popular Vision-and-Language Navigation (VLN) task as a representative setting and develop a whitebox adversarial attack that optimizes a 3D attack object’s appearance to induce desired behaviors in pretrained VLN agents that observe it in the environment. We demonstrate that the proposed attack can cause VLN agents to ignore their instructions and execute alternative actions after encountering the attack object – even for instructions and agent paths not considered when optimizing the attack. For these novel settings, we find our attacks can induce early-termination behaviors or divert an agent along an attacker-defined multi-step trajectory. Under both conditions, environmental attacks significantly reduce agent capabilities to successfully follow user instructions.

[CV-121] Mixture of Physical Priors Adapter for Parameter-Efficient Fine-Tuning

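[Quick Read]: This paper addresses the tendency of parameter-efficient fine-tuning (PEFT) methods to oversimplify representations when the data has high-rank or high-frequency components. The key to the solution is a new approach that models network weights using physical priors, enabling more accurate approximations. Specifically, the paper uses three foundational equations, heat diffusion, wave propagation, and Poisson's steady-state equation, which emphasize local smoothness, long-range interactions, and global equilibrium, respectively. To combine these priors effectively, it proposes the Mixture of Physical Priors Adapter (MoPPA), implemented with an efficient Discrete Cosine Transform (DCT). A route regularization mechanism dynamically adjusts the contribution of each prior, making MoPPA a lightweight, plug-and-play module that integrates seamlessly into Transformer architectures and adapts its complexity to the local context. Experiments show that MoPPA improves PEFT accuracy by up to 2.1% on VTAB-1K image classification, and its effectiveness and adaptability are validated across various vision backbones.

Link: https://arxiv.org/abs/2412.02759
Authors: Zhaozhi Wang,Conghu Li,Qixiang Ye,Tong Zhang
Keywords: parameter-efficient fine-tuning, methods rely, rely on low-rank, low-rank representations, Discrete Cosine Transform
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures, 9 tables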

Abstract:Most parameter-efficient fine-tuning (PEFT) methods rely on low-rank representations to adapt models. However, these approaches often oversimplify representations, particularly when the underlying data has high-rank or high-frequency components. This limitation hinders the model’s ability to capture complex data interactions effectively. In this paper, we propose a novel approach that models network weights by leveraging a combination of physical priors, enabling more accurate approximations. We use three foundational equations – heat diffusion, wave propagation, and Poisson’s steady-state equation – each contributing distinctive modeling properties: heat diffusion enforces local smoothness, wave propagation facilitates long-range interactions, and Poisson’s equation captures global equilibrium. To combine these priors effectively, we introduce the Mixture of Physical Priors Adapter (MoPPA), using an efficient Discrete Cosine Transform (DCT) implementation. To dynamically balance these priors, a route regularization mechanism is designed to adaptively tune their contributions. MoPPA serves as a lightweight, plug-and-play module that seamlessly integrates into transformer architectures, with adaptable complexity depending on the local context. Specifically, using MAE pre-trained ViT-B, MoPPA improves PEFT accuracy by up to 2.1% on VTAB-1K image classification with a comparable number of trainable parameters, and advantages are further validated through experiments across various vision backbones, showcasing MoPPA’s effectiveness and adaptability. The code will be made publicly available.
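To see why a DCT pairs naturally with the heat diffusion prior, note that cosine modes are the eigenfunctions of the heat equation with reflecting boundaries, so diffusion reduces to damping each DCT coefficient. The 1D sketch below illustrates only that textbook fact. It is not MoPPA: the function names, the `alpha` diffusivity, and the use of continuous-domain eigenvalues `(pi * k / n) ** 2` are assumptions for illustration.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine mode."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def heat_smooth(signal, t, alpha=1.0):
    """Diffuse a 1D signal for time t by damping its DCT coefficients."""
    n = len(signal)
    basis = dct_matrix(n)
    # Forward transform: project the signal onto each cosine mode.
    coeffs = [sum(b * x for b, x in zip(row, signal)) for row in basis]
    # Mode k decays as exp(-alpha * (pi * k / n)^2 * t); mode 0 (the mean) is kept.
    decayed = [
        c * math.exp(-alpha * (math.pi * k / n) ** 2 * t)
        for k, c in enumerate(coeffs)
    ]
    # Inverse transform: the basis is orthonormal, so its transpose inverts it.
    return [sum(decayed[k] * basis[k][i] for k in range(n)) for i in range(n)]
```

High-frequency modes decay fastest, which is the local smoothing behavior the abstract attributes to the heat prior, while the total of the signal is preserved because the zero-frequency coefficient is untouched.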

[CV-122] MVCTrack: Boosting 3D Point Cloud Tracking via Multimodal-Guided Virtual Cues

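[Quick Read]: This paper addresses the challenges of 3D single object tracking in sparse and incomplete point cloud scenarios in autonomous driving and robotics. The key to the solution is the Multimodal-guided Virtual Cues Projection (MVCP) scheme, which generates virtual cues to enrich sparse point clouds. Specifically, MVCP integrates RGB sensors into LiDAR-based systems and uses a set of 2D detections to create dense 3D virtual cues that substantially mitigate point cloud sparsity. These virtual cues integrate naturally with existing LiDAR-based 3D trackers and yield substantial performance gains.

Link: https://arxiv.org/abs/2412.02734
Authors: Zhaofeng Hu,Sifan Zhou,Shibo Zhao,Zhihang Yuan
Keywords: single object tracking, single object, driving and robotics, Virtual Cues, object tracking
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: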

Abstract:3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to create dense 3D virtual cues that significantly improve the sparsity of point clouds. These virtual cues can naturally integrate with existing LiDAR-based 3D trackers, yielding substantial performance gains. Extensive experiments demonstrate that our method achieves competitive performance on the NuScenes dataset.

[CV-123] Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications

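[Quick Read]: This paper addresses the limited performance of existing geospatial foundation models across multiple tasks and high-resolution applications. The key to the solution is to incorporate temporal and location embeddings and to train on NASA's Harmonized Landsat and Sentinel-2 data archive, which markedly improves the model's performance and adaptability. Specifically, the Prithvi-EO-2.0 models use larger parameter counts (300M and 600M) together with these embeddings; on the GEO-Bench benchmark the 600M version outperforms the previous Prithvi-EO model by 8%, and the model also outperforms six other geospatial foundation models on remote sensing tasks from different domains and resolutions (from 0.1m to 15m). In addition, the early involvement of end-users and subject matter experts (SMEs) provided continuous feedback on model and dataset design and enabled successful customization for applications.

Link: https://arxiv.org/abs/2412.02732
Authors: Daniela Szwarcman,Sujit Roy,Paolo Fraccaro,Þorsteinn Elí Gíslason,Benedikt Blumenstiel,Rinki Ghosal,Pedro Henrique de Oliveira,Joao Lucas de Sousa Almeida,Rocco Sedona,Yanghui Kang,Srija Chakraborty,Sizhe Wang,Ankur Kumar,Myscon Truong,Denys Godwin,Hyunho Lee,Chia-Yu Hsu,Ata Akbari Asanjan,Besart Mujeci,Trevor Keenan,Paulo Arevalo,Wenwen Li,Hamed Alemohammad,Pontus Olofsson,Christopher Hain,Robert Kennedy,Bianca Zadrozny,Gabriele Cavallaro,Campbell Watson,Manil Maskey,Rahul Ramachandran,Juan Bernabe Moreno
Keywords: technical report presents, offers significant improvements, NASA Harmonized Landsat, report presents, technical report
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: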

Abstract:This technical report presents Prithvi-EO-2.0, a new geospatial foundation model that offers significant improvements over its predecessor, Prithvi-EO-1.0. Trained on 4.2M global time series samples from NASA’s Harmonized Landsat and Sentinel-2 data archive at 30m resolution, the new 300M and 600M parameter models incorporate temporal and location embeddings for enhanced performance across various geospatial tasks. Through extensive benchmarking with GEO-Bench, the 600M version outperforms the previous Prithvi-EO model by 8% across a range of tasks. It also outperforms six other geospatial foundation models when benchmarked on remote sensing tasks from different domains and resolutions (i.e. from 0.1m to 15m). The results demonstrate the versatility of the model in both classical earth observation and high-resolution applications. Early involvement of end-users and subject matter experts (SMEs) are among the key factors that contributed to the project’s success. In particular, SME involvement allowed for constant feedback on model and dataset design, as well as successful customization for diverse SME-led applications in disaster response, land use and crop mapping, and ecosystem dynamics monitoring. Prithvi-EO-2.0 is available on Hugging Face and IBM terratorch, with additional resources on GitHub. The project exemplifies the Trusted Open Science approach embraced by all involved organizations.

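[CV-124] emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation NEURIPS2024

Quick read: This paper addresses the limitations of computer-vision-based hand pose inference in virtual and augmented reality, such as occlusion, limited field of view, and poor lighting. The key to the solution is wrist-based surface electromyography (sEMG), a sensing modality that continuously captures the muscle activity driving hand motion. The paper presents the emg2pose benchmark, the largest publicly available dataset of high-quality hand pose labels and wrist sEMG recordings. It contains 2kHz, 16-channel sEMG data and pose labels from a 26-camera motion capture rig, covering 193 users, 370 hours, and 29 stages with diverse gestures. The benchmark provides competitive baselines and challenging tasks that evaluate real-world generalization across held-out users, sensor placements, and stages. The platform lets the machine learning community explore complex generalization problems and may significantly advance sEMG-based human-computer interaction.

Link: https://arxiv.org/abs/2412.02725
Authors: Sasha Salter,Richard Warren,Collin Schlager,Adrian Spurr,Shangchen Han,Rohin Bhasin,Yujun Cai,Peter Walkington,Anuoluwapo Bolarinwa,Robert Wang,Nathan Danielson,Josh Merel,Eftychios Pnevmatikakis,Jesse Marshall
Keywords: humans interact, hand pose, always-available hand pose, hand pose inference, pose
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Published at NeurIPS 2024 Datasets and Benchmarks Track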

Abstract:Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement, and existing sEMG models have required hundreds of users and device placements to effectively generalize. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, the largest publicly available dataset of high-quality hand pose labels and wrist sEMG recordings. emg2pose contains 2kHz, 16 channel sEMG and pose labels from a 26-camera motion capture rig for 193 users, 370 hours, and 29 stages with diverse gestures - a scale comparable to vision-based hand pose datasets. We provide competitive baselines and challenging tasks evaluating real-world generalization scenarios: held-out users, sensor placements, and stages. emg2pose provides the machine learning community a platform for exploring complex generalization problems, holding potential to significantly enhance the development of sEMG-based human-computer interactions.

[CV-125] SAMa: Material-aware 3D Selection and Segmentation

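Quick read: This paper addresses the decomposition of 3D assets into material parts, a common but highly manual task. The key to the solution is Select Any Material (SAMa), a material selection method that extends the SAM2 video selection model to the material domain and works with a variety of 3D representations. Exploiting the model's cross-view consistency, SAMa builds a 3D-consistent intermediate material-similarity representation, in the form of a point cloud, from a sparse set of views. Nearest-neighbour lookups in this similarity cloud efficiently reconstruct continuous selection masks over object surfaces that can be inspected from any view. The method is multiview-consistent by design, needs no contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. SAMa outperforms several strong baselines in selection accuracy and multiview consistency and supports applications such as replacing the diffuse-textured materials on a text-to-3D output or selecting and editing materials on NeRFs and 3D-Gaussians.

Link: https://arxiv.org/abs/2411.19322
Authors: Michael Fischer,Iliyan Georgiev,Thibault Groueix,Vladimir G. Kim,Tobias Ritschel,Valentin Deschaintre
Keywords: highly manual process, artists and creators, manual process, common task, task for artists
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project Page: this https URL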

Abstract:Decomposing 3D assets into material parts is a common task for artists and creators, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for various 3D representations. Building on the recently introduced SAM2 video selection model, we extend its capabilities to the material domain. We leverage the model’s cross-view consistency to create a 3D-consistent intermediate material-similarity representation in the form of a point cloud from a sparse set of views. Nearest-neighbour lookups in this similarity cloud allow us to efficiently reconstruct accurate continuous selection masks over objects’ surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. Our approach works on arbitrary 3D representations and outperforms several strong baselines in terms of selection accuracy and multiview consistency. It enables several compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output, or selecting and editing materials on NeRFs and 3D-Gaussians.
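
The nearest-neighbour lookup described in the abstract can be pictured with a minimal sketch. Everything here is an assumption for illustration: the data layout (a list of `(point, similarity)` pairs), the function names, and the 0.5 threshold are hypothetical, and the brute-force search stands in for whatever spatial index the authors actually use.

```python
import math


def nearest_similarity(query, cloud):
    """Return the similarity stored at the cloud point closest to `query`.

    `cloud` is a list of ((x, y, z), similarity) pairs (assumed layout).
    Brute-force search; a k-d tree would be the usual choice at scale.
    """
    _, similarity = min(cloud, key=lambda item: math.dist(item[0], query))
    return similarity


def select_mask(surface_points, cloud, threshold=0.5):
    """Mark each surface point as selected if its nearest cloud point
    has a similarity of at least `threshold` (hypothetical cut-off)."""
    return [nearest_similarity(p, cloud) >= threshold for p in surface_points]
```

Because the lookup depends only on 3D position, the same surface point receives the same value from every viewpoint, which is one way to read the abstract's claim of multiview consistency by design.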

[CV-126] Domain-Agnostic Stroke Lesion Segmentation Using Physics-Constrained Synthetic Data

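Quick read: This paper addresses stroke lesion segmentation in Magnetic Resonance Imaging (MRI), where the main challenge is the variability across clinical imaging domains: existing models struggle to generalise across different MRI acquisition parameters and sequences. The key to the solution is two physics-constrained approaches that use synthetic quantitative MRI (qMRI) images to improve the robustness and generalisability of segmentation models. The first trains a qMRI estimation model to predict qMRI maps from MPRAGE images, which are then used to simulate diverse MRI sequences for segmentation training. The second builds on prior work in synthetic data and generates qMRI maps from a dataset of tissue labels. Both approaches improve over the baseline nnUNet on a variety of out-of-distribution datasets, and the second outperforms the prior synthetic data method.

Link: https://arxiv.org/abs/2412.03318
Authors: Liam Chalcroft,Jenny Crinion,Cathy J. Price,John Ashburner
Keywords: Magnetic Resonance Imaging, clinical imaging domains, Magnetic Resonance, Resonance Imaging, MRI acquisition parameters
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: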

Abstract:Segmenting stroke lesions in Magnetic Resonance Imaging (MRI) is challenging due to diverse clinical imaging domains, with existing models struggling to generalise across different MRI acquisition parameters and sequences. In this work, we propose two novel physics-constrained approaches using synthetic quantitative MRI (qMRI) images to enhance the robustness and generalisability of segmentation models. We trained a qMRI estimation model to predict qMRI maps from MPRAGE images, which were used to simulate diverse MRI sequences for segmentation training. A second approach built upon prior work in synthetic data for stroke lesion segmentation, generating qMRI maps from a dataset of tissue labels. The proposed approaches improved over the baseline nnUNet on a variety of out-of-distribution datasets, with the second approach outperforming the prior synthetic data method.

[CV-127] Is JPEG AI going to change image forensics?

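Quick read: This paper examines the counter-forensic effects of the forthcoming JPEG AI standard, which is based on neural image compression, on deepfake image detection and image splicing localization. Through extensive experiments, it shows that JPEG AI processing introduces artifacts that closely resemble those of synthetic and spliced images, which raises the false alarm rate of existing forensic detectors and impairs their performance. The paper stresses that multimedia forensics researchers need to include JPEG AI images in their experimental setups and develop robust forensic techniques that distinguish neural compression artifacts from actual manipulations.

Link: https://arxiv.org/abs/2412.03261
Authors: Edoardo Daniele Cannas,Sara Mandelli,Natasa Popovic,Ayman Alkhateeb,Alessandro Gnutti,Paolo Bestagini,Stefano Tubaro
Keywords: deepfake image detection, image splicing localization, neural image compression, critical areas, standard based
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: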

Abstract:In this paper, we investigate the counter-forensic effects of the forthcoming JPEG AI standard based on neural image compression, focusing on two critical areas: deepfake image detection and image splicing localization. Neural image compression leverages advanced neural network algorithms to achieve higher compression rates while maintaining image quality. However, it introduces artifacts that closely resemble those generated by image synthesis techniques and image splicing pipelines, complicating the work of researchers when discriminating pristine from manipulated content. We comprehensively analyze JPEG AI’s counter-forensic effects through extensive experiments on several state-of-the-art detectors and datasets. Our results demonstrate that an increase in false alarms impairs the performance of leading forensic detectors when analyzing genuine content processed through JPEG AI. By exposing the vulnerabilities of the available forensic tools we aim to raise the urgent need for multimedia forensics researchers to include JPEG AI images in their experimental setups and develop robust forensic techniques to distinguish between neural compression artifacts and actual manipulations.

[CV-128] Hybrid deep learning-based strategy for the hepatocellular carcinoma cancer grade classification of HE stained liver histopathology images

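Quick read: This paper addresses the challenge of early diagnosis of hepatocellular carcinoma (HCC), in particular the time-consuming manual assessment of hematoxylin and eosin-stained whole slide images, which can lead to inconsistent decisions. The key to the solution is a hybrid deep learning architecture that uses transfer learning to extract features from pre-trained convolutional neural network (CNN) models and pairs them with a classifier made up of a sequence of fully connected layers. The model is developed on the public TCGA-LIHC database and validated on a database from Kasturba Gandhi Medical College (KMC). On the TCGA database, the hybrid model with a ResNet50-based feature extractor reaches 100% sensitivity, specificity, F1-score, and accuracy, with an AUC of 1.00; on the KMC database, with EfficientNetb3 as the feature extractor, it reaches 96.97% sensitivity, 98.85% specificity, 96.71% F1-score, 96.71% accuracy, and an AUC of 0.99. Compared with the pre-trained models, accuracy improves by 2% on TCGA-LIHC and 4% on KMC.

Link: https://arxiv.org/abs/2412.03084
Authors: Ajinkya Deshpande,Deep Gupta,Ankit Bhurane,Nisha Meshram,Sneha Singh,Petia Radeva
Keywords: Atlas Hepatocellular Carcinoma, Genome Atlas Hepatocellular, Hepatocellular carcinoma, Cancer Genome Atlas, Gandhi Medical College
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 14 figures, 9 tables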

Abstract:Hepatocellular carcinoma (HCC) is a common type of liver cancer whose early-stage diagnosis is a common challenge, mainly due to the manual assessment of hematoxylin and eosin-stained whole slide images, which is a time-consuming process and may lead to variability in decision-making. For accurate detection of HCC, we propose a hybrid deep learning-based architecture that uses transfer learning to extract the features from pre-trained convolutional neural network (CNN) models and a classifier made up of a sequence of fully connected layers. This study uses the publicly available The Cancer Genome Atlas Hepatocellular Carcinoma (TCGA-LIHC) database (n=491) for model development and a database from Kasturba Gandhi Medical College (KMC), India, for validation. The pre-processing step involves patch extraction, colour normalization, and augmentation that results in 3920 patches for the TCGA dataset. The developed hybrid deep neural network consisting of a CNN-based pre-trained feature extractor and a customized artificial neural network-based classifier is trained using five-fold cross-validation. For this study, eight different state-of-the-art models are trained and tested as feature extractors for the proposed hybrid model. The proposed hybrid model with ResNet50-based feature extractor provided the sensitivity, specificity, F1-score, accuracy, and AUC of 100.00%, 100.00%, 100.00%, 100.00%, and 1.00, respectively on the TCGA database. On the KMC database, EfficientNetb3 was the best-performing feature extractor, giving sensitivity, specificity, F1-score, accuracy, and AUC of 96.97%, 98.85%, 96.71%, 96.71%, and 0.99, respectively. The proposed hybrid models showed improvement in accuracy of 2% and 4% over the pre-trained models in TCGA-LIHC and KMC databases.
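
For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, F1-score, and accuracy follow from confusion-matrix counts. It assumes a binary (one-vs-rest) setting and non-empty classes; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and at
    least one positive prediction).
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }
```

F1 simplifies to 2·TP / (2·TP + FP + FN), so it ignores true negatives entirely. That is why F1 and accuracy can diverge when classes are imbalanced.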

[CV-129] Real-Time AIoT for UAV Antenna Interference Detection via Edge-Cloud Collaboration

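Quick read: This paper addresses communication interference caused by unauthorized or malfunctioning antennas in the fifth-generation (5G) era. The key to the solution is a computer-vision-based AIoT system that adopts an optimized edge-cloud collaboration (ECC+) mode combined with a keyframe selection algorithm (KSA) to reduce end-to-end latency (E2EL) and ensure reliable data transmission, in line with the core principles of ultra-reliable low-latency communication (URLLC). At the core of the system is an end-to-end antenna localization scheme based on the tracking-by-detection (TBD) paradigm, consisting of a detector (EdgeAnt) and a tracker (AntSort). EdgeAnt reaches a mean average precision (mAP) of 42.1% on a custom antenna interference source dataset and 38.9% mAP on the COCO dataset at low computational cost. Deployed on Jetson Xavier NX and Raspberry Pi 4B, it runs at real-time inference speeds; compared with the cloud-only (CO) mode, the ECC+ mode reduces E2EL by 88.9% and increases accuracy by 28.2%. The system also scales well to coordinated multi-UAV inspections.

Link: https://arxiv.org/abs/2412.03055
Authors: Jun Dong,Jintao Cheng,Jin Wu,Chengxi Zhang,Shunyi Zhao,Xiaoyu Tang
Keywords: maintaining network performance, eliminating communication interference, crucial for maintaining, maintaining network, interference sources
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: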

Abstract:In the fifth-generation (5G) era, eliminating communication interference sources is crucial for maintaining network performance. Interference often originates from unauthorized or malfunctioning antennas, and radio monitoring agencies must address numerous sources of such antennas annually. Unmanned aerial vehicles (UAVs) can improve inspection efficiency. However, the data transmission delay in the existing cloud-only (CO) artificial intelligence (AI) mode fails to meet the low latency requirements for real-time performance. Therefore, we propose a computer vision-based AI of Things (AIoT) system to detect antenna interference sources for UAVs. The system adopts an optimized edge-cloud collaboration (ECC+) mode, combining a keyframe selection algorithm (KSA), focusing on reducing end-to-end latency (E2EL) and ensuring reliable data transmission, which aligns with the core principles of ultra-reliable low-latency communication (URLLC). At the core of our approach is an end-to-end antenna localization scheme based on the tracking-by-detection (TBD) paradigm, including a detector (EdgeAnt) and a tracker (AntSort). EdgeAnt achieves state-of-the-art (SOTA) performance with a mean average precision (mAP) of 42.1% on our custom antenna interference source dataset, requiring only 3 million parameters and 14.7 GFLOPs. On the COCO dataset, EdgeAnt achieves 38.9% mAP with 5.4 GFLOPs. We deployed EdgeAnt on Jetson Xavier NX (TRT) and Raspberry Pi 4B (NCNN), achieving real-time inference speeds of 21.1 (1088) and 4.8 (640) frames per second (FPS), respectively. Compared with CO mode, the ECC+ mode reduces E2EL by 88.9% and increases accuracy by 28.2%. Additionally, the system offers excellent scalability for coordinated multi-UAV inspections. The detector code is publicly available at this https URL.

[CV-130] Fan-Beam CT Reconstruction for Unaligned Sparse-View X-ray Baggage Dataset

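Quick read: This paper addresses the difficulty of obtaining large-scale 3D labeled data and voxel representations in security baggage inspection, where stationary rather than rotating X-ray systems are the norm. The key to the solution is a calibration and reconstruction method that uses an unaligned sparse multi-view X-ray baggage dataset with extensive 2D labeling. Specifically, the method combines multi-spectral neural attenuation field reconstruction with pose optimization under a Linear pushbroom (LPB) camera model, and uses a color coding network to improve rendering consistency for novel views. The method aims to improve generalization in the security baggage inspection domain, where generalization is particularly challenging.

Link: https://arxiv.org/abs/2412.03036
Authors: Shin Kim
Keywords: Computed Tomography, reconstructs cross-sectional images, multiple directions, X-ray images, X-ray
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: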

Abstract:Computed Tomography (CT) is a technology that reconstructs cross-sectional images using X-ray images taken from multiple directions. In CT, hundreds of X-ray images acquired as the X-ray source and detector rotate around a central axis, are used for precise reconstruction. In security baggage inspection, X-ray imaging is also widely used; however, unlike the rotating systems in medical CT, stationary X-ray systems are more common, and publicly available reconstructed data are limited. This makes it challenging to obtain large-scale 3D labeled data and voxel representations essential for training. To address these limitations, our study presents a calibration and reconstruction method using an unaligned sparse multi-view X-ray baggage dataset, which has extensive 2D labeling. Our approach integrates multi-spectral neural attenuation field reconstruction with Linear pushbroom (LPB) camera model pose optimization, enhancing rendering consistency for novel views through color coding network. Our method aims to improve generalization within the security baggage inspection domain, where generalization is particularly challenging.

[CV-131] Assessing the performance of CT image denoisers using Laguerre-Gauss Channelized Hotelling Observer for lesion detection

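Quick read: This paper addresses the diagnostic image quality of denoised low-dose CT images. The key to the solution is to evaluate deep learning denoising algorithms with both task-agnostic metrics based on visual perception and data fidelity and a task-based detectability assessment (LCD) that measures the diagnostic quality of the denoised images. Although the deep learning denoisers outperform low-dose CT on task-agnostic metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), on the task-based LCD assessment they remain inferior to normal-dose CT images.

Link: https://arxiv.org/abs/2412.02920
Authors: Prabhat Kc,Rongping Zeng
Keywords: deep learning methods, deep learning denoising, deep learning, learning denoising algorithms, computer vision problems
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 2 pages, 2024 IEEE Nuclear Science Symposium (NSS), Medical Imaging Conference (MIC) and Room Temperature Semiconductor Detector Conference (RTSD)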

Abstract:The remarkable success of deep learning methods in solving computer vision problems, such as image classification, object detection, scene understanding, image segmentation, etc., has paved the way for their application in biomedical imaging. One such application is in the field of CT image denoising, whereby deep learning methods are proposed to recover denoised images from noisy images acquired at low radiation. Outputs derived from applying deep learning denoising algorithms may appear clean and visually pleasing; however, the underlying diagnostic image quality may not be on par with their normal-dose CT counterparts. In this work, we assessed the image quality of deep learning denoising algorithms by making use of visual perception- and data fidelity-based task-agnostic metrics (like the PSNR and the SSIM) - commonly used in the computer vision - and a task-based detectability assessment (the LCD) - extensively used in the CT imaging. When compared against normal-dose CT images, the deep learning denoisers outperformed low-dose CT based on metrics like the PSNR (by 2.4 to 3.8 dB) and SSIM (by 0.05 to 0.11). However, based on the LCD performance, the detectability using quarter-dose denoised outputs was inferior to that obtained using normal-dose CT scans.
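
To make the task-agnostic metric concrete, here is a minimal PSNR sketch over flattened pixel lists. The function name and the `data_range` argument are choices made for this example; the paper's exact evaluation protocol (normalization, region of interest) is not given in the abstract.

```python
import math


def psnr(reference, test, data_range):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    `data_range` is the maximum possible pixel value span (for example 255
    for 8-bit images); it is an input here because CT intensity ranges vary.
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, the reported gain of 2.4 to 3.8 dB corresponds to cutting the MSE by a factor of roughly 1.7 to 2.4 at a fixed data range. The metric says nothing about whether a lesion remains detectable, which is the gap the task-based assessment is meant to expose.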

[CV-132] Unpaired Modality Translation for Pseudo Labeling of Histology Images

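Quick read: This paper addresses the lack of annotated data for histological image segmentation. The key to the solution is a microscopy pseudo labeling pipeline built on unsupervised image translation. Specifically, the method generates pseudo labels by translating images between labeled and unlabeled domains, without requiring prior annotation in the target domain. The paper evaluates two pseudo labeling strategies across three image domains increasingly dissimilar from the labeled data. In particular, translating the labeled dataset (TEM) to the target modality (SEM) to create synthetic data, and training a segmentation model on that data, achieves a mean Dice score of 0.736 ± 0.005 on a SEM dataset. The approach aims to accelerate annotation by providing high-quality pseudo labels as a starting point for manual refinement.

Link: https://arxiv.org/abs/2412.02858
Authors: Arthur Boschet,Armand Collin,Nishka Katoch,Julien Cohen-Adad
Keywords: annotated data presents, biomedical applications, significant challenge, lack of annotated, presents a significant
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: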

Abstract:The segmentation of histological images is critical for various biomedical applications, yet the lack of annotated data presents a significant challenge. We propose a microscopy pseudo labeling pipeline utilizing unsupervised image translation to address this issue. Our method generates pseudo labels by translating between labeled and unlabeled domains without requiring prior annotation in the target domain. We evaluate two pseudo labeling strategies across three image domains increasingly dissimilar from the labeled data, demonstrating their effectiveness. Notably, our method achieves a mean Dice score of 0.736 ± 0.005 on a SEM dataset using the tutoring path, which involves training a segmentation model on synthetic data created by translating the labeled dataset (TEM) to the target modality (SEM). This approach aims to accelerate the annotation process by providing high-quality pseudo labels as a starting point for manual refinement.
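
The Dice score used to report segmentation quality can be computed as in the sketch below, which assumes binary masks flattened to lists. The function name and the convention of returning 1.0 for two empty masks are choices made here, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks.

    Masks are equal-length sequences of truthy/falsy values. Two empty
    masks are treated as a perfect match (assumed convention).
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction that overlaps the ground truth on one of two foreground pixels on each side scores 0.5.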

Artificial Intelligence

[AI-0] The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control

Link: https://arxiv.org/abs/2412.03568
Authors: Ruili Feng,Han Zhang,Zhantao Yang,Jie Xiao,Zhilei Shu,Zhiheng Liu,Andy Zheng,Yukun Huang,Yu Liu,Hongyang Zhang
Keywords: high-fidelity real-scene video, enabling immersive exploration, real-scene video streams, richly dynamic environments, realistic world simulator
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p high-fidelity real-scene video streams with real-time, responsive control in both first- and third-person perspectives, enabling immersive exploration of richly dynamic environments. Trained on limited supervised data from AAA games like Forza Horizon 5 and Cyberpunk 2077, complemented by large-scale unsupervised footage from real-world settings like Tokyo streets, The Matrix allows users to traverse diverse terrains – deserts, grasslands, water bodies, and urban landscapes – in continuous, uncut hour-long sequences. Operating at 16 FPS, the system supports real-time interactivity and demonstrates zero-shot generalization, translating virtual game environments to real-world contexts where collecting continuous movement data is often infeasible. For example, The Matrix can simulate a BMW X3 driving through an office setting–an environment present in neither gaming data nor real-world sources. This approach showcases the potential of AAA game data to advance robust world models, bridging the gap between simulations and real-world applications in scenarios with limited data.

[AI-1] NODE-AdvGAN: Improving the transferability and perceptual similarity of adversarial examples by dynamic-system-driven adversarial generative model

Link: https://arxiv.org/abs/2412.03539
Authors: Xinheng Xie,Yue Wu,Cuiyu He
Keywords: introduce imperceptible perturbations, Understanding adversarial, Ordinary Differential Equation, Neural Ordinary Differential, crucial for improving
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding adversarial examples is crucial for improving the model’s robustness, as they introduce imperceptible perturbations that deceive models. Effective adversarial examples, therefore, offer the potential to train more robust models by removing their singularities. We propose NODE-AdvGAN, a novel approach that treats adversarial generation as a continuous process and employs a Neural Ordinary Differential Equation (NODE) for simulating the dynamics of the generator. By mimicking the iterative nature of traditional gradient-based methods, NODE-AdvGAN generates smoother and more precise perturbations that preserve high perceptual similarity when added to benign images. We also propose a new training strategy, NODE-AdvGAN-T, which enhances transferability in black-box attacks by effectively tuning noise parameters during training. Experiments demonstrate that NODE-AdvGAN and NODE-AdvGAN-T generate more effective adversarial examples that achieve higher attack success rates while preserving better perceptual quality than traditional GAN-based methods.

[AI-2] You're (Not) My Type – Can LLMs Generate Feedback of Specific Types for Introductory Programming Tasks?

Link: https://arxiv.org/abs/2412.03516
Authors: Dominic Lohr,Hieke Keuning,Natalie Kiesler
Keywords: Large Language Models, Feedback, influential factors, great body, Background
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at Journal of Computer Assisted Learning (2024)

Abstract:Background: Feedback as one of the most influential factors for learning has been subject to a great body of research. It plays a key role in the development of educational technology systems and is traditionally rooted in deterministic feedback defined by experts and their experience. However, with the rise of generative AI and especially Large Language Models (LLMs), we expect feedback as part of learning systems to transform, especially for the context of programming. In the past, it was challenging to automate feedback for learners of programming. LLMs may create new possibilities to provide richer, and more individual feedback than ever before. Objectives: This paper aims to generate specific types of feedback for introductory programming tasks using LLMs. We revisit existing feedback taxonomies to capture the specifics of the generated feedback, such as randomness, uncertainty, and degrees of variation. Methods: We iteratively designed prompts for the generation of specific feedback types (as part of existing feedback taxonomies) in response to authentic student programs. We then evaluated the generated output and determined to what extent it reflected certain feedback types. Results and Conclusion: The present work provides a better understanding of different feedback dimensions and characteristics. The results have implications for future feedback research with regard to, for example, feedback effects and learners’ informational needs. It further provides a basis for the development of new tools and learning systems for novice programmers including feedback generated by AI. 

[AI-3] Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective

Link: https://arxiv.org/abs/2412.03487
Authors: Neta Shaul,Itai Gat,Marton Havasi,Daniel Severo,Anuroop Sriram,Peter Holderrieth,Brian Karrer,Yaron Lipman,Ricky T. Q. Chen
Keywords: flow generative models, simple masked construction, generative models based, discrete generative models, generative models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The design space of discrete-space diffusion or flow generative models is significantly less well understood than that of their continuous-space counterparts, with many works focusing only on a simple masked construction. In this work, we aim to take a holistic approach to the construction of discrete generative models based on continuous-time Markov chains, and for the first time, allow the use of arbitrary discrete probability paths, or colloquially, corruption processes. Through the lens of optimizing the symmetric kinetic energy, we propose velocity formulas that can be applied to any given probability path, completely decoupling the probability and velocity, and giving the user the freedom to specify any desirable probability path based on expert knowledge specific to the data domain. Furthermore, we find that a special construction of mixture probability paths optimizes the symmetric kinetic energy for the discrete case. We empirically validate the usefulness of this new design space across multiple modalities: text generation, inorganic material generation, and image generation. We find that we can outperform the mask construction even in text with kinetic-optimal mixture paths, while we can make use of domain-specific constructions of the probability path over the visual domain.
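
A mixture probability path can be pictured as a per-token coin flip between a source sequence and the data sequence. The sketch below is an illustration under stated assumptions: the linear scheduler `kappa(t) = t` and all names are hypothetical, and the kinetic-optimal scheduler the paper derives is not given in the abstract.

```python
import random


def sample_mixture_path(x0, x1, t, kappa=lambda t: t, rng=random):
    """Sample x_t from a token-wise mixture path between x0 and x1.

    Each position independently takes the data token from x1 with
    probability kappa(t) and otherwise keeps the source token from x0.
    `kappa` must increase from 0 at t=0 to 1 at t=1; the linear default
    is an assumption made for this example.
    """
    k = kappa(t)
    return [b if rng.random() < k else a for a, b in zip(x0, x1)]
```

Choosing `x0` as a sequence of mask tokens recovers the simple masked construction that the abstract contrasts against; other source distributions give other corruption processes.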

[AI-4] From Words to Workflows: Automating Business Processes

Link: https://arxiv.org/abs/2412.03446
Authors: Laura Minkova,Jessica López Espejel,Taki Eddine Toufik Djaidja,Walid Dahhane,El Hassane Ettifouri
Keywords: businesses increasingly rely, Robotic Process Automation, limitations of Robotic, Large Language Models, Robotic Process
Categories: Artificial Intelligence (cs.AI)
Comments: Under review at Elsevier’s Engineering Applications of Artificial Intelligence

Abstract:As businesses increasingly rely on automation to streamline operations, the limitations of Robotic Process Automation (RPA) have become apparent, particularly its dependence on expert knowledge and inability to handle complex decision-making tasks. Recent advancements in Artificial Intelligence (AI), particularly Generative AI (GenAI) and Large Language Models (LLMs), have paved the way for Intelligent Automation (IA), which integrates cognitive capabilities to overcome the shortcomings of RPA. This paper introduces Text2Workflow, a novel method that automatically generates workflows from natural language user requests. Unlike traditional automation approaches, Text2Workflow offers a generalized solution for automating any business process, translating user inputs into a sequence of executable steps represented in JavaScript Object Notation (JSON) format. Leveraging the decision-making and instruction-following capabilities of LLMs, this method provides a scalable, adaptable framework that enables users to visualize and execute workflows with minimal manual intervention. This research outlines the Text2Workflow methodology and its broader implications for automating complex business processes.
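
To give a feel for what "a sequence of executable steps represented in JSON" might look like, here is a small sketch. The schema (`steps`, `action`, `args`), the action names, and the executor are all hypothetical; the abstract does not specify Text2Workflow's actual format.

```python
import json

# Hypothetical workflow, as an LLM might emit it for the request
# "fetch invoice INV-001 and email it to finance".
WORKFLOW_JSON = """
{
  "steps": [
    {"action": "fetch_invoice", "args": {"invoice_id": "INV-001"}},
    {"action": "send_email", "args": {"to": "finance@example.com"}}
  ]
}
"""

# Registry mapping action names to callables (illustrative stand-ins).
ACTIONS = {
    "fetch_invoice": lambda invoice_id: f"fetched {invoice_id}",
    "send_email": lambda to: f"emailed {to}",
}


def run_workflow(workflow_json, actions):
    """Parse a JSON workflow and run its steps in order, collecting results."""
    workflow = json.loads(workflow_json)
    log = []
    for step in workflow["steps"]:
        handler = actions[step["action"]]  # raises KeyError on unknown actions
        log.append(handler(**step["args"]))
    return log
```

Keeping the model's output as plain data and dispatching through a fixed registry means an unknown action fails loudly instead of executing arbitrary generated code.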

[AI-5] PBP: Post-training Backdoor Purification for Malware Classifiers NDSS2025

Link: https://arxiv.org/abs/2412.03441
Authors: Dung Thuy Nguyen,Ngoc N. Tran,Taylor T. Johnson,Kevin Leach
Keywords: recent years, brought new challenges, including the increasing, backdoor poisoning attacks, training data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at NDSS 2025

Abstract:In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the increasing threat of backdoor poisoning attacks on ML malware classifiers. For instance, adversaries could inject malicious samples into public malware repositories, contaminating the training data and potentially causing the ML model to misclassify malware. Current countermeasures predominantly focus on detecting poisoned samples by leveraging disagreements within the outputs of a diverse set of ensemble models on training data points. However, these methods are not suitable for scenarios where Machine Learning-as-a-Service (MLaaS) is used or when users aim to remove backdoors from a model after it has been trained. Addressing this scenario, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoor embeddings without assuming any specific backdoor embedding mechanism. Our method exploits the influence of backdoor attacks on the activation distribution of neural networks, independent of the trigger-embedding method. In the presence of a backdoor attack, the activation distribution of each layer is distorted into a mixture of distributions. By regulating the statistics of the batch normalization layers, we can guide a backdoored model to perform similarly to a clean one. Our method demonstrates substantial advantages over several state-of-the-art methods, as evidenced by experiments on two datasets, two types of backdoor methods, and various attack configurations. Notably, our approach requires only a small portion of the training data – only 1% – to purify the backdoor and reduce the attack success rate from 100% to almost 0%, a 100-fold improvement over the baseline methods. Our code is available at this https URL.

[AI-6] BIMCaP: BIM-based AI-supported LiDAR-Camera Pose Refinement

Link: https://arxiv.org/abs/2412.03434
Authors: Miguel Arturo Vega Torres,Anna Ribic,Borja García de Soto,André Borrmann
Keywords: pre-existing building information, building information models, sparse LiDAR data, paper introduces BIMCaP, accurate indoor mapping
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 24 figures, Conference: EG-ICE: 31st International Workshop on Intelligent Computing in Engineering

Abstract:This paper introduces BIMCaP, a novel method to integrate mobile 3D sparse LiDAR data and camera measurements with pre-existing building information models (BIMs), enhancing fast and accurate indoor mapping with affordable sensors. BIMCaP refines sensor poses by leveraging a 3D BIM and employing a bundle adjustment technique to align real-world measurements with the model. Experiments using real-world open-access data show that BIMCaP achieves superior accuracy, reducing translational error by over 4 cm compared to current state-of-the-art methods. This advancement enhances the accuracy and cost-effectiveness of 3D mapping methodologies like SLAM. BIMCaP’s improvements benefit various fields, including construction site management and emergency response, by providing up-to-date, aligned digital maps for better decision-making and productivity. Link to the repository: this https URL

[AI-7] Genetic Algorithm Based System for Path Planning with Unmanned Aerial Vehicles Swarms in Cell-Grid Environments

Link: https://arxiv.org/abs/2412.03433
Authors: Alejandro Puente-Castro,Enrique Fernandez-Blanco,Daniel Rivero
Keywords: unmanned aerial vehicles, gaining momentum due, autonomously controlling swarms, Path Planning, aerial vehicles
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Path Planning methods for autonomously controlling swarms of unmanned aerial vehicles (UAVs) are gaining momentum due to their operational advantages. An increasing number of scenarios now require autonomous control of multiple UAVs, as autonomous operation can significantly reduce labor costs. Additionally, obtaining optimal flight paths can lower energy consumption, thereby extending battery life for other critical operations. Many of these scenarios, however, involve obstacles such as power lines and trees, which complicate Path Planning. This paper presents an evolutionary computation-based system employing genetic algorithms to address this problem in environments with obstacles. The proposed approach aims to ensure complete coverage of areas with fixed obstacles, such as in field exploration tasks, while minimizing flight time regardless of map size or the number of UAVs in the swarm. No specific goal points or prior information beyond the provided map is required. The experiments conducted in this study used five maps of varying sizes and obstacle densities, as well as a control map without obstacles, with different numbers of UAVs. The results demonstrate that this method can determine optimal paths for all UAVs during full map traversal, thus minimizing resource consumption. A comparative analysis with other state-of-the-art approaches is presented to highlight the advantages and potential limitations of the proposed method.

[AI-8] Tango*: Constrained synthesis planning using chemically informed value functions

Link: https://arxiv.org/abs/2412.03424
Authors: Daniel Armstrong,Zlatko Joncev,Jeff Guo,Philippe Schwaller
Keywords: made significant strides, Computer-aided synthesis planning, generating retrosynthetic pathways, Computer-aided synthesis, non-constrained fashion
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)

Abstract:Computer-aided synthesis planning (CASP) has made significant strides in generating retrosynthetic pathways for simple molecules in a non-constrained fashion. Recent work introduces a specialised bidirectional search algorithm with forward and retro expansion to address the starting material-constrained synthesis problem, allowing CASP systems to provide synthesis pathways from specified starting materials, such as waste products or renewable feedstocks. In this work, we introduce a simple guided search which allows solving the starting material-constrained synthesis planning problem using an existing, uni-directional search algorithm, Retro*. We show that by optimising a single hyperparameter, Tango* outperforms existing methods in terms of efficiency and solve rate. We find the Tango* cost function catalyses strong improvements for the bidirectional DESP methods. Our method also achieves lower wall clock times while proposing synthetic routes of similar length, a common metric for route quality. Finally, we highlight potential reasons for the strong performance of Tango* over neural guided search methods.

[AI-9] Automated Test-Case Generation for REST APIs Using Model Inference Search Heuristic

Link: https://arxiv.org/abs/2412.03420
Authors: Clinton Cao,Annibale Panichella,Sicco Verwer
Keywords: testing approaches tailored, microservice architectural style, test case, rising popularity, architectural style
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:The rising popularity of the microservice architectural style has led to a growing demand for automated testing approaches tailored to these systems. EvoMaster is a state-of-the-art tool that uses Evolutionary Algorithms (EAs) to automatically generate test cases for microservices’ REST APIs. One limitation of these EAs is the use of unit-level search heuristics, such as branch distances, which focus on fine-grained code coverage and may not effectively capture the complex, interconnected behaviors characteristic of system-level testing. To address this limitation, we propose a new search heuristic (MISH) that uses real-time automaton learning to guide the test case generation process. We capture the sequential call patterns exhibited by a test case by learning an automaton from the stream of log events outputted by different microservices within the same system. Therefore, MISH learns a representation of the systemwide behavior, allowing us to define the fitness of a test case based on the path it traverses within the inferred automaton. We empirically evaluate MISH’s effectiveness on six real-world benchmark microservice applications and compare it against a state-of-the-art technique, MOSA, for testing REST APIs. Our evaluation shows promising results for using MISH to guide the automated test case generation within EvoMaster.

[AI-10] Learning Semantic Association Rules from Internet of Things Data

Link: https://arxiv.org/abs/2412.03417
Authors: Erkan Karabulut,Paul Groth,Victoria Degeler
Keywords: Association Rule Mining, Rule Mining, logical implications, Internet of Things, discovering commonalities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Association Rule Mining (ARM) is the task of discovering commonalities in data in the form of logical implications. ARM is used in the Internet of Things (IoT) for different tasks including monitoring and decision-making. However, existing methods give limited consideration to IoT-specific requirements such as heterogeneity and volume. Furthermore, they do not utilize important static domain-specific description data about IoT systems, which is increasingly represented as knowledge graphs. In this paper, we propose a novel ARM pipeline for IoT data that utilizes both dynamic sensor data and static IoT system metadata. Furthermore, we propose an Autoencoder-based Neurosymbolic ARM method (Aerial) as part of the pipeline to address the high volume of IoT data and reduce the total number of rules that are resource-intensive to process. Aerial learns a neural representation of a given data and extracts association rules from this representation by exploiting the reconstruction (decoding) mechanism of an autoencoder. Extensive evaluations on 3 IoT datasets from 2 domains show that ARM on both static and dynamic IoT data results in more generically applicable rules while Aerial can learn a more concise set of high-quality association rules than the state-of-the-art with full coverage over the datasets.
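
For readers unfamiliar with association rules, the sketch below computes the two classical quality measures of a rule X → Y, support and confidence, over a toy set of transactions. It only illustrates the ARM task named in the abstract and is not the Aerial method; the item names (`temp=high`, `fan=on`, and so on) are made up for illustration.

```python
from typing import FrozenSet, List


def support(transactions: List[FrozenSet[str]], items: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> float:
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(transactions, antecedent | consequent)
    return joint / support(transactions, antecedent)


# Hypothetical discretized IoT readings, one set of items per time window.
transactions = [
    frozenset({"temp=high", "fan=on"}),
    frozenset({"temp=high", "fan=on", "door=open"}),
    frozenset({"temp=low", "fan=off"}),
    frozenset({"temp=high", "fan=off"}),
]

# Rule under test: {temp=high} -> {fan=on}
rule_support = support(transactions, frozenset({"temp=high", "fan=on"}))
rule_confidence = confidence(
    transactions, frozenset({"temp=high"}), frozenset({"fan=on"})
)
```

Here the rule holds in two of four windows (support 0.5) and in two of the three windows where the temperature is high (confidence 2/3).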

[AI-11] Enhancing Supply Chain Visibility with Generative AI: An Exploratory Case Study on Relationship Prediction in Knowledge Graphs

Link: https://arxiv.org/abs/2412.03390
Authors: Ge Zheng,Alexandra Brintrup
Keywords: key stumbling block, interdependent supply network, supply chain, effective supply chain, key stumbling
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:A key stumbling block in effective supply chain risk management for companies and policymakers is a lack of visibility on interdependent supply network relationships. Relationship prediction, also called link prediction, is an emergent area of supply chain surveillance research that aims to increase the visibility of supply chains using data-driven techniques. Existing methods have been successful at predicting relationships but struggle to extract the context in which these relationships are embedded, such as the products being supplied or the locations they are supplied from. Lack of context prevents practitioners from distinguishing transactional relations from established supply chain relations, hindering accurate estimations of risk. In this work, we develop a new Generative Artificial Intelligence (GenAI) enhanced machine learning framework that leverages pre-trained language models as embedding models combined with machine learning models to predict supply chain relationships within knowledge graphs. By integrating Generative AI techniques, our approach captures the nuanced semantic relationships between entities, thereby improving supply chain visibility and facilitating more precise risk management. Using data from a real case study, we show that GenAI-enhanced link prediction surpasses all benchmarks, and demonstrate how GenAI models can be explored and effectively used in supply chain risk management.

[AI-12] WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis

Link: https://arxiv.org/abs/2412.03359
Authors: Chengwei Hu,Jianhui Zheng,Yancheng He,Hangyu Guo,Junguang Jiang,Han Zhu,Kai Sun,Yuning Jiang,Wenbo Su,Bo Zheng
Keywords: autonomous multi-agent systems, handle complex tasks, Recent advancements, LLM-based MAS, large language models
Categories: Artificial Intelligence (cs.AI)

Abstract:Recent advancements in autonomous multi-agent systems (MAS) based on large language models (LLMs) have broadened the application scenarios and improved the capability of LLMs to handle complex tasks. Despite their demonstrated effectiveness, existing studies still struggle with the evaluation, analysis, and reproducibility of LLM-based MAS. In this paper, to facilitate research on LLM-based MAS, we introduce an open, scalable, and real-time updated platform for accessing and analyzing LLM-based MAS based on the game “Who is Spy?” (WiS). Our platform has three main features: (1) a unified model evaluation interface that supports models available on Hugging Face; (2) a real-time updated leaderboard for model evaluation; (3) a comprehensive evaluation covering game-winning rates, attacking and defense strategies, and the reasoning of LLMs. To rigorously test WiS, we conduct extensive experiments covering various open- and closed-source LLMs and find that different agents exhibit distinct and intriguing behaviors in the game. The experimental results demonstrate the effectiveness and efficiency of our platform in evaluating LLM-based MAS. Our platform and its documentation are publicly available at this https URL.

[AI-13] AI-Driven Day-to-Day Route Choice

Link: https://arxiv.org/abs/2412.03338
Authors: Leizhen Wang,Peibo Duan,Zhengbing He,Cheng Lyu,Xin Chen,Nan Zheng,Li Yao,Zhenliang Ma
Keywords: Understanding travelers’ route, policymakers devise optimal, devise optimal operational, Understanding travelers’, abnormal circumstances
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Understanding travelers’ route choices can help policymakers devise optimal operational and planning strategies for both normal and abnormal circumstances. However, existing choice modeling methods often rely on predefined assumptions and struggle to capture the dynamic and adaptive nature of travel behavior. Recently, Large Language Models (LLMs) have emerged as a promising alternative, demonstrating remarkable ability to replicate human-like behaviors across various fields. Despite this potential, their capacity to accurately simulate human route choice behavior in transportation contexts remains doubtful. To address this question, this paper investigates the potential of LLMs for route choice modeling by introducing an LLM-empowered agent, “LLMTraveler.” This agent integrates an LLM as its core, equipped with a memory system that learns from past experiences and makes decisions by balancing retrieved data and personality traits. The study systematically evaluates the LLMTraveler’s ability to replicate human-like decision-making through two stages: (1) analyzing its route-switching behavior in single origin-destination (OD) pair congestion game scenarios, where it demonstrates patterns that align with laboratory data but are not fully explained by traditional models, and (2) testing its capacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar and Willumsen (OW) network, producing results comparable to Multinomial Logit (MNL) and Reinforcement Learning (RL) models. These experiments demonstrate that the framework can partially replicate human-like decision-making in route choice while providing natural language explanations for its decisions. This capability offers valuable insights for transportation policymaking, such as simulating traveler responses to new policies or changes in the network.
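
As background on the MNL baseline mentioned in the abstract, the sketch below computes multinomial logit choice probabilities for a set of routes. It is a generic textbook form, not the paper's configuration: the utility specification (negative travel time scaled by a sensitivity parameter `theta`) and the travel times are assumptions chosen for illustration.

```python
import math
from typing import List


def mnl_probabilities(travel_times: List[float], theta: float = 0.1) -> List[float]:
    """Multinomial logit choice probabilities with utility = -theta * travel_time."""
    utilities = [-theta * t for t in travel_times]
    # Subtract the maximum utility before exponentiating for numerical stability.
    u_max = max(utilities)
    weights = [math.exp(u - u_max) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical travel times (minutes) on three routes of a single OD pair.
probs = mnl_probabilities([20.0, 25.0, 30.0], theta=0.1)
```

Faster routes receive higher probabilities, and a larger `theta` makes the choice more sharply concentrated on the fastest route.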

[AI-14] Path-Guided Particle-based Sampling

Link: https://arxiv.org/abs/2412.03312
Authors: Mingzhou Fan,Ruida Zhou,Chao Tian,Xiaoning Qian
Keywords: Stein variational gradient, variational gradient descent, attracted significant attention, Stein variational, target distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Particle-based Bayesian inference methods that sample from a partition-free target (posterior) distribution, e.g., Stein variational gradient descent (SVGD), have attracted significant attention. We propose a path-guided particle-based sampling (PGPS) method based on a novel Log-weighted Shrinkage (LwS) density path linking an initial distribution to the target distribution. We propose to utilize a neural network to learn a vector field motivated by the Fokker-Planck equation of the designed density path. Particles, initiated from the initial distribution, evolve according to the ordinary differential equation defined by the vector field. The distribution of these particles is guided along a density path from the initial distribution to the target distribution. The proposed LwS density path allows for an efficient search of modes of the target distribution while canonical methods fail. We theoretically analyze the Wasserstein distance between the distribution of the PGPS-generated samples and the target distribution due to approximation and discretization errors. Practically, the proposed PGPS-LwS method demonstrates higher Bayesian inference accuracy and better calibration ability in experiments conducted on both synthetic and real-world Bayesian learning tasks, compared to baselines such as SVGD and Langevin dynamics.

[AI-15] Contextual Data Integration for Bike-sharing Demand Prediction with Graph Neural Networks in Degraded Weather Conditions

Link: https://arxiv.org/abs/2412.03307
Authors: Romain Rochas(LICIT-Eco7, ENTPE),Angelo Furno(LICIT-Eco7, ENTPE),Nour-Eddin El Faouzi(LICIT-Eco7, ENTPE)
Keywords: bike sharing, sharing is impacted, weather, weather conditions, degraded weather conditions
Categories: Artificial Intelligence (cs.AI)

Abstract:Demand for bike sharing is impacted by various factors, such as weather conditions, events, and the availability of other transportation modes. This impact remains elusive due to the complex interdependence of these factors or location-related variations in user behavior. It is also not clear which factors provide additional information that is not already contained in the historical demand. Intermodal dependencies between bike-sharing and other modes are also underexplored, and the value of this information has not been studied in degraded situations. The proposed study analyzes the impact of adding contextual data, such as weather, time embedding, and road traffic flow, to predict bike-sharing Origin-Destination (OD) flows in atypical weather situations. Our study highlights a mild relationship between prediction quality of bike-sharing demand and road traffic flow, while the introduced time embedding allows outperforming state-of-the-art results, particularly in the case of degraded weather conditions. Including weather data as an additional input further improves our model with respect to the basic ST-ED-RMGC prediction model, reducing the prediction error by more than 20% in degraded weather conditions.

[AI-16] Integrating Generative AI into Art Therapy: A Technical Showcase

Link: https://arxiv.org/abs/2412.03287
Authors: Yannis Valentin Schmutz,Tetiana Kravchenko,Souhir Ben Souissi,Mascha Kurpicz-Briki
Keywords: complement art therapy, art therapy, paper explores, therapy, field of art
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper explores the integration of generative AI into the field of art therapy. Leveraging proven text-to-image models, we introduce a novel technical design to complement art therapy. The resulting AI-based tools shall enable patients to refine and customize their creative work, opening up new avenues of expression and accessibility. Using three illustrative examples, we demonstrate potential outputs of our solution and evaluate them qualitatively. Furthermore, we discuss the current limitations and ethical considerations associated with this integration and provide an outlook into future research efforts. Our implementations are publicly available at this https URL.

[AI-17] Detecting abnormal heart sound using mobile phones and on-device IConNet

Link: https://arxiv.org/abs/2412.03267
Authors: Linh Vu,Thu Tran
Keywords: easily accessible early, accessible early screening, cardiovascular diseases, global prevalence, prevalence of cardiovascular
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: N2Women’24 Workshop, MobiSys 2024, Tokyo, Japan

Abstract:Given the global prevalence of cardiovascular diseases, there is a pressing need for easily accessible early screening methods. Typically, this requires medical practitioners to investigate heart auscultations for irregular sounds, followed by echocardiography and electrocardiography tests. To democratize early diagnosis, we present a user-friendly solution for abnormal heart sound detection, utilizing mobile phones and a lightweight neural network optimized for on-device inference. Unlike previous approaches reliant on specialized stethoscopes, our method directly analyzes audio recordings, facilitated by a novel architecture known as IConNet. IConNet, an Interpretable Convolutional Neural Network, harnesses insights from audio signal processing, enhancing efficiency and providing transparency in neural pattern extraction from raw waveform signals. This is a significant step towards trustworthy AI in healthcare, aiding in remote health monitoring efforts.

[AI-18] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Link: https://arxiv.org/abs/2412.03213
Authors: Guangda Liu,Chengwei Li,Jieru Zhao,Chenqi Zhang,Minyi Guo
Keywords: Large Language Models, Large Language, complex logical reasoning, Language Models, variety of applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)

Abstract:Large Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2× speedup in latency and a 2.5× improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.

[AI-19] Semi-decentralized Training of Spatio-Temporal Graph Neural Networks for Traffic Prediction

Link: https://arxiv.org/abs/2412.03188
Authors: Ivan Kralj,Lodovico Giaretta,Gordan Ježić,Ivana Podnar Žarko,Šarūnas Girdzijauskas
Keywords: avoid major disruptions, produce vast amounts, sensors produce vast, Graph Neural Networks, major disruptions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 4 figures, 3 tables, conference

Abstract:In smart mobility, large networks of geographically distributed sensors produce vast amounts of high-frequency spatio-temporal data that must be processed in real time to avoid major disruptions. Traditional centralized approaches are increasingly unsuitable for this task, as they struggle to scale with expanding sensor networks, and reliability issues in central components can easily affect the whole deployment. To address these challenges, we explore and adapt semi-decentralized training techniques for Spatio-Temporal Graph Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation framework where sensors are grouped by proximity into multiple cloudlets, each handling a subgraph of the traffic graph, fetching node features from other cloudlets to train its own local ST-GNN model, and exchanging model updates with other cloudlets to ensure consistency, enhancing scalability and removing reliance on a centralized aggregator. We perform an extensive comparative evaluation of four different ST-GNN training setups (centralized, traditional FL, server-free FL, and Gossip Learning) on two large-scale traffic datasets, METR-LA and PeMS-BAY, for short-, mid-, and long-term vehicle speed predictions. Experimental results show that semi-decentralized setups are comparable to centralized approaches in performance metrics, while offering advantages in terms of scalability and fault tolerance. In addition, we highlight often overlooked issues in existing literature for distributed ST-GNNs, such as the variation in model performance across different geographical areas due to region-specific traffic patterns, and the significant communication overhead and computational costs that arise from the large receptive field of GNNs, leading to substantial data transfers and increased computation of partial embeddings.
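
To make the idea of exchanging model updates without a central aggregator concrete, the sketch below runs one synchronous round in which every cloudlet averages its parameters with those of its neighbors. This is a minimal illustration under assumed simplifications, not the paper's framework: models are reduced to short lists of floats, the cloudlet names and line topology are hypothetical, and plain neighbor averaging stands in for the actual server-free FL and Gossip Learning protocols.

```python
from typing import Dict, List

Params = List[float]


def average_params(models: List[Params]) -> Params:
    """Element-wise mean of several parameter vectors of equal length."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


def server_free_round(
    params: Dict[str, Params], neighbors: Dict[str, List[str]]
) -> Dict[str, Params]:
    """One synchronous round: each cloudlet averages with its neighbors only."""
    return {
        name: average_params([params[name]] + [params[n] for n in neighbors[name]])
        for name in params
    }


# Hypothetical cloudlets in a line topology A - B - C, two parameters each.
params = {"A": [0.0, 0.0], "B": [3.0, 6.0], "C": [6.0, 12.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
params_after = server_free_round(params, neighbors)
```

Repeating such rounds pulls the local models toward a common value without any node seeing all of them, which is the property that removes the single point of failure.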

[AI-20] Physics-Informed Deep Inverse Operator Networks for Solving PDE Inverse Problems

Link: https://arxiv.org/abs/2412.03161
Authors: Sung Woong Cho,Hwijae Son
Keywords: partial differential equations, involving partial differential, problems involving partial, Inverse problems involving, operator learning approach
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)

Abstract:Inverse problems involving partial differential equations (PDEs) can be seen as discovering a mapping from measurement data to unknown quantities, often framed within an operator learning approach. However, existing methods typically rely on large amounts of labeled training data, which is impractical for most real-world applications. Moreover, these supervised models may fail to capture the underlying physical principles accurately. To address these limitations, we propose a novel architecture called Physics-Informed Deep Inverse Operator Networks (PI-DIONs), which can learn the solution operator of PDE-based inverse problems without labeled training data. We extend the stability estimates established in the inverse problem literature to the operator learning framework, thereby providing a robust theoretical foundation for our method. These estimates guarantee that the proposed model, trained on a finite sample and grid, generalizes effectively across the entire domain and function space. Extensive experiments are conducted to demonstrate that PI-DIONs can effectively and accurately learn the solution operators of the inverse problems without the need for labeled data.

[AI-21] Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples

Link: https://arxiv.org/abs/2412.03154
Authors: Xingjian Zhou,Hongji Xu,Andy Xu,Zhouxing Shi,Cho-Jui Hsieh,Huan Zhang
Keywords: recent years, developed to formally, benchmark, formally verify, verifiers
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Preprint

Abstract:In recent years, many neural network (NN) verifiers have been developed to formally verify certain properties of neural networks such as robustness. Although many benchmarks have been constructed to evaluate the performance of NN verifiers, they typically lack a ground-truth for hard instances where no current verifier can verify and no counterexample can be found, which makes it difficult to check the soundness of a new verifier if it claims to verify hard instances which no other verifier can do. We propose to develop a soundness benchmark for NN verification. Our benchmark contains instances with deliberately inserted counterexamples while we also try to hide the counterexamples from regular adversarial attacks which can be used for finding counterexamples. We design a training method to produce neural networks with such hidden counterexamples. Our benchmark aims to be used for testing the soundness of NN verifiers and identifying falsely claimed verifiability when it is known that hidden counterexamples exist. We systematically construct our benchmark and generate instances across diverse model architectures, activation functions, input sizes, and perturbation radii. We demonstrate that our benchmark successfully identifies bugs in state-of-the-art NN verifiers, as well as synthetic bugs, providing a crucial step toward enhancing the reliability of testing NN verifiers. Our code is available at this https URL and our benchmark is available at this https URL.

[AI-22] Large Language Models show both individual and collective creativity comparable to humans

Link: https://arxiv.org/abs/2412.03151
Authors: Luning Sun,Yuzhuo Yuan,Yuan Yao,Yanyan Li,Hao Zhang,Xing Xie,Xiting Wang,Fang Luo,David Stillwell
Keywords: Large Language Models, Language Models, Large Language, largely automated routine, Artificial intelligence
Categories: Artificial Intelligence (cs.AI)

Abstract:Artificial intelligence has, so far, largely automated routine tasks, but what does it mean for the future of work if Large Language Models (LLMs) show creativity comparable to humans? To measure the creativity of LLMs holistically, the current study uses 13 creative tasks spanning three domains. We benchmark the LLMs against individual humans, and also take a novel approach by comparing them to the collective creativity of groups of humans. We find that the best LLMs (Claude and GPT-4) rank in the 52nd percentile against humans, and overall LLMs excel in divergent thinking and problem solving but lag in creative writing. When questioned 10 times, an LLM’s collective creativity is equivalent to 8-10 humans. When more responses are requested, two additional responses of LLMs equal one extra human. Ultimately, LLMs, when optimally applied, may compete with a small group of humans in the future of work.

[AI-23] Robust Multi-bit Text Watermark with LLM-based Paraphrasers

Link: https://arxiv.org/abs/2412.03123
Authors: Xiaojun Xu,Jinghan Jia,Yuanshun Yao,Yang Liu,Hang Li
Keywords: propose an imperceptible, imperceptible multi-bit text, text watermark embedded, imperceptible multi-bit, LLM paraphrasers
Categories: Artificial Intelligence (cs.AI)

Abstract:We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference reflected in the text semantics can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode the pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation. We open-source the code: this https URL.
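
The sentence-level encoding described in the abstract can be sketched as follows: bit i of the code selects which of two paraphrasers rewrites sentence i, and a classifier recovers one bit per sentence. Everything concrete here is a toy assumption and not the paper's system: `paraphraser_0`, `paraphraser_1`, and `toy_classifier` are hypothetical stand-ins (lowercasing and uppercasing instead of fine-tuned LLMs), so this watermark is deliberately visible rather than imperceptible.

```python
from typing import Callable, List

Paraphraser = Callable[[str], str]


def embed_bits(
    sentences: List[str], bits: List[int], paraphrasers: List[Paraphraser]
) -> List[str]:
    """Rewrite sentence i with the paraphraser selected by bits[i % len(bits)]."""
    return [paraphrasers[bits[i % len(bits)]](s) for i, s in enumerate(sentences)]


def decode_bits(sentences: List[str], classifier: Callable[[str], int]) -> List[int]:
    """Recover one bit per sentence using the decoder (a text classifier)."""
    return [classifier(s) for s in sentences]


# Toy stand-ins for the two fine-tuned paraphrasers and the trained decoder.
def paraphraser_0(sentence: str) -> str:
    return sentence.lower()


def paraphraser_1(sentence: str) -> str:
    return sentence.upper()


def toy_classifier(sentence: str) -> int:
    return 1 if sentence.isupper() else 0


sentences = ["The sky is blue.", "Rain is expected.", "Bring an umbrella."]
bits = [1, 0, 1]
watermarked = embed_bits(sentences, bits, [paraphraser_0, paraphraser_1])
recovered = decode_bits(watermarked, toy_classifier)
```

With real paraphrasers the two rewrites would differ only in subtle style, which is why a trained classifier is needed as the decoder.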

[AI-24] Experience-driven discovery of planning strategies

Link: https://arxiv.org/abs/2412.03111
Authors: Ruiqi He,Falk Lieder
Keywords: limited cognitive resources, planning strategies, adaptive planning strategies, people can plan, plan efficiently
Categories: Artificial Intelligence (cs.AI)

Abstract:One explanation for how people can plan efficiently despite limited cognitive resources is that we possess a set of adaptive planning strategies and know when and how to use them. But how are these strategies acquired? While previous research has studied how individuals learn to choose among existing strategies, little is known about the process of forming new planning strategies. In this work, we propose that new planning strategies are discovered through metacognitive reinforcement learning. To test this, we designed a novel experiment to investigate the discovery of new planning strategies. We then present metacognitive reinforcement learning models and demonstrate their capability for strategy discovery as well as show that they provide a better explanation of human strategy discovery than alternative learning mechanisms. However, when fitted to human data, these models exhibit a slower discovery rate than humans, leaving room for improvement.

[AI-25] CredID: Credible Multi-Bit Watermark for Large Language Models Identification

Link: https://arxiv.org/abs/2412.03107
Authors: Haoyu Jiang,Xuhong Wang,Ping Yi,Shanzhe Lei,Yilun Lin
Keywords: Large Language Models, complex natural language, natural language processing, language processing tasks, Large Language
Categories: Artificial Intelligence (cs.AI)
Comments: v1

Abstract:Large Language Models (LLMs) are widely used in complex natural language processing tasks but raise privacy and security concerns due to the lack of identity recognition. This paper proposes a multi-party credible watermarking framework (CredID) involving a trusted third party (TTP) and multiple LLM vendors to address these issues. In the watermark embedding stage, vendors request a seed from the TTP to generate watermarked text without sending the user’s prompt. In the extraction stage, the TTP coordinates each vendor to extract and verify the watermark from the text. This provides a credible watermarking scheme while preserving vendor privacy. Furthermore, current watermarking algorithms struggle with text quality, information capacity, and robustness, making it challenging to meet the diverse identification needs of LLMs. Thus, we propose a novel multi-bit watermarking algorithm and an open-source toolkit to facilitate research. Experiments show our CredID enhances watermark credibility and efficiency without compromising text quality. Additionally, we successfully utilized this framework to achieve highly accurate identification among multiple LLM vendors.

[AI-26] ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning

链接: https://arxiv.org/abs/2412.03104
作者: Zhe Xie,Zeyan Li,Xiao He,Longlong Xu,Xidao Wen,Tieying Zhang,Jianjun Chen,Rui Shi,Dan Pei
关键词-EN: time series, series, time, time series understanding, Understanding time series
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 14 figures

Abstract:Understanding time series is crucial for its application in real-world scenarios. Recently, large language models (LLMs) have been increasingly applied to time series tasks, leveraging their strong language capabilities to enhance various applications. However, research on multimodal LLMs (MLLMs) for time series understanding and reasoning remains limited, primarily due to the scarcity of high-quality datasets that align time series with textual information. This paper introduces ChatTS, a novel MLLM designed for time series analysis. ChatTS treats time series as a modality, similar to how vision MLLMs process images, enabling it to perform both understanding and reasoning with time series. To address the scarcity of training data, we propose an attribute-based method for generating synthetic time series with detailed attribute descriptions. We further introduce Time Series Evol-Instruct, a novel approach that generates diverse time series QAs, enhancing the model’s reasoning capabilities. To the best of our knowledge, ChatTS is the first MLLM that takes multivariate time series as input, which is fine-tuned exclusively on synthetic datasets. We evaluate its performance using benchmark datasets with real-world data, including six alignment tasks and four reasoning tasks. Our results show that ChatTS significantly outperforms existing vision-based MLLMs (e.g., GPT-4o) and text/agent-based LLMs, achieving a 46.0% improvement in alignment tasks and a 25.8% improvement in reasoning tasks.

[AI-27] Coordinated Multi-Armed Bandits for Improved Spatial Reuse in Wi-Fi

链接: https://arxiv.org/abs/2412.03076
作者: Francesc Wilhelmi,Boris Bellalta,Szymon Szott,Katarzyna Kosek-Szott,Sergio Barrachina-Muñoz
关键词-EN: Multi-Access Point Coordination, Point Coordination, Artificial Intelligence, Intelligence and Machine, Multi-Access Point
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

Abstract:Multi-Access Point Coordination (MAPC) and Artificial Intelligence and Machine Learning (AI/ML) are expected to be key features in future Wi-Fi, such as the forthcoming IEEE 802.11bn (Wi-Fi 8) and beyond. In this paper, we explore a coordinated solution based on online learning to drive the optimization of Spatial Reuse (SR), a method that allows multiple devices to perform simultaneous transmissions by controlling interference through Packet Detect (PD) adjustment and transmit power control. In particular, we focus on a Multi-Agent Multi-Armed Bandit (MA-MAB) setting, where multiple decision-making agents concurrently configure SR parameters from coexisting networks by leveraging the MAPC framework, and study various algorithms and reward-sharing mechanisms. We evaluate different MA-MAB implementations using Komondor, a well-adopted Wi-Fi simulator, and demonstrate that AI-native SR enabled by coordinated MABs can improve the network performance over current Wi-Fi operation: mean throughput increases by 15%, fairness is improved by increasing the minimum throughput across the network by 210%, while the maximum access delay is kept below 3 ms.
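
To make the multi-armed bandit setting above concrete, here is a minimal sketch of one agent choosing among discrete configurations with epsilon-greedy exploration. It is not the paper's algorithm or the Komondor API: the class name, the epsilon-greedy rule, and the `shared_reward` rule (every agent receives the minimum throughput) are assumptions chosen purely for illustration.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over a discrete set of configurations (arms)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.arms)
        self.means = [0.0] * len(self.arms)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.means[i])

    def update(self, arm_index, reward):
        # Incremental update of the running mean reward of the chosen arm.
        self.counts[arm_index] += 1
        n = self.counts[arm_index]
        self.means[arm_index] += (reward - self.means[arm_index]) / n


def shared_reward(throughputs):
    """Hypothetical reward-sharing rule: every agent is rewarded with the minimum throughput."""
    return min(throughputs)
```

In a coordinated setup, each access point would hold one such bandit whose arms are (PD threshold, transmit power) pairs, and all agents would be updated with the same `shared_reward` value after each round, which pushes them toward configurations that help the worst-off network.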

[AI-28] Preference-based opponent shaping in differentiable games

链接: https://arxiv.org/abs/2412.03072
作者: Xinyu Qiao,Yudong Hu,Congying Han,Weiyan Wu,Tiande Guo
关键词-EN: Strategy learning, Strategy, challenging problem, learning, game environments
类目: Artificial Intelligence (cs.AI)
*备注:

Abstract:Strategy learning in multi-agent game environments is a challenging problem. Since each agent’s reward is determined by the joint strategy, a greedy learning strategy that aims to maximize its own reward may fall into a local optimum. Recent studies have proposed opponent modeling and shaping methods for game environments. These methods enhance the efficiency of strategy learning by modeling the strategies and updating processes of other agents. However, these methods often rely on simple predictions of opponent strategy changes. Because they do not model behavioral preferences such as cooperation and competition, they are usually applicable only to predefined scenarios and lack generalization capabilities. In this paper, we propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents’ preferences towards cooperation. We introduce the preference parameter, which is incorporated into the agent’s loss function, thus allowing the agent to directly consider the opponent’s loss function when updating the strategy. We update the preference parameters concurrently with strategy learning to ensure that agents can adapt to any cooperative or competitive game environment. Through a series of experiments, we verify the performance of the PBOS algorithm in a variety of differentiable games. The experimental results show that the PBOS algorithm can guide the agent to learn the appropriate preference parameters, so as to achieve better reward distribution in multiple game environments.

[AI-29] UTSD: Unified Time Series Diffusion Model

链接: https://arxiv.org/abs/2412.03068
作者: Xiangkai Ma,Xiaobin Hong,Wenzhong Li,Sanglu Lu
关键词-EN: achieved unprecedented success, time series analysis, Transformer-based architectures, Time Series Diffusion, Unified Time Series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing cross-domain modeling, existing approaches that use statistical priors as prompt engineering fail under the large distribution shift among domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful probability distribution modeling ability of diffusion. Unlike autoregressive models, which capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly through conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network to generate the prediction sequence; (2) an adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pretraining stage is utilized for downstream tasks in target domains; (3) a diffusion and denoising process on the actual sequence space, which, combined with improved classifier-free guidance as the conditional generation strategy, greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundational model.

[AI-30] Less is More: A Stealthy and Efficient Adversarial Attack Method for DRL-based Autonomous Driving Policies

链接: https://arxiv.org/abs/2412.03051
作者: Junchao Fan,Xuyang Lei,Xiaolin Chang,Jelena Mišić,Vojislav B. Mišić
关键词-EN: based autonomous driving, autonomous driving policies, autonomous driving, deep reinforcement learning, driving policies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Despite significant advancements in deep reinforcement learning (DRL)-based autonomous driving policies, these policies still exhibit vulnerability to adversarial attacks. This vulnerability poses a formidable challenge to the practical deployment of these policies in autonomous driving. Designing effective adversarial attacks is an indispensable prerequisite for enhancing the robustness of these policies. In view of this, we present a novel stealthy and efficient adversarial attack method for DRL-based autonomous driving policies. Specifically, we introduce a DRL-based adversary designed to trigger safety violations (e.g., collisions) by injecting adversarial samples at critical moments. We model the attack as a mixed-integer optimization problem and formulate it as a Markov decision process. Then, we train the adversary to learn the optimal policy for attacking at critical moments without domain knowledge. Furthermore, we introduce attack-related information and a trajectory clipping method to enhance the learning capability of the adversary. Finally, we validate our method in an unprotected left-turn scenario across different traffic densities. The experimental results show that our method achieves more than 90% collision rate within three attacks in most cases. Furthermore, our method achieves more than 130% improvement in attack efficiency compared to the unlimited attack method.

[AI-31] Specification Generation for Neural Networks in Systems

链接: https://arxiv.org/abs/2412.03028
作者: Isha Chaudhary,Shuyi Lin,Cheng Tan,Gagandeep Singh
关键词-EN: precise mathematical representations, precise mathematical, mathematical representations, crucial to guarantee, guarantee the trustworthiness
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

Abstract:Specifications - precise mathematical representations of correct domain-specific behaviors - are crucial to guarantee the trustworthiness of computer systems. With the increasing development of neural networks as computer system components, specifications gain more importance as they can be used to regulate the behaviors of these black-box models. Traditionally, specifications are designed by domain experts based on their intuition of correct behavior. However, this is labor-intensive and hence not a scalable approach as computer system applications diversify. We hypothesize that the traditional (aka reference) algorithms that neural networks replace for higher performance can act as effective proxies for correct behaviors of the models, when available. This is because they have been used and tested for long enough to encode several aspects of the trustworthy/correct behaviors in the underlying domain. Driven by our hypothesis, we develop a novel automated framework, SpecTRA to generate specifications for neural networks using references. We formulate specification generation as an optimization problem and solve it with observations of reference behaviors. SpecTRA clusters similar observations into compact specifications. We present specifications generated by SpecTRA for neural networks in adaptive bit rate and congestion control algorithms. Our specifications show evidence of being correct and matching intuition. Moreover, we use our specifications to show several unknown vulnerabilities of the SOTA models for computer systems.

[AI-32] Theoretical limitations of multi-layer Transformer

链接: https://arxiv.org/abs/2412.02975
作者: Lijie Chen,Binghui Peng,Hongxun Wu
关键词-EN: modern large language, large language models, textit, modern large, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
*备注:

Abstract:Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet we do not have much understanding of their expressive power except for the simple 1-layer case. Due to the difficulty of analyzing multi-layer models, all previous work relies on unproven complexity conjectures to show limitations for multi-layer Transformers. In this work, we prove the first unconditional lower bound against multi-layer decoder-only transformers. For any constant $L$, we prove that any $L$-layer decoder-only transformer needs a polynomial model dimension ($n^{\Omega(1)}$) to perform sequential composition of $L$ functions over an input of $n$ tokens. As a consequence, our results give: (1) the first depth-width trade-off for multi-layer transformers, exhibiting that the $L$-step composition task is exponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2) an unconditional separation between encoder and decoder, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought. On the technical side, we propose the multi-party autoregressive communication model that captures the computation of a decoder-only Transformer. We also introduce a new proof technique that finds a certain indistinguishable decomposition of all possible inputs iteratively for proving lower bounds in this model. We believe our new communication model and proof technique will be helpful to further understand the computational power of transformers.

[AI-33] 3D Interaction Geometric Pre-training for Molecular Relational Learning

链接: https://arxiv.org/abs/2412.02957
作者: Namkyeong Lee,Yunhak Oh,Heewoong Noh,Gyoung S. Na,Minkai Xu,Hanchen Wang,Tianfan Fu,Chanyoung Park
关键词-EN: Molecular Relational Learning, rapidly growing field, Relational Learning, Molecular Relational, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains a 2D MRL model to learn the overall 3D geometric information of molecular interaction through contrastive learning. Moreover, fine-grained interaction between molecules is learned through a force prediction loss, which is crucial in understanding the wide range of molecular interaction processes. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93% improvement in performance across 40 tasks.

[AI-34] STDCformer: A Transformer-Based Model with a Spatial-Temporal Causal De-Confounding Strategy for Crowd Flow Prediction

链接: https://arxiv.org/abs/2412.02942
作者: Silu He,Peng Shen,Pingzhen Xu,Qinyao Luo,Haifeng Li
关键词-EN: Existing works typically, works typically treat, Existing works, representation space, typically treat spatial-temporal
类目: Artificial Intelligence (cs.AI)
*备注:

Abstract:Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ to transform historical observations to future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\{E, D\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from future states to past states within the representation space. This leads to two key questions: Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space? To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounding causal effect of historical data on future data. The captured causal relationship serves as the foundation for subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide spatial-temporal mapping.

[AI-35] SAVER: A Toolbox for Sampling-Based Probabilistic Verification of Neural Networks

链接: https://arxiv.org/abs/2412.02940
作者: Vignesh Sivaramakrishnan,Krishna C. Kalagarla,Rosalyn Devonport,Joshua Pilipovsky,Panagiotis Tsiotras,Meeko Oishi
关键词-EN: neural network verification, set expansion factor, network verification toolbox, neural network, expansion factor
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 8 figures, submitted to the 28th ACM International Conference on Hybrid Systems: Computation and Control

Abstract:We present a neural network verification toolbox to 1) assess the probability of satisfaction of a constraint, and 2) synthesize a set expansion factor to achieve the probability of satisfaction. Specifically, the toolbox establishes with a user-specified level of confidence whether the output of the neural network for a given input distribution is likely to be contained within a given set. Should the tool determine that the given set cannot satisfy the likelihood constraint, the tool also implements an approach outlined in this paper to alter the constraint set to ensure that the user-defined satisfaction probability is achieved. The toolbox comprises sampling-based approaches which exploit the properties of the signed distance function to define set containment.
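
The idea of sampling-based containment checking with a signed distance function can be sketched as follows. This is not the SAVER toolbox API: all function names are hypothetical, the constraint set is assumed to be a Euclidean ball, and the sample-size rule is the standard Hoeffding bound rather than whatever bound the toolbox actually uses.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Samples needed so the empirical probability is within eps of the true one
    with probability at least 1 - delta (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def signed_distance_ball(y, center, radius):
    """Signed distance to a ball: negative inside, zero on the boundary, positive outside."""
    return math.dist(y, center) - radius


def estimate_containment(network, sample_input, center, radius, n, rng):
    """Fraction of sampled outputs whose signed distance to the ball is <= 0."""
    inside = 0
    for _ in range(n):
        y = network(sample_input(rng))
        if signed_distance_ball(y, center, radius) <= 0.0:
            inside += 1
    return inside / n


def expanded_radius(network, sample_input, center, radius, target, n, rng):
    """Smallest radius >= the given one that contains at least a `target`
    fraction of freshly drawn sample outputs (empirical quantile of distances)."""
    dists = sorted(math.dist(network(sample_input(rng)), center) for _ in range(n))
    k = max(math.ceil(target * n) - 1, 0)
    return max(radius, dists[k])
```

Here `network` is any callable mapping an input tuple to an output tuple and `sample_input` draws one input from the input distribution using the supplied `random.Random` instance. Note that `expanded_radius` is purely empirical; attaching a confidence level to the enlarged set would need an additional statistical argument.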

[AI-36] Inverse Delayed Reinforcement Learning

链接: https://arxiv.org/abs/2412.02931
作者: Simon Sinong Zhan,Qingyuan Wu,Zhian Ruan,Frank Yang,Philip Wang,Yixuan Wang,Ruochen Jiao,Chao Huang,Qi Zhu
关键词-EN: Inverse Reinforcement Learning, Inverse Reinforcement, Reinforcement Learning, imitation tasks, variety of imitation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

Abstract:Inverse Reinforcement Learning (IRL) has demonstrated effectiveness in a variety of imitation tasks. In this paper, we introduce an IRL framework designed to extract rewarding features from expert trajectories affected by delayed disturbances. Instead of relying on direct observations, our approach employs an efficient off-policy adversarial training framework to derive expert features and recover optimal policies from augmented delayed observations. Empirical evaluations in the MuJoCo environment under diverse delay settings validate the effectiveness of our method. Furthermore, we provide a theoretical analysis showing that recovering expert policies from augmented delayed observations outperforms using direct delayed observations.

[AI-37] Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data

链接: https://arxiv.org/abs/2412.02919
作者: Soroush Omranpour,Guillaume Rabusseau,Reihaneh Rabbany
关键词-EN: sequence modeling tasks, propose Higher-Order Transformers, multi-dimensional data remains, ubiquitous for sequence, sequence modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the quadratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis’ dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model’s expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.
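
A small sketch can illustrate why factorizing attention per axis is cheaper than attending over the flattened tensor. The helper names below are hypothetical, and the functions only count attention-score entries and build a Kronecker product of per-axis matrices; they are not the HOT implementation and ignore the cost of applying the attention to the values.

```python
from math import prod


def full_attention_entries(shape):
    """Entries in a single attention matrix over all positions of a tensor with the given axis sizes."""
    n = prod(shape)
    return n * n


def axiswise_attention_entries(shape):
    """Total entries when one attention matrix of size n_i x n_i is kept per axis."""
    return sum(n * n for n in shape)


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]
```

For a 32 x 32 x 32 input, the flattened attention matrix has 1,073,741,824 entries, while three per-axis matrices have 3,072 in total. The `kron` helper shows how small per-axis matrices implicitly define one large matrix over all positions; in particular, the Kronecker product of row-stochastic matrices is again row-stochastic, so per-axis softmax outputs combine into a valid attention distribution.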

[AI-38] Deep-Learning Based Docking Methods: Fair Comparisons to Conventional Docking Workflows

链接: https://arxiv.org/abs/2412.02889
作者: Ajay N. Jain,Ann E. Cleves,W. Patrick Walters
关键词-EN: docking small-molecule ligands, binding site, binding site condition, protein binding sites, conventional docking approaches
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 19 pages including references and appendices, 7 figures

Abstract:The diffusion learning method, DiffDock, for docking small-molecule ligands into protein binding sites was recently introduced. Results included comparisons to more conventional docking approaches, with DiffDock showing superior performance. Here, we employ a fully automatic workflow using the Surflex-Dock methods to generate a fair baseline for conventional docking approaches. Results were generated for the common and expected situation where a binding site location is known and also for the condition of an unknown binding site. For the known binding site condition, Surflex-Dock success rates at 2.0 Angstroms RMSD far exceeded those for DiffDock (Top-1/Top-5 success rates, respectively, were 68/81% compared with 45/51%). Glide performed with similar success rates (67/73%) to Surflex-Dock for the known binding site condition, and results for AutoDock Vina and Gnina followed this pattern. For the unknown binding site condition, using an automated method to identify multiple binding pockets, Surflex-Dock success rates again exceeded those of DiffDock, but by a somewhat lesser margin. DiffDock made use of roughly 17,000 co-crystal structures for learning (98% of PDBBind version 2020, pre-2019 structures) for a training set in order to predict on 363 test cases (2% of PDBBind 2020) from 2019 forward. DiffDock’s performance was inextricably linked with the presence of near-neighbor cases of close to identical protein-ligand complexes in the training set for over half of the test set cases. DiffDock exhibited a 40 percentage point difference on near-neighbor cases (two-thirds of all test cases) compared with cases with no near-neighbor training case. DiffDock has apparently encoded a type of table-lookup during its learning process, rendering meaningful applications beyond its reach. Further, it does not perform even close to competitively with a competently run modern docking workflow.
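
The Top-1/Top-5 success rates at 2.0 Angstroms RMSD quoted above follow a common docking evaluation convention, which can be sketched as below. The function names are hypothetical, and the RMSD here assumes both poses list the same atoms in the same order and applies no symmetry correction; the paper's exact evaluation protocol may differ.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of 3D atom coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def top_k_success_rate(ranked_rmsds_per_case, k, threshold=2.0):
    """Fraction of cases where any of the k highest-ranked poses has RMSD <= threshold (Angstroms)."""
    hits = sum(
        1 for ranked in ranked_rmsds_per_case
        if any(value <= threshold for value in ranked[:k])
    )
    return hits / len(ranked_rmsds_per_case)
```

Each inner list in `ranked_rmsds_per_case` holds the RMSD values of one test case's predicted poses, ordered by the docking method's own ranking, so Top-1 looks only at the first value while Top-5 accepts a hit anywhere in the first five.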

[AI-39] Modeling and Discovering Direct Causes for Predictive Models

链接: https://arxiv.org/abs/2412.02878
作者: Yizuo Chen,Amit Bhatia
关键词-EN: machine learning models, causal modeling framework, causal graphs, machine learning, causal modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

Abstract:We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We show two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.

[AI-40] Out-of-Distribution Detection for Neurosymbolic Autonomous Cyber Agents

链接: https://arxiv.org/abs/2412.02875
作者: Ankita Samaddar,Nicholas Potteiger,Xenofon Koutsoukos
关键词-EN: modern defense techniques, adopting intelligent agents, applications take advantage, advantage of modern, modern defense
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 9 pages, 10 figures, IEEE International Conference on AI in Cybersecurity (ICAIC), 2025

Abstract:Autonomous agents for cyber applications take advantage of modern defense techniques by adopting intelligent agents with conventional and learning-enabled components. These intelligent agents are trained via reinforcement learning (RL) algorithms, and can learn, adapt to, reason about and deploy security rules to defend networked computer systems while maintaining critical operational workflows. However, the knowledge available during training about the state of the operational network and its environment may be limited. The agents should be trustworthy so that they can reliably detect situations they cannot handle, and hand them over to cyber experts. In this work, we develop an out-of-distribution (OOD) Monitoring algorithm that uses a Probabilistic Neural Network (PNN) to detect anomalous or OOD situations of RL-based agents with discrete states and discrete actions. To demonstrate the effectiveness of the proposed approach, we integrate the OOD monitoring algorithm with a neurosymbolic autonomous cyber agent that uses behavior trees with learning-enabled components. We evaluate the proposed approach in a simulated cyber environment under different adversarial strategies. Experimental results over a large number of episodes illustrate the overall efficiency of our proposed approach.

[AI-41] Constrained Identifiability of Causal Effects

链接: https://arxiv.org/abs/2412.02869
作者: Yizuo Chen,Adnan Darwiche
关键词-EN: causal graph, study the identification, causal effects, causal, constraints
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

Abstract:We study the identification of causal effects in the presence of different types of constraints (e.g., logical constraints) in addition to the causal graph. These constraints impose restrictions on the models (parameterizations) induced by the causal graph, reducing the set of models considered by the identifiability problem. We formalize the notion of constrained identifiability, which takes a set of constraints as another input to the classical definition of identifiability. We then introduce a framework for testing constrained identifiability by employing tractable Arithmetic Circuits (ACs), which enables us to accommodate constraints systematically. We show that this AC-based approach is at least as complete as existing algorithms (e.g., do-calculus) for testing classical identifiability, which only assumes the constraint of strict positivity. We use examples to demonstrate the effectiveness of this AC-based approach by showing that unidentifiable causal effects may become identifiable under different types of constraints.

[AI-42] A Novel Compact LLM Framework for Local High-Privacy EHR Data Applications

链接: https://arxiv.org/abs/2412.02868
作者: Yixiang Qu,Yifan Dai,Shilin Yu,Pradham Tanikella,Travis Schrank,Trevor Hackman,Didong Li,Di Wu
关键词-EN: Electronic Health Records, Large Language Models, natural language processing, Health Records, Electronic Health
类目: Artificial Intelligence (cs.AI)
*备注:

Abstract:Large Language Models (LLMs) have shown impressive capabilities in natural language processing, yet their use in sensitive domains like healthcare, particularly with Electronic Health Records (EHR), faces significant challenges due to privacy concerns and limited computational resources. This paper presents a compact LLM framework designed for local deployment in settings with strict privacy requirements and limited access to high-performance GPUs. We introduce a novel preprocessing technique that uses information extraction methods, e.g., regular expressions, to filter and emphasize critical information in clinical notes, enhancing the performance of smaller LLMs on EHR data. Our framework is evaluated using zero-shot and few-shot learning paradigms on both private and publicly available (MIMIC-IV) datasets, and we also compare its performance with fine-tuned LLMs on the MIMIC-IV dataset. The results demonstrate that our preprocessing approach significantly boosts the prediction accuracy of smaller LLMs, making them suitable for high-privacy, resource-constrained applications. This study offers valuable insights into optimizing LLM performance for sensitive, data-intensive tasks while addressing computational and privacy limitations.
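
The regular-expression preprocessing step described above might look roughly like the following sketch. The patterns, function names, and prompt layout are all hypothetical and chosen only for illustration; the paper's actual extraction rules are not given in the abstract, and real clinical notes would need patterns selected with domain experts.

```python
import re

# Hypothetical patterns; not taken from the paper.
PATTERNS = [
    re.compile(r"\b(diagnos\w*|metasta\w*|carcinoma|tumou?r)\b", re.IGNORECASE),
    re.compile(r"\b(stage\s+[IV]{1,3}[AB]?)\b", re.IGNORECASE),
]


def extract_key_sentences(note):
    """Keep only sentences that match at least one pattern, preserving their order."""
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return [s for s in sentences if any(p.search(s) for p in PATTERNS)]


def build_prompt(note, question):
    """Place the filtered sentences ahead of the question to keep the prompt short."""
    key = extract_key_sentences(note)
    context = "\n".join(f"- {s}" for s in key) if key else "- (no matching sentences)"
    return f"Relevant findings:\n{context}\n\nQuestion: {question}"
```

Filtering before prompting shortens the context a small, locally deployed model has to read, which is the motivation the abstract gives for the preprocessing step.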

[AI-43] Proximal Control of UAVs with Federated Learning for Human-Robot Collaborative Domains

链接: https://arxiv.org/abs/2412.02863
作者: Lucas Nogueira Nobrega,Ewerton de Oliveira,Martin Saska,Tiago Nascimento
关键词-EN: human-robot interaction, area of research, growing area, HRI, Deep Neural Networks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

Abstract:Human-robot interaction (HRI) is a growing area of research. In HRI, complex command (action) classification is still an open problem that usually prevents the real applicability of such a technique. The literature presents some works that use neural networks to detect these actions. However, occlusion is still a major issue in HRI, especially when using uncrewed aerial vehicles (UAVs), since, during the robot’s movement, the human operator is often out of the robot’s field of view. Furthermore, in multi-robot scenarios, distributed training is also an open problem. In this sense, this work proposes an action recognition and control approach based on Long Short-Term Memory (LSTM) Deep Neural Networks with two layers in association with three densely connected layers and Federated Learning (FL) embedded in multiple drones. FL enabled our approach to be trained in a distributed fashion, i.e., with access to data without the need for cloud or other repositories, which facilitates the multi-robot system’s learning. Furthermore, our multi-robot approach also prevented occlusion situations, with experiments with real robots achieving an accuracy greater than 96%.

[AI-44] Block MedCare: Advancing healthcare through blockchain integration with AI and IoT

Link: https://arxiv.org/abs/2412.02851
Authors: Oliver Simonoski, Dijana Capeska Bogatinoska
Keywords: Electronic Health Record, Health Record, Electronic Health, efficiency of Electronic, focusing on enhancing
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes:

Abstract:This research explores the integration of blockchain technology in healthcare, focusing on enhancing the security and efficiency of Electronic Health Record (EHR) management. We propose a novel Ethereum-based system that empowers patients with secure control over their medical data. Our approach addresses key challenges in healthcare blockchain implementation, including scalability, privacy, and regulatory compliance. The system incorporates digital signatures, Role-Based Access Control, and a multi-layered architecture to ensure secure, controlled access. We developed a decentralized application (dApp) with user-friendly interfaces for patients, doctors, and administrators, demonstrating the practical application of our solution. A survey among healthcare professionals and IT experts revealed strong interest in blockchain adoption, while also highlighting concerns about integration costs. The study explores future enhancements, including integration with IoT devices and AI-driven analytics, contributing to the evolution of secure, efficient, and interoperable healthcare systems that leverage cutting-edge technologies for improved patient care.

[AI-45] Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Model

Link: https://arxiv.org/abs/2412.02802
Authors: María Victoria Carro
Keywords: user perceived preferences, perceived preferences, factually correct, statements are factually, large language model
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Sycophancy refers to the tendency of a large language model to align its outputs with the user’s perceived preferences, beliefs, or opinions, in order to look favorable, regardless of whether those statements are factually correct. This behavior can lead to undesirable consequences, such as reinforcing discriminatory biases or amplifying misinformation. Given that sycophancy is often linked to human feedback training mechanisms, this study explores whether sycophantic tendencies negatively impact user trust in large language models or, conversely, whether users consider such behavior as favorable. To investigate this, we instructed one group of participants to answer ground-truth questions with the assistance of a GPT specifically designed to provide sycophantic responses, while another group used the standard version of ChatGPT. Initially, participants were required to use the language model, after which they were given the option to continue using it if they found it trustworthy and useful. Trust was measured through both demonstrated actions and self-reported perceptions. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust compared to those who interacted with the standard version of the model, despite the opportunity to verify the accuracy of the model’s output.

[AI-46] Optimization of Transformer heart disease prediction model based on particle swarm optimization algorithm

Link: https://arxiv.org/abs/2412.02801
Authors: Peiyang Yu, Jingyuan Yi, Tianyi Huang, Zeqiu Xu, Xiaochuan Xu
Keywords: improved Transformer model, particle swarm optimization, Transformer model, heart disease, latest particle swarm
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Aiming at the latest particle swarm optimization algorithm, this paper proposes an improved Transformer model to improve the accuracy of heart disease prediction and provide a new algorithm idea. We first use three mainstream machine learning classification algorithms - decision tree, random forest and XGBoost, and then output the confusion matrix of these three models. The results showed that the random forest model had the best performance in predicting the classification of heart disease, with an accuracy of 92.2%. Then, we apply the Transformer model based on particle swarm optimization (PSO) algorithm to the same dataset for classification experiment. The results show that the classification accuracy of the model is as high as 96.5%, 4.3 percentage points higher than that of random forest, which verifies the effectiveness of PSO in optimizing Transformer model. From the above research, we can see that particle swarm optimization significantly improves Transformer performance in heart disease prediction. Improving the ability to predict heart disease is a global priority with benefits for all humankind. Accurate prediction can enhance public health, optimize medical resources, and reduce healthcare costs, leading to healthier populations and more productive societies worldwide. This advancement paves the way for more efficient health management and supports the foundation of a healthier, more resilient global community.
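
For readers unfamiliar with particle swarm optimization, here is a minimal generic PSO loop. It is a sketch under assumptions: the inertia and acceleration constants are common textbook defaults, and the toy `objective` stands in for whatever the authors actually optimize (for instance a validation loss as a function of Transformer hyperparameters), which the abstract does not specify.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box given as [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In a hyperparameter-tuning setting, each coordinate of a particle would encode one hyperparameter (for example learning rate or number of heads), and `objective` would train and score a model, which makes each evaluation far more expensive than in this toy form.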

[AI-47] FathomGPT: A Natural Language Interface for Interactively Exploring Ocean Science Data

Link: https://arxiv.org/abs/2412.02784
Authors: Nabin Khanal, Chun Meng Yu, Jui-Cheng Chiu, Anav Chaudhary, Ziyue Zhang, Kakani Katija, Angus G. Forbes
Keywords: open source system, natural language interface, open source, source system, ocean science data
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: The first two authors contributed equally to this work. Accepted to the 37th Annual ACM Symposium on User Interface Software and Technology (UIST 2024)

Abstract:We introduce FathomGPT, an open source system for the interactive investigation of ocean science data via a natural language interface. FathomGPT was developed in close collaboration with marine scientists to enable researchers to explore and analyze the FathomNet image database. FathomGPT provides a custom information retrieval pipeline that leverages OpenAI’s large language models to enable: the creation of complex queries to retrieve images, taxonomic information, and scientific measurements; mapping common names and morphological features to scientific names; generating interactive charts on demand; and searching by image or specified patterns within an image. In designing FathomGPT, particular emphasis was placed on enhancing the user’s experience by facilitating free-form exploration and optimizing response times. We present an architectural overview and implementation details of FathomGPT, along with a series of ablation studies that demonstrate the effectiveness of our approach to name resolution, fine tuning, and prompt modification. We also present usage scenarios of interactive data exploration sessions and document feedback from ocean scientists and machine learning experts.

[AI-48] WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks

Link: https://arxiv.org/abs/2412.02780
Authors: Rajat Shinde, Christopher E. Phillips, Kumar Ankur, Aman Gupta, Simon Pfreundschuh, Sujit Roy, Sheyenne Kirkland, Vishal Gaur, Amy Lin, Aditi Sheshadri, Udaysankar Nair, Manil Maskey, Rahul Ramachandran
Keywords: fine-tuning existing models, ready datasets play, weather and climate, High-quality machine learning, machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:High-quality machine learning (ML)-ready datasets play a foundational role in developing new artificial intelligence (AI) models or fine-tuning existing models for scientific applications such as weather and climate analysis. Unfortunately, despite the growing development of new deep learning models for weather and climate, there is a scarcity of curated, pre-processed machine learning (ML)-ready datasets. Curating such high-quality datasets for developing new models is challenging particularly because the modality of the input data varies significantly for different downstream tasks addressing different atmospheric scales (spatial and temporal). Here we introduce WxC-Bench (Weather and Climate Bench), a multi-modal dataset designed to support the development of generalizable AI models for downstream use-cases in weather and climate research. WxC-Bench is designed as a dataset of datasets for developing ML-models for a complex weather and climate system, addressing selected downstream tasks as machine learning phenomenon. WxC-Bench encompasses several atmospheric processes from meso-β (20-200 km) scale to synoptic scales (2500 km), such as aviation turbulence, hurricane intensity and track monitoring, weather analog search, gravity wave parameterization, and natural language report generation. We provide a comprehensive description of the dataset and also present a technical validation for baseline analysis. The dataset and code to prepare the ML-ready data have been made publicly available on Hugging Face – this https URL

[AI-49] Synergistic Development of Perovskite Memristors and Algorithms for Robust Analog Computing

Link: https://arxiv.org/abs/2412.02779
Authors: Nanyang Ye, Qiao Sun, Yifei Wang, Liujia Yang, Jundong Zhou, Lei Wang, Guang-Zhong Yang, Xinbing Wang, Chenghu Zhou, Huaqiang Wu, Qinying Gu
Keywords: energy-efficient deep learning, deep learning, energy-efficient deep, Analog, Analog computing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Notes:

Abstract:Analog computing using non-volatile memristors has emerged as a promising solution for energy-efficient deep learning. New materials, like perovskites-based memristors are recently attractive due to their cost-effectiveness, energy efficiency and flexibility. Yet, challenges in material diversity and immature fabrications require extensive experimentation for device development. Moreover, significant non-idealities in these memristors often impede them for computing. Here, we propose a synergistic methodology to concurrently optimize perovskite memristor fabrication and develop robust analog DNNs that effectively address the inherent non-idealities of these memristors. Employing Bayesian optimization (BO) with a focus on usability, we efficiently identify optimal materials and fabrication conditions for perovskite memristors. Meanwhile, we developed “BayesMulti”, a DNN training strategy utilizing BO-guided noise injection to improve the resistance of analog DNNs to memristor imperfections. Our approach theoretically ensures that within a certain range of parameter perturbations due to memristor non-idealities, the prediction outcomes remain consistent. Our integrated approach enables use of analog computing in much deeper and wider networks, which significantly outperforms existing methods in diverse tasks like image classification, autonomous driving, species identification, and large vision-language models, achieving up to 100-fold improvements. We further validate our methodology on a 10 × 10 optimized perovskite memristor crossbar, demonstrating high accuracy in a classification task and low energy consumption. This study offers a versatile solution for efficient optimization of various analog computing systems, encompassing both devices and algorithms.

[AI-50] Hacking CTFs with Plain Agents

Link: https://arxiv.org/abs/2412.02776
Authors: Rustem Turtayev, Artem Petrov, Dmitrii Volkov, Denis Volk
Keywords: LLM agent design, plain LLM agent, agent design
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes:

Abstract:We saturate a high-school-level hacking benchmark with plain LLM agent design. Concretely, we obtain 95% performance on InterCode-CTF, a popular offensive security benchmark, using prompting, tool use, and multiple attempts. This beats prior work by Phuong et al. 2024 (29%) and Abramovich et al. 2024 (72%). Our results suggest that current LLMs have surpassed the high school level in offensive cybersecurity. Their hacking capabilities remain underelicited: our ReActPlan prompting strategy solves many challenges in 1-2 turns without complex engineering or advanced harnessing.

[AI-51] Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code

Link: https://arxiv.org/abs/2412.02764
Authors: Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov
Keywords: visual data exploration, evaluate language models’, language models’ effectiveness, human-curated PandasPlotBench dataset, designed to evaluate
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 5 pages

Abstract:This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models’ effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data - such as a Pandas DataFrame - based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that the shortening of tasks has a minimal effect on plotting capabilities, allowing for the user interface that accommodates concise user input without sacrificing functionality or accuracy. Another of our findings reveals that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden the current studies on generating visualizations. Our benchmark is available online: this https URL. The code for running the benchmark is also available: this https URL.

[AI-52] Shaping AI’s Impact on Billions of Lives

Link: https://arxiv.org/abs/2412.02730
Authors: Mariano-Florentino Cuéllar, Jeff Dean, Finale Doshi-Velez, John Hennessy, Andy Konwinski, Sanmi Koyejo, Pelonomi Moiloa, Emma Pierson, David Patterson
Keywords: double-edged sword, significant advancements, advancements or detrimental, detrimental outcomes, Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Notes:

Abstract:Artificial Intelligence (AI), like any transformative technology, has the potential to be a double-edged sword, leading either toward significant advancements or detrimental outcomes for society as a whole. As is often the case when it comes to widely-used technologies in market economies (e.g., cars and semiconductor chips), commercial interest tends to be the predominant guiding factor. The AI community is at risk of becoming polarized to either take a laissez-faire attitude toward AI development, or to call for government overregulation. Between these two poles we argue for the community of AI practitioners to consciously and proactively work for the common good. This paper offers a blueprint for a new type of innovation infrastructure including 18 concrete milestones to guide AI research in that direction. Our view is that we are still in the early days of practical AI, and focused efforts by practitioners, policymakers, and other stakeholders can still maximize the upsides of AI and minimize its downsides. We talked to luminaries such as recent Nobelist John Jumper on science, President Barack Obama on governance, former UN Ambassador and former National Security Advisor Susan Rice on security, philanthropist Eric Schmidt on several topics, and science fiction novelist Neal Stephenson on entertainment. This ongoing dialogue and collaborative effort has produced a comprehensive, realistic view of what the actual impact of AI could be, from a diverse assembly of thinkers with deep understanding of this technology and these domains. From these exchanges, five recurring guidelines emerged, which form the cornerstone of a framework for beginning to harness AI in service of the public good. They not only guide our efforts in discovery but also shape our approach to deploying this transformative technology responsibly and ethically. 

[AI-53] DYffCast: Regional Precipitation Nowcasting Using IMERG Satellite Data. A case study over South America NEURIPS2024

Link: https://arxiv.org/abs/2412.02723
Authors: Daniel Seal, Rossella Arcucci, Salva Rühling-Cachay, César Quilodrán-Casas
Keywords: making weather disasters, extreme precipitation events, Climate change, making weather, change is increasing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted in the Machine Learning for Physical Sciences workshop @ NeurIPS 2024

Abstract:Climate change is increasing the frequency of extreme precipitation events, making weather disasters such as flooding and landslides more likely. The ability to accurately nowcast precipitation is therefore becoming more critical for safeguarding society by providing immediate, accurate information to decision makers. Motivated by the recent success of generative models at precipitation nowcasting, this paper: extends the DYffusion framework to this task and evaluates its performance at forecasting IMERG satellite precipitation data up to a 4-hour horizon; modifies the DYffusion framework to improve its ability to model rainfall data; and introduces a novel loss function that combines MSE, MAE and the LPIPS perceptual score. In a quantitative evaluation of forecasts up to a 4-hour horizon, the modified DYffusion framework trained with the novel loss outperforms four competitor models. It has the highest CSI scores for weak, moderate, and heavy rain thresholds and retains an LPIPS score 0.2 for the entire roll-out, degrading the least as lead-time increases. The proposed nowcasting model demonstrates visually stable and sharp forecasts up to a 2-hour horizon on a heavy rain case study. Code is available at this https URL.
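
The CSI (Critical Success Index) scores reported per rain threshold follow the standard definition hits / (hits + misses + false alarms). A minimal sketch, assuming forecasts and observations are flattened into equal-length sequences of rain rates; the function name is illustrative and not from the paper's code:

```python
def critical_success_index(forecast, observed, threshold):
    """CSI for one rain-rate threshold over paired forecast/observed values."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1          # event forecast and observed
        elif o_event:
            misses += 1        # event observed but not forecast
        elif f_event:
            false_alarms += 1  # event forecast but not observed
    denom = hits + misses + false_alarms
    # Undefined when the event never occurs in either field.
    return hits / denom if denom else float("nan")
```

Correct negatives (no rain forecast, none observed) do not enter the score, which is why CSI is preferred over plain accuracy for rare heavy-rain events.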

[AI-54] Enhanced N-BEATS for Mid-Term Electricity Demand Forecasting

Link: https://arxiv.org/abs/2412.02722
Authors: Mateusz Kasprzyk, Paweł Pełka, Boris N. Oreshkin, Grzegorz Dudek
Keywords: improved mid-term electricity, mid-term electricity load, paper presents, presents an enhanced, improved mid-term
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:This paper presents an enhanced N-BEATS model, N-BEATS*, for improved mid-term electricity load forecasting (MTLF). Building on the strengths of the original N-BEATS architecture, which excels in handling complex time series data without requiring preprocessing or domain-specific knowledge, N-BEATS* introduces two key modifications. (1) A novel loss function – combining pinball loss based on MAPE with normalized MSE, the new loss function allows for a more balanced approach by capturing both L1 and L2 loss terms. (2) A modified block architecture – the internal structure of the N-BEATS blocks is adjusted by introducing a destandardization component to harmonize the processing of different time series, leading to more efficient and less complex forecasting tasks. Evaluated on real-world monthly electricity consumption data from 35 European countries, N-BEATS* demonstrates superior performance compared to its predecessor and other established forecasting methods, including statistical, machine learning, and hybrid models. N-BEATS* achieves the lowest MAPE and RMSE, while also exhibiting the lowest dispersion in forecast errors.
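
The abstract describes the new loss only at a high level, so the following is one plausible reading rather than the authors' formula: a pinball (quantile) loss applied to percentage errors, mixed with an MSE normalized by the mean squared target. The quantile `q`, the mixing weight `lam`, and the normalization are all assumptions made for illustration.

```python
def pinball_mape(y_true, y_pred, q=0.5):
    """Pinball loss on percentage errors (an L1-type term)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        e = (y - p) / abs(y)                       # signed percentage error
        total += q * e if e >= 0 else (q - 1) * e  # asymmetric absolute penalty
    return total / len(y_true)


def normalized_mse(y_true, y_pred):
    """MSE divided by the mean squared target (an L2-type term)."""
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    scale = sum(y * y for y in y_true) / len(y_true)
    return mse / scale


def combined_loss(y_true, y_pred, q=0.5, lam=0.5):
    """Convex mix of the two terms; `lam` is a hypothetical weight."""
    return lam * pinball_mape(y_true, y_pred, q) + (1 - lam) * normalized_mse(y_true, y_pred)
```

With `q = 0.5` the pinball term reduces to half the MAPE, so the mix balances a scale-free absolute error against a scale-free squared error, which is the L1/L2 balance the abstract alludes to.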

[AI-55] Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments

Link: https://arxiv.org/abs/2412.02713
Authors: Alona Strugatski, Giora Alexandron
Keywords: raising significant concerns, educational landscape, raising significant, Item Response Theory, transforming the educational
Categories: Artificial Intelligence (cs.AI)
Notes: Pre-print version. Accepted to The 15th International Learning Analytics and Knowledge Conference (LAK25)

Abstract:Generative AI is transforming the educational landscape, raising significant concerns about cheating. Despite the widespread use of multiple-choice questions in assessments, the detection of AI cheating in MCQ-based tests has been almost unexplored, in contrast to the focus on detecting AI-cheating on text-rich student outputs. In this paper, we propose a method based on the application of Item Response Theory to address this gap. Our approach operates on the assumption that artificial and human intelligence exhibit different response patterns, with AI cheating manifesting as deviations from the expected patterns of human responses. These deviations are modeled using Person-Fit Statistics. We demonstrate that this method effectively highlights the differences between human responses and those generated by premium versions of leading chatbots (ChatGPT, Claude, and Gemini), but that it is also sensitive to the amount of AI cheating in the data. Furthermore, we show that the chatbots differ in their reasoning profiles. Our work provides both a theoretical foundation and empirical evidence for the application of IRT to identify AI cheating in MCQ-based assessments.
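
The abstract does not name the person-fit statistic it uses. As an illustration of the general idea, the sketch below computes the widely used standardized log-likelihood statistic l_z under a Rasch model, assuming item difficulties and the respondent's ability are already estimated. Strongly negative values indicate a response pattern that is unlikely for a human of that ability.

```python
import math


def rasch_prob(theta, difficulty):
    """Probability of a correct answer under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def lz_statistic(responses, theta, difficulties):
    """Standardized log-likelihood person-fit statistic.

    responses: list of 0/1 item scores for one respondent.
    """
    l0 = expected = variance = 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        l0 += u * math.log(p) + (1 - u) * math.log(1 - p)        # observed log-likelihood
        expected += p * math.log(p) + (1 - p) * math.log(1 - p)  # its expectation
        variance += p * (1 - p) * math.log(p / (1 - p)) ** 2     # its variance
    return (l0 - expected) / math.sqrt(variance)
```

A respondent who answers the easy items correctly and misses the hard ones gets a positive l_z, while the reverse pattern (missing easy items, solving hard ones) gets a negative value, which is the kind of deviation the paper attributes to AI-generated answers.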

[AI-56] Fine Tuning Swimming Locomotion Learned from Mosquito Larvae

Link: https://arxiv.org/abs/2412.02702
Authors: Pranav Rajbhandari, Karthick Dhileep, Sridhar Ravi, Donald Sofge
Keywords: Computational Fluid Dynamics, Fluid Dynamics, Computational Fluid, backwards swimming motion, parameterized swimming motion
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Notes:

Abstract:In prior research, we analyzed the backwards swimming motion of mosquito larvae, parameterized it, and replicated it in a Computational Fluid Dynamics (CFD) model. Since the parameterized swimming motion is copied from observed larvae, it is not necessarily the most efficient locomotion for the model of the swimmer. In this project, we further optimize this copied solution for the swimmer model. We utilize Reinforcement Learning to guide local parameter updates. Since the majority of the computation cost arises from the CFD model, we additionally train a deep learning model to replicate the forces acting on the swimmer model. We find that this method is effective at performing local search to improve the parameterized swimming locomotion.

[AI-57] MRNet: Multifaceted Resilient Networks for Medical Image-to-Image Translation

Link: https://arxiv.org/abs/2412.03039
Authors: Hyojeong Lee, Youngwan Jo, Inpyo Hong, Sanghyun Park
Keywords: Multifaceted Resilient Network, Resilient Network, Multifaceted Resilient, propose a Multifaceted, Segment Anything Model
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Notes: This work has been submitted to the IEEE for possible publication

Abstract:We propose a Multifaceted Resilient Network (MRNet), a novel architecture developed for medical image-to-image translation that outperforms state-of-the-art methods in MRI-to-CT and MRI-to-MRI conversion. MRNet leverages the Segment Anything Model (SAM) to exploit frequency-based features to build a powerful method for advanced medical image transformation. The architecture extracts comprehensive multiscale features from diverse datasets using a powerful SAM image encoder and performs resolution-aware feature fusion that consistently integrates U-Net encoder outputs with SAM-derived features. This fusion optimizes the traditional U-Net skip connection while leveraging transformer-based contextual analysis. The translation is complemented by an innovative dual-mask configuration incorporating dynamic attention patterns and a specialized loss function designed to address regional mapping mismatches, preserving both the gross anatomy and tissue details. Extensive validation studies have shown that MRNet outperforms state-of-the-art architectures, particularly in maintaining anatomical fidelity and minimizing translation artifacts.

[AI-58] MILLION: A General Multi-Objective Framework with Controllable Risk for Portfolio Management VLDB2025

Link: https://arxiv.org/abs/2412.03038
Authors: Liwei Deng, Tianfu Wang, Yan Zhao, Kai Zheng
Keywords: allocate investors’ budgets, Portfolio, portfolio interpolation, risk, Portfolio management
Categories: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: accepted by VLDB 2025

Abstract:Portfolio management is an important yet challenging task in AI for FinTech, which aims to allocate investors’ budgets among different assets to balance the risk and return of an investment. In this study, we propose a general Multi-objectIve framework with controLLable rIsk for pOrtfolio maNagement (MILLION), which consists of two main phases, i.e., return-related maximization and risk control. Specifically, in the return-related maximization phase, we introduce two auxiliary objectives, i.e., return rate prediction, and return rate ranking, combined with portfolio optimization to remit the overfitting problem and improve the generalization of the trained model to future markets. Subsequently, in the risk control phase, we propose two methods, i.e., portfolio interpolation and portfolio improvement, to achieve fine-grained risk control and fast risk adaption to a user-specified risk level. For the portfolio interpolation method, we theoretically prove that the risk can be perfectly controlled if the to-be-set risk level is in a proper interval. In addition, we also show that the return rate of the adjusted portfolio after portfolio interpolation is no less than that of the min-variance optimization, as long as the model in the reward maximization phase is effective. Furthermore, the portfolio improvement method can achieve greater return rates while keeping the same risk level compared to portfolio interpolation. Extensive experiments are conducted on three real-world datasets. The results demonstrate the effectiveness and efficiency of the proposed framework.
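
The abstract does not give the interpolation formula. One way to picture "portfolio interpolation" is a convex blend between a low-risk portfolio (for example the min-variance one) and the model's return-seeking portfolio, with the blend weight chosen so the portfolio variance hits the requested level. The sketch below is that reading only; the function names and the use of bisection are assumptions.

```python
def portfolio_variance(weights, cov):
    """Variance w^T * Sigma * w for a weight list and covariance matrix."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n))


def interpolate_to_risk(w_low, w_high, cov, target_var, iters=60):
    """Blend w_low and w_high so the result has variance close to target_var.

    Assumes variance grows monotonically along the blend, which holds when
    w_low is the min-variance portfolio.
    """
    def blend(alpha):
        return [(1 - alpha) * a + alpha * b for a, b in zip(w_low, w_high)]

    if not portfolio_variance(w_low, cov) <= target_var <= portfolio_variance(w_high, cov):
        raise ValueError("target risk lies outside the reachable interval")

    lo, hi = 0.0, 1.0
    for _ in range(iters):  # bisection on the blend weight
        mid = (lo + hi) / 2
        if portfolio_variance(blend(mid), cov) < target_var:
            lo = mid
        else:
            hi = mid
    return blend((lo + hi) / 2)
```

The `ValueError` branch mirrors the abstract's condition that the risk can be controlled exactly only when the requested level lies in a proper interval, here the range between the variances of the two endpoint portfolios.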

Machine Learning

[LG-0] Soft Checksums to Flag Untrustworthy Machine Learning Surrogate Predictions and Application to Atomic Physics Simulations

Link: https://arxiv.org/abs/2412.03497
Authors: Casey Lauer, Robert C. Blake, Jonathan B. Freund
Keywords: Trained neural networks, replace costly calculations, Trained neural, neural networks, physical simulations
Categories: Machine Learning (cs.LG); Atomic Physics (physics.atom-ph)
Notes: 8 pages, 3 figures

Abstract:Trained neural networks (NN) are attractive as surrogate models to replace costly calculations in physical simulations, but are often unknowingly applied to states not adequately represented in the training dataset. We present the novel technique of soft checksums for scientific machine learning, a general-purpose method to differentiate between trustworthy predictions with small errors on in-distribution (ID) data points, and untrustworthy predictions with large errors on out-of-distribution (OOD) data points. By adding a check node to the existing output layer, we train the model to learn the chosen checksum function encoded within the NN predictions and show that violations of this function correlate with high prediction errors. As the checksum function depends only on the NN predictions, we can calculate the checksum error for any prediction with a single forward pass, incurring negligible time and memory costs. Additionally, we find that incorporating the checksum function into the loss function and exposing the NN to OOD data points during the training process improves separation between ID and OOD predictions. By applying soft checksums to a physically complex and high-dimensional non-local thermodynamic equilibrium atomic physics dataset, we show that a well-chosen threshold checksum error can effectively separate ID and OOD predictions.
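
To make the check-node idea concrete, the sketch below treats the last element of a prediction vector as the check node and uses the plain sum of the other outputs as the checksum function. The abstract says only that a checksum function is "chosen", so the sum, the threshold value, and the function names are hypothetical.

```python
def checksum_error(prediction):
    """Absolute gap between the check node and the checksum of the outputs.

    prediction: list of network outputs whose last element is the check node.
    """
    *outputs, check = prediction
    return abs(check - sum(outputs))  # assumed checksum function: plain sum


def is_trustworthy(prediction, threshold=0.05):
    """Flag a prediction as in-distribution when its checksum error is small."""
    return checksum_error(prediction) <= threshold
```

Because the error depends only on the network's own outputs, it comes for free with the forward pass, which is the "negligible time and memory cost" property the abstract highlights.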

[LG-1] Convolutional Neural Networks and Mixture of Experts for Intrusion Detection in 5G Networks and beyond

Link: https://arxiv.org/abs/2412.03483
Authors: Loukas Ilias, George Doukas, Vangelis Lamprou, Christos Ntanos, Dimitris Askounis
Keywords: including extreme capacity, extreme capacity, including Logistic Regression, Artificial Intelligence algorithms, advanced Artificial Intelligence
Categories: Machine Learning (cs.LG)
Notes:

Abstract:The advent of 6G/NextG networks comes along with a series of benefits, including extreme capacity, reliability, and efficiency. However, these networks may become vulnerable to new security threats. Therefore, 6G/NextG networks must be equipped with advanced Artificial Intelligence algorithms, in order to evade these attacks. Existing studies on the intrusion detection task rely on the train of shallow machine learning classifiers, including Logistic Regression, Decision Trees, and so on, yielding suboptimal performance. Others are based on deep neural networks consisting of static components, which are not conditional on the input. This limits their representation power and efficiency. To resolve these issues, we present the first study integrating Mixture of Experts (MoE) for identifying malicious traffic. Specifically, we use network traffic data and convert the 1D array of features into a 2D matrix. Next, we pass this matrix through convolutional neural network (CNN) layers followed by batch normalization and max pooling layers. After obtaining the representation vector via the CNN layers, a sparsely gated MoE layer is used. This layer consists of a set of experts (dense layers) and a router, where the router assigns weights to the output of each expert. Sparsity is achieved by choosing the most relevant experts of the total ones. Finally, we perform a series of ablation experiments to prove the effectiveness of our proposed model. Experiments are conducted on the 5G-NIDD dataset, a network intrusion detection dataset generated from a real 5G test network. Results show that our introduced approach reaches weighted F1-score up to 99.95% achieving comparable performance to existing approaches. Findings also show that our proposed model achieves multiple advantages over state-of-the-art approaches.
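
The sparsely gated MoE layer described above (a router weights expert outputs, and only the most relevant experts are used) can be sketched with top-k gating. This is a generic illustration in plain Python, not the paper's implementation: the value of `k`, the softmax over the selected logits, and the representation of experts as callables are assumptions.

```python
import math


def top_k_gate(router_logits, k):
    """Keep the k largest router logits and softmax over just those."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}  # numerically stable
    norm = sum(exps.values())
    return {i: value / norm for i, value in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_gate(router_logits, k)
    output = None
    for index, weight in gates.items():
        expert_out = experts[index](x)  # unselected experts are never evaluated
        if output is None:
            output = [0.0] * len(expert_out)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output
```

In the pipeline from the abstract, `x` would be the representation vector produced by the CNN layers and `router_logits` would come from a learned router applied to that same vector.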

[LG-2] Cluster Specific Representation Learning

Link: https://arxiv.org/abs/2412.03471
Authors: Mahalakshmi Sabanayagam, Omar Al-Dabooni, Pascal Esser
Keywords: meaningful lower-dimensional embeddings, extract meaningful lower-dimensional, meaningful lower-dimensional, Representation learning aims, Representation
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Representation learning aims to extract meaningful lower-dimensional embeddings from data, known as representations. Despite its widespread application, there is no established definition of a “good” representation. Typically, the representation quality is evaluated based on its performance in downstream tasks such as clustering, de-noising, etc. However, this task-specific approach has a limitation where a representation that performs well for one task may not necessarily be effective for another. This highlights the need for a more agnostic formulation, which is the focus of our work. We propose a downstream-agnostic formulation: when inherent clusters exist in the data, the representations should be specific to each cluster. Under this idea, we develop a meta-algorithm that jointly learns cluster-specific representations and cluster assignments. As our approach is easy to integrate with any representation learning framework, we demonstrate its effectiveness in various setups, including Autoencoders, Variational Autoencoders, Contrastive learning models, and Restricted Boltzmann Machines. We qualitatively compare our cluster-specific embeddings to standard embeddings and downstream tasks such as de-noising and clustering. While our method slightly increases runtime and parameters compared to the standard model, the experiments clearly show that it extracts the inherent cluster structures in the data, resulting in improved performance in relevant applications.

[LG-3] State Frequency Estimation for Anomaly Detection

Link: https://arxiv.org/abs/2412.03442
Authors: Clinton Cao, Agathe Blaise, Annibale Panichella, Sicco Verwer
Keywords: studied the efficacy, works typically learn, state machines, scores, SEQUENT
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 9 pages

Abstract:Many works have studied the efficacy of state machines for detecting anomalies within NetFlows. These works typically learn a model from unlabeled data and compute anomaly scores for arbitrary traces based on their likelihood of occurrence or how well they fit within the model. However, these methods do not dynamically adapt their scores based on the traces seen at test time. This becomes a problem when an adversary produces seemingly common traces in their attack, causing the model to miss the detection by assigning low anomaly scores. We propose SEQUENT, a new approach that uses the state visit frequency to adapt its scoring for anomaly detection dynamically. SEQUENT subsequently uses the scores to generate root causes for anomalies. These allow the grouping of alarms and simplify the analysis of anomalies. Our evaluation of SEQUENT on three NetFlow datasets indicates that our approach outperforms existing methods, demonstrating its effectiveness in detecting anomalies.

[LG-4] Assessing Foundation Models Transferability to Physiological Signals in Precision Medicine

Link: https://arxiv.org/abs/2412.03427
Authors: Matthias Christenson, Cove Geary, Brian Locke, Pranav Koirala, Warren Woodrich Pettine
Keywords: heterogeneous patient populations, patient populations, precision medicine, effectively process, process and interpret
Categories: Machine Learning (cs.LG)
Comments: Presented at the precision medicine workshop at the AI in Medicine conference (2024) in Salt Lake City

Abstract:The success of precision medicine requires computational models that can effectively process and interpret diverse physiological signals across heterogeneous patient populations. While foundation models have demonstrated remarkable transfer capabilities across various domains, their effectiveness in handling individual-specific physiological signals - crucial for precision medicine - remains largely unexplored. This work introduces a systematic pipeline for rapidly and efficiently evaluating foundation models’ transfer capabilities in medical contexts. Our pipeline employs a three-stage approach. First, it leverages physiological simulation software to generate diverse, clinically relevant scenarios, particularly focusing on data-scarce medical conditions. This simulation-based approach enables both targeted capability assessment and subsequent model fine-tuning. Second, the pipeline projects these simulated signals through the foundation model to obtain embeddings, which are then evaluated using linear methods. This evaluation quantifies the model’s ability to capture three critical aspects: physiological feature independence, temporal dynamics preservation, and medical scenario differentiation. Finally, the pipeline validates these representations through specific downstream medical tasks. Initial testing of our pipeline on the Moirai time series foundation model revealed significant limitations in physiological signal processing, including feature entanglement, temporal dynamics distortion, and reduced scenario discrimination. These findings suggest that current foundation models may require substantial architectural modifications or targeted fine-tuning before deployment in clinical settings.

[LG-5] Deep Operator BSDE: a Numerical Scheme to Approximate the Solution Operators

Link: https://arxiv.org/abs/2412.03405
Authors: Giulia Di Nunno, Pere Díaz Lozano
Keywords: Stochastic Differential Equation, Backward Stochastic Differential, dynamic risk measures, Differential Equation, Backward Stochastic
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Motivated by dynamic risk measures and conditional $g$-expectations, in this work we propose a numerical method to approximate the solution operator given by a Backward Stochastic Differential Equation (BSDE). The main ingredients for this are the Wiener chaos decomposition and the classical Euler scheme for BSDEs. We show convergence of this scheme under very mild assumptions, and provide a rate of convergence in more restrictive cases. We then implement it using neural networks, and we present several numerical examples where we can check the accuracy of the method.

[LG-6] Can neural operators always be continuously discretized?

Link: https://arxiv.org/abs/2412.03393
Authors: Takashi Furuya, Michael Puthawala, Maarten V. de Hoop, Matti Lassas
Keywords: including skip connections, framework including skip, neural operators, Hilbert spaces, neural
Categories: Machine Learning (cs.LG)

Abstract:We consider the problem of discretization of neural operators between Hilbert spaces in a general framework including skip connections. We focus on bijective neural operators through the lens of diffeomorphisms in infinite dimensions. Framed using category theory, we give a no-go theorem that shows that diffeomorphisms between Hilbert spaces or Hilbert manifolds may not admit any continuous approximations by diffeomorphisms on finite-dimensional spaces, even if the approximations are nonlinear. The natural way out is the introduction of strongly monotone diffeomorphisms and layerwise strongly monotone neural operators which have continuous approximations by strongly monotone diffeomorphisms on finite-dimensional spaces. For these, one can guarantee discretization invariance, while ensuring that finite-dimensional approximations converge not only as sequences of functions, but that their representations converge in a suitable sense as well. Finally, we show that bilipschitz neural operators may always be written in the form of an alternating composition of strongly monotone neural operators, plus a simple isometry. Thus we realize a rigorous platform for discretization of a generalization of a neural operator. We also show that neural operators of this type may be approximated through the composition of finite-rank residual neural operators, where each block is strongly monotone, and may be inverted locally via iteration. We conclude by providing a quantitative approximation result for the discretization of general bilipschitz neural operators.

[LG-7] Risk-aware Classification via Uncertainty Quantification

Link: https://arxiv.org/abs/2412.03391
Authors: Murat Sensoy, Lance M. Kaplan, Simon Julier, Maryam Saleki, Federico Cerutti
Keywords: deep learning models, models to improve, Evidential Deep Learning, deep learning, learning models
Categories: Machine Learning (cs.LG)
Comments: Accepted for publication in Expert Systems with Applications

Abstract:Autonomous and semi-autonomous systems are using deep learning models to improve decision-making. However, deep classifiers can be overly confident in their incorrect predictions, a major issue especially in safety-critical domains. The present study introduces three foundational desiderata for developing real-world risk-aware classification systems. Expanding upon the previously proposed Evidential Deep Learning (EDL), we demonstrate the unity between these principles and EDL’s operational attributes. We then augment EDL empowering autonomous agents to exercise discretion during structured decision-making when uncertainty and risks are inherent. We rigorously examine empirical scenarios to substantiate these theoretical innovations. In contrast to existing risk-aware classifiers, our proposed methodologies consistently exhibit superior performance, underscoring their transformative potential in risk-conscious classification strategies.

[LG-8] Reactive Orchestration for Hierarchical Federated Learning Under a Communication Cost Budget

Link: https://arxiv.org/abs/2412.03385
Authors: Ivan Čilić, Anna Lackinger, Pantelis Frangoudis, Ivana Podnar Žarko, Alireza Furutanpey, Ilir Murturi, Schahram Dustdar
Keywords: Hierarchical Federated Learning, Federated Learning, requires careful organization, Deploying a Hierarchical, Hierarchical Federated
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:Deploying a Hierarchical Federated Learning (HFL) pipeline across the computing continuum (CC) requires careful organization of participants into a hierarchical structure with intermediate aggregation nodes between FL clients and the global FL server. This is challenging to achieve due to (i) cost constraints, (ii) varying data distributions, and (iii) the volatile operating environment of the CC. In response to these challenges, we present a framework for the adaptive orchestration of HFL pipelines, designed to be reactive to client churn and infrastructure-level events, while balancing communication cost and ML model accuracy. Our mechanisms identify and react to events that cause HFL reconfiguration actions at runtime, building on multi-level monitoring information (model accuracy, resource availability, resource cost). Moreover, our framework introduces a generic methodology for estimating reconfiguration costs to continuously re-evaluate the quality of adaptation actions, while being extensible to optimize for various HFL performance criteria. By extending the Kubernetes ecosystem, our framework demonstrates the ability to react promptly and effectively to changes in the operating environment, making the best of the available communication cost budget and effectively balancing costs and ML performance at runtime.

[LG-9] Granular Ball Twin Support Vector Machine with Universum Data

Link: https://arxiv.org/abs/2412.03375
Authors: M. A. Ganaie, Vrushank Ahire
Keywords: support vector machines, Twin Support Vector, Universum data, support vector, Ball Twin Support
Categories: Machine Learning (cs.LG)

Abstract:Classification with support vector machines (SVM) often suffers from limited performance when relying solely on labeled data from target classes and is sensitive to noise and outliers. Incorporating prior knowledge from Universum data and more robust data representations can enhance accuracy and efficiency. Motivated by these findings, we propose a novel Granular Ball Twin Support Vector Machine with Universum Data (GBU-TSVM) that extends the TSVM framework to leverage both Universum samples and granular ball computing during model training. Unlike existing TSVM methods, the proposed GBU-TSVM represents data instances as hyper-balls rather than points in the feature space. This innovative approach improves the model’s robustness and efficiency, particularly in handling noisy and large datasets. By grouping data points into granular balls, the model achieves superior computational efficiency, increased noise resistance, and enhanced interpretability. Additionally, the inclusion of Universum data, which consists of samples that are not strictly from the target classes, further refines the classification boundaries. This integration enriches the model with contextual information, refining classification boundaries and boosting overall accuracy. Experimental results on UCI benchmark datasets demonstrate that the GBU-TSVM outperforms existing TSVM models in both accuracy and computational efficiency. These findings highlight the potential of the GBU-TSVM model in setting a new standard in data representation and classification.

[LG-10] On Approximability of $\ell_2^2$ Min-Sum Clustering

Link: https://arxiv.org/abs/2412.03332
Authors: Karthik C. S., Euiwoong Lee, Yuval Rabani, Chris Schwiegelshohn, Samson Zhou
Keywords: ell, min-sum, clustering, clustering problem, Johnson Coverage Hypothesis
Categories: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computational Geometry (cs.CG); Machine Learning (cs.LG)

Abstract:The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^{k}\sum_{p,q\in C_i}\|p-q\|_2^2$. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it is not known whether it is NP-hard to approximate $\ell_2^2$ min-sum $k$-clustering beyond a certain factor. In this paper, we give the first hardness-of-approximation result for the $\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than 1.056 and moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than 1.327. We then complement our hardness result by giving the first $(1+\varepsilon)$-coreset construction for $\ell_2^2$ min-sum $k$-clustering. Our coreset uses $\mathcal{O}\left(k^{\varepsilon^{-4}}\right)$ space and can be leveraged to achieve a polynomial-time approximation scheme with runtime $nd\cdot f(k,\varepsilon^{-1})$, where $d$ is the underlying dimension of the input dataset and $f$ is a fixed function. Finally, we consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label $i\in[k]$ for each input point, thereby implicitly partitioning the input dataset into $k$ clusters that induce an approximately optimal solution, up to some amount of adversarial error $\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$ min-sum $k$-clustering, for a fixed constant $\gamma>0$.
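
To make the objective concrete, the following sketch evaluates the min-sum cost of a given partition by summing squared Euclidean distances over all ordered pairs inside each cluster. It also checks the standard identity that, for a cluster $C$ with centroid $\mu$, the ordered-pair sum equals $2|C|\sum_{p\in C}\|p-\mu\|_2^2$. The function names and the toy points are illustrative assumptions, and the sketch only evaluates the cost; it does not implement any algorithm from the paper.

```python
def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def min_sum_cost(clusters):
    # Sum of squared distances over all ordered pairs (p, q) within each cluster.
    return sum(sq_dist(p, q) for c in clusters for p in c for q in c)

def centroid_form(clusters):
    # Equivalent form: 2 * |C| * (sum of squared distances to the centroid).
    total = 0.0
    for c in clusters:
        dim = len(c[0])
        mu = [sum(p[j] for p in c) / len(c) for j in range(dim)]
        total += 2 * len(c) * sum(sq_dist(p, mu) for p in c)
    return total

# Toy partition of five points in the plane into two clusters.
clusters = [
    [(0, 0), (2, 0)],
    [(5, 5), (5, 7), (8, 5)],
]
cost = min_sum_cost(clusters)
```

The centroid form shows why the objective penalizes large clusters more heavily than k-means does: each cluster's spread is weighted by its size.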

[LG-11] Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis

Link: https://arxiv.org/abs/2412.03321
Authors: Zerui Tao, Toshihisa Tanaka, Qibin Zhao
Keywords: Tensor decompositions play, Bayesian Tensor Ring, multi-way data analysis, Automatic Relevance Determination, numerous applications related
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICONIP 2023

Abstract:Tensor decompositions play a crucial role in numerous applications related to multi-way data analysis. By employing a Bayesian framework with sparsity-inducing priors, Bayesian Tensor Ring (BTR) factorization offers probabilistic estimates and an effective approach for automatically adapting the tensor ring rank during the learning process. However, the previous BTR method employs an Automatic Relevance Determination (ARD) prior, which can lead to sub-optimal solutions. Besides, it focuses solely on continuous data, whereas many applications involve discrete data. More importantly, it relies on the Coordinate-Ascent Variational Inference (CAVI) algorithm, which is inadequate for handling large tensors with extensive observations. These limitations greatly restrict its scale and scope of application, making it suitable only for small-scale problems, such as image/video completion. To address these issues, we propose a novel BTR model that incorporates a nonparametric Multiplicative Gamma Process (MGP) prior, known for its superior accuracy in identifying latent structures. To handle discrete data, we introduce the Pólya-Gamma augmentation for closed-form updates. Furthermore, we develop an efficient Gibbs sampler for consistent posterior simulation, which reduces the computational complexity of the previous VI algorithm by two orders of magnitude, and an online EM algorithm that is scalable to extremely large tensors. To showcase the advantages of our model, we conduct extensive experiments on both simulation data and real-world applications.

[LG-12] FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Link: https://arxiv.org/abs/2412.03317
Authors: Vincent Abbott, Gioele Zardini
Keywords: Optimizing deep learning, manual derivation, Optimizing deep, requires slow, potentially leaving
Categories: Machine Learning (cs.LG)

Abstract:Optimizing deep learning algorithms currently requires slow, manual derivation, potentially leaving much performance untapped. Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers, but required three iterations over three years. Automated compiled methods have consistently lagged behind. GPUs are limited by both transfers to processors and available compute, with transfer bandwidth having improved at a far slower pace. Already, transfer bandwidth accounts for 46% of GPU energy costs. This indicates that the future of energy- and capital-efficient algorithms relies on improved consideration of transfer costs (IO-awareness) and a systematic method for deriving optimized algorithms. In this paper, we present a diagrammatic approach to deep learning models which, with simple relabelings, derives optimal implementations and performance models that consider low-level memory. Diagrams generalize down the GPU hierarchy, providing a universal performance model for comparing hardware and quantization choices. Diagrams generate pseudocode, which reveals the application of hardware-specific features such as coalesced memory access, tensor core operations, and overlapped computation. We present attention algorithms for Ampere, which fits 13 warps per SM (FlashAttention fits 8), and for Hopper, which has improved overlapping and may achieve 1.32 PFLOPs.

[LG-13] Conveying Emotions to Robots through Touch and Sound

Link: https://arxiv.org/abs/2412.03300
Authors: Qiaoqiao Ren, Remko Proesmans, Frederick Bossuyt, Jan Vanfleteren, Francis Wyffels, Tony Belpaeme
Keywords: Human emotions, nuanced touch gestures, Human, touch, touch gestures
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Human emotions can be conveyed through nuanced touch gestures. However, there is a lack of understanding of how consistently emotions can be conveyed to robots through touch. This study explores the consistency of touch-based emotional expression toward a robot by integrating tactile and auditory sensory reading of affective haptic expressions. We developed a piezoresistive pressure sensor and used a microphone to mimic touch and sound channels, respectively. In a study with 28 participants, each conveyed 10 emotions to a robot using spontaneous touch gestures. Our findings reveal a statistically significant consistency in emotion expression among participants. However, some emotions obtained low intraclass correlation values. Additionally, certain emotions with similar levels of arousal or valence did not exhibit significant differences in the way they were conveyed. We subsequently constructed a multi-modal model integrating touch and audio features to decode the 10 emotions. A support vector machine (SVM) model demonstrated the highest accuracy, achieving 40% for 10 classes, with “Attention” being the most accurately conveyed emotion at a balanced accuracy of 87.65%.

[LG-14] Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning

Link: https://arxiv.org/abs/2412.03258
Authors: Mianchu Wang, Yue Jin, Giovanni Montana
Keywords: Offline reinforcement learning, Offline reinforcement, learn optimal policies, seeks to learn, learn optimal
Categories: Machine Learning (cs.LG)

Abstract:Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.

[LG-15] Variable-Speed Teaching-Playback as Real-World Data Augmentation for Imitation Learning

Link: https://arxiv.org/abs/2412.03252
Authors: Nozomu Masuya, Hiroshi Sato, Koki Yamane, Takuya Kusume, Sho Sakaino, Toshiaki Tsuji
Keywords: data augmentation, real-world data augmentation, data, imitation learning relies, shortage of training
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 16 pages, 12 figures, 4 tables. This is a preprint of an article submitted for consideration in ADVANCED ROBOTICS, copyright Taylor & Francis and Robotics Society of Japan; ADVANCED ROBOTICS is available online at this http URL

Abstract:Because imitation learning relies on human demonstrations in hard-to-simulate settings, the inclusion of force control in this method has resulted in a shortage of training data, even with a simple change in speed. Although the field of data augmentation has addressed the lack of data, conventional methods of data augmentation for robot manipulation are limited to simulation-based methods or downsampling for position control. This paper proposes a novel method of data augmentation that is applicable to force control and preserves the advantages of real-world datasets. We applied teaching-playback at variable speeds as real-world data augmentation to increase both the quantity and quality of environmental reactions at variable speeds. An experiment was conducted on bilateral control-based imitation learning using a method of imitation learning equipped with position-force control. We evaluated the effect of real-world data augmentation on two tasks, pick-and-place and wiping, at variable speeds, each from two human demonstrations at fixed speed. The results showed a maximum 55% increase in success rate from a simple change in speed of real-world reactions and improved accuracy along the duration/frequency command by gathering environmental reactions at variable speeds.

[LG-16] Dynamic Consistent k-Center Clustering with Optimal Recourse

Link: https://arxiv.org/abs/2412.03238
Authors: Sebastian Forster, Antonis Skarlatos
Keywords: arbitrary metric space, minimum number, arbitrary metric, metric space, algorithm
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: In Proceedings SODA 2025

Abstract:Given points from an arbitrary metric space and a sequence of point updates sent by an adversary, what is the minimum recourse per update (i.e., the minimum number of changes needed to the set of centers after an update), in order to maintain a constant-factor approximation to a $k$-clustering problem? This question has received attention in recent years under the name consistent clustering. Previous works by Lattanzi and Vassilvitskii [ICML '17] and Fichtenberger, Lattanzi, Norouzi-Fard, and Svensson [SODA '21] studied $k$-clustering objectives, including the $k$-center and the $k$-median objectives, under only point insertions. In this paper we study the $k$-center objective in the fully dynamic setting, where the update is either a point insertion or a point deletion. Before our work, Łącki, Haeupler, Grunau, Rozhoň, and Jayaram [SODA '24] gave a deterministic fully dynamic constant-factor approximation algorithm for the $k$-center objective with worst-case recourse of 2 per update. In this work, we prove that the $k$-center clustering problem admits optimal recourse bounds by developing a deterministic fully dynamic constant-factor approximation algorithm with worst-case recourse of 1 per update. Moreover, our algorithm performs simple choices based on light data structures, and thus is arguably more direct and faster than the previous one, which uses a sophisticated combinatorial structure. Additionally, we develop a new deterministic decremental algorithm and a new deterministic incremental algorithm, both of which maintain a 6-approximate $k$-center solution with worst-case recourse of 1 per update. Our incremental algorithm improves over the 8-approximation algorithm by Charikar, Chekuri, Feder, and Motwani [STOC '97]. Finally, we remark that since all three of our algorithms are deterministic, they work against an adaptive adversary.

[LG-17] Channel Reflection: Knowledge-Driven Data Augmentation for EEG-Based Brain-Computer Interfaces

Link: https://arxiv.org/abs/2412.03224
Authors: Ziwei Wang, Siyang Li, Jingwei Luo, Jiajing Liu, Dongrui Wu
Keywords: enables direct communication, brain-computer interface, enables direct, external devices, data augmentation
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:A brain-computer interface (BCI) enables direct communication between the human brain and external devices. Electroencephalography (EEG) based BCIs are currently the most popular for able-bodied users. To increase user-friendliness, usually a small amount of user-specific EEG data are used for calibration, which may not be enough to develop a pure data-driven decoding model. To cope with this typical calibration data shortage challenge in EEG-based BCIs, this paper proposes a parameter-free channel reflection (CR) data augmentation approach that incorporates prior knowledge on the channel distributions of different BCI paradigms in data augmentation. Experiments on eight public EEG datasets across four different BCI paradigms (motor imagery, steady-state visual evoked potential, P300, and seizure classifications) using different decoding algorithms demonstrated that: 1) CR is effective, i.e., it can noticeably improve the classification accuracy; 2) CR is robust, i.e., it consistently outperforms existing data augmentation approaches in the literature; and, 3) CR is flexible, i.e., it can be combined with other data augmentation approaches to further increase the performance. We suggest that data augmentation approaches like CR should be an essential step in EEG-based BCIs. Our code is available online.

[LG-18] Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges

Link: https://arxiv.org/abs/2412.03220
Authors: Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique
Keywords: generating coherent responses, Large Language Models, learning models adept, deep learning models, understanding natural language
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.

[LG-19] Node Classification With Integrated Reject Option

Link: https://arxiv.org/abs/2412.03190
Authors: Uday Bhaskar, Jayadratha Gayen, Charu Sharma, Naresh Manwani
Keywords: Graph neural networks, Graph neural, graph learning, key tasks, neural networks
Categories: Machine Learning (cs.LG)

Abstract:One of the key tasks in graph learning is node classification. While graph neural networks have been used for various applications, their adaptation to the reject option setting has not previously been explored. In this paper, we propose NCwR, a novel approach to node classification in Graph Neural Networks (GNNs) with an integrated reject option, which allows the model to abstain from making predictions when uncertainty is high. We propose both cost-based and coverage-based methods for classification with abstention in the node classification setting using GNNs. We perform experiments using our method on three standard citation network datasets (Cora, Citeseer, and Pubmed) and compare with relevant baselines. We also model the legal judgment prediction problem on the ILDC dataset as a node classification problem where nodes represent legal cases and edges represent citations. We further interpret the model by analyzing the cases on which it abstains and visualizing which parts of the input features influenced this decision.
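
As a rough illustration of the cost-based reject option mentioned in this abstract, the sketch below applies the classical Chow-style rule to a vector of class probabilities: abstain whenever the expected 0-1 loss of the best guess exceeds a fixed rejection cost. This is a generic textbook rule used here as an assumption; it is not claimed to be the NCwR method, and `predict_with_reject` and `reject_cost` are hypothetical names.

```python
def predict_with_reject(probs, reject_cost):
    # probs: predicted class probabilities for one node (assumed to sum to 1).
    # reject_cost: cost of abstaining, relative to a cost of 1 for an error.
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Expected 0-1 loss of predicting `best` is 1 - probs[best].
    if 1.0 - probs[best] > reject_cost:
        return None  # abstain: guessing is expected to cost more than rejecting
    return best

confident = predict_with_reject([0.9, 0.05, 0.05], reject_cost=0.2)
uncertain = predict_with_reject([0.4, 0.35, 0.25], reject_cost=0.2)
```

A coverage-based variant would instead pick the confidence threshold so that a target fraction of nodes receives a prediction.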

[LG-20] Topological Trajectory Classification and Landmark Inference on Simplicial Complexes

Link: https://arxiv.org/abs/2412.03145
Authors: Vincent P. Grande, Josef Hoppe, Florian Frantzen, Michael T. Schaub
Keywords: discrete or discretised, manifold modelled, classifying trajectories, Hodge Laplacian, harmonic space
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, Accepted at the 58th Annual Asilomar Conference on Signals, Systems, and Computers 2024

Abstract:We consider the problem of classifying trajectories on a discrete or discretised 2-dimensional manifold modelled by a simplicial complex. Previous works have proposed to project the trajectories into the harmonic eigenspace of the Hodge Laplacian, and then cluster the resulting embeddings. However, if the considered space has vanishing homology (i.e., no “holes”), then the harmonic space of the 1-Hodge Laplacian is trivial and thus the approach fails. Here we propose to view this issue akin to a sensor placement problem and present an algorithm that aims to learn “optimal holes” to distinguish a set of given trajectory classes. Specifically, given a set of labelled trajectories, which we interpret as edge-flows on the underlying simplicial complex, we search for 2-simplices whose deletion results in an optimal separation of the trajectory labels according to the corresponding spectral embedding of the trajectories into the harmonic space. Finally, we generalise this approach to the unsupervised setting.

[LG-21] Unifying KV Cache Compression for Large Language Models with LeanKV

Link: https://arxiv.org/abs/2412.03131
Authors: Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen
Keywords: Large language models, demonstrate exceptional performance, substantial memory demands, Large language, incur high serving
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Large language models (LLMs) demonstrate exceptional performance but incur high serving costs due to substantial memory demands, with the key-value (KV) cache being a primary bottleneck. Existing KV cache compression methods, including quantization and pruning, struggle with limitations such as uniform treatment of keys and values and static memory allocation across attention heads. To address these challenges, we introduce LeanKV, a unified KV cache compression framework that enhances LLM serving efficiency without compromising accuracy through three innovations: (1) Hetero-KV quantization, which stores keys at a higher precision than values to reflect their greater impact on attention computations; (2) per-head dynamic sparsity, which allocates memory based on token importance per head and per request; and (3) unified KV compression, integrating mixed-precision quantization and selective pruning to enable a smooth tradeoff between model accuracy and memory efficiency. To efficiently support these techniques, LeanKV introduces systems optimizations including unified paging and on-GPU parallel memory management. Implemented on vLLM, LeanKV compresses the KV cache by $3.0\times$ to $5.0\times$ without accuracy loss and up to $11.0\times$ with under 5% accuracy loss, enhancing throughput by $1.9\times$ to $2.5\times$, and up to $6.9\times$.
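
The idea of storing keys at higher precision than values can be sketched with plain uniform min-max quantization. The bit widths below (8 bits for keys, 4 bits for values) and all function names are assumptions made for illustration; the abstract does not state LeanKV's actual bit widths or quantization scheme.

```python
def quantize(values, bits):
    # Uniform min-max quantization of a list of floats to integer codes.
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    # Map integer codes back to approximate float values.
    return [lo + c * scale for c in codes]

def max_error(values, bits):
    # Worst-case reconstruction error after a quantize/dequantize round trip.
    codes, lo, scale = quantize(values, bits)
    restored = dequantize(codes, lo, scale)
    return max(abs(a - b) for a, b in zip(values, restored))

# Hypothetical per-token slices of a key vector and a value vector.
keys = [-1.0, -0.3, 0.1, 0.42, 1.0]
vals = [-1.0, -0.3, 0.1, 0.42, 1.0]
key_err = max_error(keys, bits=8)   # assumed higher precision for keys
val_err = max_error(vals, bits=4)   # assumed lower precision for values
```

On identical inputs the 8-bit round trip has a much smaller worst-case error than the 4-bit one, which is the tradeoff a mixed-precision scheme spends where it matters most.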

[LG-22] Sinkhorn Algorithm for Sequentially Composed Optimal Transports

Link: https://arxiv.org/abs/2412.03120
Authors: Kazuki Watanabe, Noboru Isobe
Keywords: including image processing, natural language processing, de-facto standard approximation, Sinkhorn algorithm, standard approximation algorithm
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: Preprint

Abstract:The Sinkhorn algorithm is the de facto standard approximation algorithm for optimal transport, which has been applied to a variety of applications, including image processing and natural language processing. In theory, the proof of its convergence follows from the convergence of the Sinkhorn–Knopp algorithm for the matrix scaling problem, and Altschuler et al. show that its worst-case time complexity is near-linear. Very recently, sequentially composed optimal transports were proposed by Watanabe and Isobe as a hierarchical extension of optimal transports. In this paper, we present an efficient approximation algorithm, namely the Sinkhorn algorithm for sequentially composed optimal transports, for its entropic regularization. Furthermore, we present a theoretical analysis of the Sinkhorn algorithm, namely (i) its exponential convergence to the optimal solution with respect to the Hilbert pseudometric, and (ii) a worst-case complexity analysis for the case of one sequential composition.
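
For orientation, the classical Sinkhorn iteration that this work builds on can be sketched as below. The sketch covers only standard entropic optimal transport between two histograms, not the sequentially composed variant proposed in the paper. The toy cost matrix, the regularization `eps=0.5`, and the iteration count are assumptions picked so that the example converges quickly.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT between histograms a and b; returns the transport plan."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Plan P = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two source bins, two target bins, unit cost for moving mass.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]
b = [0.3, 0.7]

plan = sinkhorn(cost, a, b)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
print(row_sums, col_sums)
```

The alternating row and column rescaling is the matrix scaling step whose convergence the abstract attributes to the Sinkhorn–Knopp algorithm.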

[LG-23] Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing

Link: https://arxiv.org/abs/2412.03097
Authors: Wenyi Liu, Ziqi Zhang, Xinshi Li, Jiacheng Hu, Yuanshuai Luo, Junliang Du
Keywords: Graph Neural Networks, leveraging Graph Neural, network hierarchy deepens, Graph Neural, paper addresses key
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This paper addresses key challenges in enhancing recommendation systems by leveraging Graph Neural Networks (GNNs) and addressing inherent limitations such as over-smoothing, which reduces model effectiveness as network hierarchy deepens. The proposed approach introduces three GNN-based recommendation models, specifically designed to mitigate over-smoothing through innovative mechanisms like residual connections and identity mapping within the aggregation propagation process. These modifications enable more effective information flow across layers, preserving essential user-item interaction details to improve recommendation accuracy. Additionally, the study emphasizes the critical need for interpretability in recommendation systems, aiming to provide transparent and justifiable suggestions tailored to dynamic user preferences. By integrating collaborative filtering with GNN architectures, the proposed models not only enhance predictive accuracy but also align recommendations more closely with individual behaviors, adapting to nuanced shifts in user interests. This work advances the field by tackling both technical and user-centric challenges, contributing to the development of robust and explainable recommendation systems capable of managing the complexity and scale of modern online environments.

[LG-24] A Granger-Causal Perspective on Gradient Descent with Application to Pruning

Link: https://arxiv.org/abs/2412.03035
Authors: Aditya Shah, Aditya Challa, Sravan Danda, Archana Mathur, Snehanshu Saha
Keywords: Stochastic Gradient Descent, optimizing neural networks, Stochastic Gradient, Gradient Descent, optimizing neural
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Stochastic Gradient Descent (SGD) is the main approach to optimizing neural networks. Several generalization properties of deep networks, such as convergence to flatter minima, are believed to arise from SGD. This article explores the causality aspect of gradient descent. Specifically, we show that the gradient descent procedure has an implicit Granger-causal relationship between the reduction in loss and a change in parameters. By suitable modifications, we make this causal relationship explicit. A causal approach to gradient descent has many significant applications which allow greater control. In this article, we illustrate the significance of the causal approach using the application of pruning. The causal approach to pruning has several interesting properties: (i) We observe a phase shift as the percentage of pruned parameters increases. Such a phase shift is indicative of an optimal pruning strategy. (ii) After pruning, we see that the minima become “flatter”, explaining the increase in accuracy after pruning weights.

[LG-25] Learning Whole-Body Loco-Manipulation for Omni-Directional Task Space Pose Tracking with a Wheeled-Quadrupedal-Manipulator

Link: https://arxiv.org/abs/2412.03012
Authors: Kaiwen Jiang, Zhen Fu, Junde Guo, Wei Zhang, Hua Chen
Keywords: whole-body loco-manipulation problem, reinforcement learning, pose tracking, pose tracking problem, whole-body loco-manipulation
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we study the whole-body loco-manipulation problem using reinforcement learning (RL). Specifically, we focus on the problem of how to coordinate the floating base and the robotic arm of a wheeled-quadrupedal manipulator robot to achieve direct six-dimensional (6D) end-effector (EE) pose tracking in task space. Different from conventional whole-body loco-manipulation problems that track both floating-base and end-effector commands, the direct EE pose tracking problem requires inherent balance among redundant degrees of freedom in the whole-body motion. We leverage RL to solve this challenging problem. To address the associated difficulties, we develop a novel reward fusion module (RFM) that systematically integrates reward terms corresponding to different tasks in a nonlinear manner. In such a way, the inherent multi-stage and hierarchical feature of the loco-manipulation problem can be carefully accommodated. By combining the proposed RFM with a teacher-student RL training paradigm, we present a complete RL scheme to achieve 6D EE pose tracking for the wheeled-quadrupedal manipulator robot. Extensive simulation and hardware experiments demonstrate the significance of the RFM. In particular, we enable smooth and precise tracking performance, achieving state-of-the-art tracking position error of less than 5 cm, and rotation error of less than 0.1 rad. Please refer to this https URL for more experimental videos.

[LG-26] Data Acquisition for Improving Model Fairness using Reinforcement Learning

Link: https://arxiv.org/abs/2412.03009
Authors: Jahid Hasan, Romila Pradhan
Keywords: critical decision making, Machine learning systems, Machine learning, data points, machine learning model
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 19 pages, 9 figures

Abstract:Machine learning systems are increasingly being used in critical decision making such as healthcare, finance, and criminal justice. Concerns around their fairness have resulted in several bias mitigation techniques that emphasize the need for high-quality data to ensure fairer decisions. However, the role of earlier stages of machine learning pipelines in mitigating model bias has not been explored well. In this paper, we focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness. Since not all data points in a data pool are equally beneficial to the task of fairness, we generate an ordering in which data points should be acquired. We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire. Over several iterations, DataSift selects a partition and randomly samples a batch of data points from the selected partition, evaluates the benefit of acquiring the batch on model fairness, and updates the utility of partitions depending on the benefit. To further improve the effectiveness and efficiency of evaluating batches, we leverage influence functions that estimate the effect of acquiring a batch without retraining the model. We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring a few data points.

[LG-27] Provably Extending PageRank-based Local Clustering Algorithm to Weighted Directed Graphs with Self-Loops and to Hypergraphs

Link: https://arxiv.org/abs/2412.03008
Authors: Zihao Li, Dongqi Fu, Hengyu Liu, Jingrui He
Keywords: Local clustering aims, starting instances, aims to find, find a compact, Local clustering
Categories: Social and Information Networks (cs.SI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: Preprint, 42 pages

Abstract:Local clustering aims to find a compact cluster near the given starting instances. This work focuses on graph local clustering, which has broad applications beyond graphs because of the internal connectivities within various modalities. While most existing studies on local graph clustering adopt the discrete graph setting (i.e., unweighted graphs without self-loops), real-world graphs can be more complex. In this paper, we extend the non-approximating Andersen-Chung-Lang (“ACL”) algorithm beyond discrete graphs and generalize its quadratic optimality to a wider range of graphs, including weighted, directed, and self-looped graphs and hypergraphs. Specifically, leveraging PageRank, we propose two algorithms: GeneralACL for graphs and HyperACL for hypergraphs. We theoretically prove that, under two mild conditions, both algorithms can identify a quadratically optimal local cluster in terms of conductance with at least 1/2 probability. On the property of hypergraphs, we address a fundamental gap in the literature by defining conductance for hypergraphs from the perspective of hypergraph random walks. Additionally, we provide experiments to validate our theoretical findings.

[LG-28] How Many Ratings per Item are Necessary for Reliable Significance Testing?

Link: https://arxiv.org/abs/2412.02968
Authors: Christopher Homan, Flip Korn, Chris Welty
Keywords: machine learning evaluation, learning evaluation assume, machine learning, assume scores, data with unitary
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Most approaches to machine learning evaluation assume that machine and human responses are repeatable enough to be measured against data with unitary, authoritative, “gold standard” responses, via simple metrics such as accuracy, precision, and recall that assume scores are independent given the test item. However, AI models have multiple sources of stochasticity and the human raters who create gold standards tend to disagree with each other, often in meaningful ways, hence a single output response per input item may not provide enough information. We introduce methods for determining whether an (existing or planned) evaluation dataset has enough responses per item to reliably compare the performance of one model to another. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that there are usually not enough responses per item to reliably compare the performance of one model against another. Our methods also allow us to estimate the number of responses per item for hypothetical datasets with similar response distributions to the existing datasets we study. When two models are very far apart in their predictive performance, fewer raters are needed to confidently compare them, as expected. However, as the models draw closer, we find that a larger number of raters than is currently typical in annotation collection is needed to ensure that the power analysis correctly reflects the difference in performance.

[LG-29] Incorporating System-level Safety Requirements in Perception Models via Reinforcement Learning

Link: https://arxiv.org/abs/2412.02951
Authors: Weisi Fan, Jesse Lane, Qisai Liu, Soumik Sarkar, Tichakorn Wongpiromsarn
Keywords: established performance metrics, system-level safety, relying on established, metrics like accuracy, system-level safety objectives
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Perception components in autonomous systems are often developed and optimized independently of downstream decision-making and control components, relying on established performance metrics like accuracy, precision, and recall. Traditional loss functions, such as cross-entropy loss and negative log-likelihood, focus on reducing misclassification errors but fail to consider their impact on system-level safety, overlooking the varying severities of system-level failures caused by these errors. To address this limitation, we propose a novel training paradigm that augments the perception component with an understanding of system-level safety objectives. Central to our approach is the translation of system-level safety requirements, formally specified using the rulebook formalism, into safety scores. These scores are then incorporated into the reward function of a reinforcement learning framework for fine-tuning perception models with system-level safety objectives. Simulation results demonstrate that models trained with this approach outperform baseline perception models in terms of system-level safety.

[LG-30] BGTplanner: Maximizing Training Accuracy for Differentially Private Federated Recommenders via Strategic Privacy Budget Allocation

Link: https://arxiv.org/abs/2412.02934
Authors: Xianzhi Zhang, Yipeng Zhou, Miao Hu, Di Wu, Pengshan Liao, Mohsen Guizani, Michael Sheng
Keywords: user-item rating data, raw user-item rating, decentralized clients co-train, private federated recommender, federated recommender
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:To mitigate the rising concern about privacy leakage, the federated recommender (FR) paradigm emerges, in which decentralized clients co-train the recommendation model without exposing their raw user-item rating data. The differentially private federated recommender (DPFR) further enhances FR by injecting differentially private (DP) noises into clients. Yet, current DPFRs, suffering from noise distortion, cannot achieve satisfactory accuracy. Various efforts have been dedicated to improving DPFRs by adaptively allocating the privacy budget over the learning process. However, due to the intricate relation between privacy budget allocation and model accuracy, existing works are still far from maximizing DPFR accuracy. To address this challenge, we develop BGTplanner (Budget Planner) to strategically allocate the privacy budget for each round of DPFR training, improving overall training performance. Specifically, we leverage the Gaussian process regression and historical information to predict the change in recommendation accuracy with a certain allocated privacy budget. Additionally, Contextual Multi-Armed Bandit (CMAB) is harnessed to make privacy budget allocation decisions by reconciling the current improvement and long-term privacy constraints. Our extensive experimental results on real datasets demonstrate that BGTplanner achieves an average improvement of 6.76% in training performance compared to state-of-the-art baselines.

[LG-31] Harnessing Loss Decomposition for Long-Horizon Wave Predictions via Deep Neural Networks NEURIPS

Link: https://arxiv.org/abs/2412.02924
Authors: Indu Kant Deo, Rajeev Jaiman
Keywords: modeling complex physical, complex physical processes, long time horizons, Accurate prediction, time horizons
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, NeurIPS Machine Learning for Physical Sciences workshop

Abstract:Accurate prediction over long time horizons is crucial for modeling complex physical processes such as wave propagation. Although deep neural networks show promise for real-time forecasting, they often struggle with accumulating phase and amplitude errors as predictions extend over a long period. To address this issue, we propose a novel loss decomposition strategy that breaks down the loss into separate phase and amplitude components. This technique improves the long-term prediction accuracy of neural networks in wave propagation tasks by explicitly accounting for numerical errors, improving stability, and reducing error accumulation over extended forecasts.

[LG-32] Measuring Bias of Web-filtered Text Datasets and Bias Propagation Through Training

Link: https://arxiv.org/abs/2412.02857
Authors: Youssef Mansour, Reinhard Heckel
Keywords: dataset classification experiments, large language models, classification experiments, large language, pretraining datasets
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. Those biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: Random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.

[LG-33] Optimized IoT Intrusion Detection using Machine Learning Technique

Link: https://arxiv.org/abs/2412.02845
Authors: Muhammad Zawad Mahmud, Samiha Islam, Shahran Rahman Alve, Al Jubayer Pial
Keywords: Intrusion Detection System, Intrusion Detection, identify network intrusions, application of software, Classifier
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted in an international conference

Abstract:An application of software known as an Intrusion Detection System (IDS) employs machine learning algorithms to identify network intrusions. Selective logging, safeguarding privacy, reputation-based defense against numerous attacks, and dynamic response to threats are a few of the problems that intrusion identification is used to solve. The biological system known as IoT has seen a rapid increase in high dimensionality and information traffic. Self-protective mechanisms like intrusion detection systems (IDSs) are essential for defending against a variety of attacks. On the other hand, the functional and physical diversity of IoT IDS systems causes significant issues. These attributes make it troublesome and unrealistic to completely use all IoT elements and properties for IDS self-security. For peculiarity-based IDS, this study proposes and implements a novel component selection and extraction strategy (our strategy). A five-ML algorithm model-based IDS for machine learning-based networks with proper hyperparameter tuning is presented in this paper by examining how the most popular feature selection methods and classifiers are combined, such as K-Nearest Neighbors (KNN) Classifier, Decision Tree (DT) Classifier, Random Forest (RF) Classifier, Gradient Boosting Classifier, and Ada Boost Classifier. The Random Forest (RF) classifier had the highest accuracy of 99.39%. The K-Nearest Neighbor (KNN) classifier exhibited the lowest performance among the evaluated models, achieving an accuracy of 94.84%. This study’s models have a significantly higher performance rate than those used in previous studies, indicating that they are more reliable.

[LG-34] Batch Normalization Decomposed

Link: https://arxiv.org/abs/2412.02843
Authors: Ido Nachum, Marco Bondaschi, Michael Gastpar, Anatoly Khina
Keywords: successful building block, neural network architectures, Batch normalization, successful building
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Batch normalization is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: recentering the mean of the representation to zero, rescaling the variance of the representation to one, and finally applying a non-linearity. Our work follows the work of Hadi Daneshmand, Amir Joudaki, Francis Bach [NeurIPS '21], which studied deep linear neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster except for an odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned configuration.

[LG-35] Geographical Information Alignment Boosts Traffic Analysis via Transpose Cross-attention

Link: https://arxiv.org/abs/2412.02839
Authors: Xiangyu Jiang, Xiwen Chen, Hao Wang, Abolfazl Razi
Keywords: Graph Neural Networks, recent Graph Neural, Neural Networks, Graph Neural, enhancing road safety
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Traffic accident prediction is crucial for enhancing road safety and mitigating congestion, and recent Graph Neural Networks (GNNs) have shown promise in modeling the inherent graph-based traffic data. However, existing GNN-based approaches often overlook or do not explicitly exploit geographic position information, which often plays a critical role in understanding spatial dependencies. This is also aligned with our observation, where accident locations are often highly relevant. To address this issue, we propose a plug-in-and-play module for common GNN frameworks, termed Geographic Information Alignment (GIA). This module can efficiently fuse the node feature and geographic position information through a novel Transpose Cross-attention mechanism. Due to the large number of nodes for traffic data, the conventional cross-attention mechanism performing the node-wise alignment may be infeasible with limited computational resources. Instead, we take the transpose operation for Query, Key, and Value in the Cross-attention mechanism, which substantially reduces the computation cost while maintaining sufficient information. Experimental results for both traffic occurrence prediction and severity prediction (severity levels based on the interval of recorded crash counts) on large-scale city-wise datasets confirm the effectiveness of our proposed method. For example, our method can obtain gains ranging from 1.3% to 10.9% in F1 score and 0.3% to 4.8% in AUC.

[LG-36] RoboFail: Analyzing Failures in Robot Learning Policies

Link: https://arxiv.org/abs/2412.02818
Authors: Som Sagar, Ransalu Senanayake
Keywords: increasingly large datasets, trained on increasingly, overfit to specific, specific environments, large datasets
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Abstract:Despite being trained on increasingly large datasets, robot models often overfit to specific environments or datasets. Consequently, they excel within their training distribution but face challenges in generalizing to novel or unforeseen scenarios. This paper presents a method to proactively identify failure mode probabilities in robot manipulation policies, providing insights into where these models are likely to falter. To this end, since exhaustively searching over a large space of failures is infeasible, we propose a deep reinforcement learning-based framework, RoboFail. It is designed to detect scenarios prone to failure and quantify their likelihood, thus offering a structured approach to anticipate failures. By identifying these high-risk states in advance, RoboFail enables researchers and engineers to better understand the robustness limits of robot policies, contributing to the development of safer and more adaptable robotic systems.

[LG-37] Learning Koopman-based Stability Certificates for Unknown Nonlinear Systems

Link: https://arxiv.org/abs/2412.02807
Authors: Ruikun Zhou, Yiming Meng, Zhexuan Zeng, Jun Liu
Keywords: gained significant attention, Koopman operator theory, identifying discrete-time nonlinear, infinite-dimensional linear vector, discrete-time nonlinear systems
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: Submitted to L4DC 2025

Abstract:Koopman operator theory has gained significant attention in recent years for identifying discrete-time nonlinear systems by embedding them into an infinite-dimensional linear vector space. However, providing stability guarantees while learning the continuous-time dynamics, especially under conditions of relatively low observation frequency, remains a challenge within the existing Koopman-based learning frameworks. To address this challenge, we propose an algorithmic framework to simultaneously learn the vector field and Lyapunov functions for unknown nonlinear systems, using a limited amount of data sampled across the state space and along the trajectories at a relatively low sampling frequency. The proposed framework builds upon recently developed high-accuracy Koopman generator learning for capturing transient system transitions and physics-informed neural networks for training Lyapunov functions. We show that the learned Lyapunov functions can be formally verified using a satisfiability modulo theories (SMT) solver and provide less conservative estimates of the region of attraction compared to existing methods.

[LG-38] CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?

Link: https://arxiv.org/abs/2412.02735
Authors: Vaishnavi Bhargava, Rajat Ghosh, Debojyoti Dutta
Keywords: test generation capability, unit test generation, large language model, generation capability, large language
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:We introduce CPP-UT-Bench, a benchmark dataset to measure C++ unit test generation capability of a large language model (LLM). CPP-UT-Bench aims to reflect a broad and diverse set of C++ codebases found in the real world. The dataset includes 2,653 (code, unit test) pairs drawn from 14 different open-source C++ codebases spanning nine diverse domains including machine learning, software testing, parsing, standard input-output, data engineering, logging, complete expression evaluation, key value storage, and server protocols. We demonstrated the effectiveness of CPP-UT-Bench as a benchmark dataset through extensive experiments in in-context learning, parameter-efficient fine-tuning (PEFT), and full-parameter fine-tuning. We also discussed the challenges of the dataset compilation and insights we learned from in-context learning and fine-tuning experiments. Besides the CPP-UT-Bench dataset and data compilation code, we are also offering the fine-tuned model weights for further research. For nine out of ten experiments, our fine-tuned LLMs outperformed the corresponding base models by an average of more than 70%.

[LG-39] Resource-Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-Performance Computing Systems

Link: https://arxiv.org/abs/2412.02729
Authors: Marcel Aach, Rakesh Sarma, Helmut Neukirchen, Morris Riedel, Andreas Lintermann
Keywords: High-Performance Computing, Successive Halving Algorithm, Successive Doubling Algorithm, Resource-Adaptive Successive Doubling, Asynchronous Successive Halving
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Submitted to Future Generation Computer Systems

Abstract:On High-Performance Computing (HPC) systems, several hyperparameter configurations can be evaluated in parallel to speed up the Hyperparameter Optimization (HPO) process. State-of-the-art HPO methods follow a bandit-based approach and build on top of successive halving, where the final performance of a combination is estimated based on a lower than fully trained fidelity performance metric and more promising combinations are assigned more resources over time. Frequently, the number of epochs is treated as a resource, letting more promising combinations train longer. Another option is to use the number of workers as a resource and directly allocate more workers to more promising configurations via data-parallel training. This article proposes a novel Resource-Adaptive Successive Doubling Algorithm (RASDA), which combines a resource-adaptive successive doubling scheme with the plain Asynchronous Successive Halving Algorithm (ASHA). Scalability of this approach is shown on up to 1,024 Graphics Processing Units (GPUs) on modern HPC systems. It is applied to different types of Neural Networks (NNs) and trained on large datasets from the Computer Vision (CV), Computational Fluid Dynamics (CFD), and Additive Manufacturing (AM) domains, where performing more than one full training run is usually infeasible. Empirical results show that RASDA outperforms ASHA by a factor of up to 1.9 with respect to the runtime. At the same time, the solution quality of final ASHA models is maintained or even surpassed by the implicit batch size scheduling of RASDA. With RASDA, systematic HPO is applied to a terabyte-scale scientific dataset for the first time in the literature, enabling efficient optimization of complex models on massive scientific data. The implementation of RASDA is available on this https URL
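
The successive halving scheme that ASHA and RASDA extend can be sketched in a few lines. This is the plain synchronous variant, not ASHA's asynchronous promotion and not RASDA's resource-adaptive doubling of workers. The function `successive_halving`, the toy objective `toy_loss`, and the candidate learning rates are hypothetical and serve only to show how weaker configurations are dropped while survivors receive a larger budget.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs at each rung and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Lower score is better (e.g. validation loss after `budget` epochs).
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        budget *= eta  # promoted configurations train with more resources
    return survivors[0]


def toy_loss(lr, budget):
    """Hypothetical objective: loss shrinks with budget and is minimal at lr = 0.1."""
    return (lr - 0.1) ** 2 + 1.0 / budget


candidates = [0.001, 0.01, 0.1, 0.5, 1.0, 0.05, 0.2, 0.3]
best_lr = successive_halving(candidates, toy_loss)
print("selected learning rate:", best_lr)
```

Here the budget stands for epochs. The abstract describes the alternative of treating the number of data-parallel workers as the resource, which this sketch does not model.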

[LG-40] Self-test loss functions for learning weak-form operators and gradient flows

Link: https://arxiv.org/abs/2412.03506
Authors: Yuan Gao, Quanjun Lang, Fei Lu
Keywords: data-driven modeling involving, modeling involving weak-form, involving weak-form operators, test functions appropriately, loss functions presents
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:The construction of loss functions presents a major challenge in data-driven modeling involving weak-form operators in PDEs and gradient flows, particularly due to the need to select test functions appropriately. We address this challenge by introducing self-test loss functions, which employ test functions that depend on the unknown parameters, specifically for cases where the operator depends linearly on the unknowns. The proposed self-test loss function conserves energy for gradient flows and coincides with the expected log-likelihood ratio for stochastic differential equations. Importantly, it is quadratic, facilitating theoretical analysis of identifiability and well-posedness of the inverse problem, while also leading to efficient parametric or nonparametric regression algorithms. It is computationally simple, requiring only low-order derivatives or even being entirely derivative-free, and numerical experiments demonstrate its robustness against noisy and discrete data.

[LG-41] TRENDy: Temporal Regression of Effective Non-linear Dynamics

Link: https://arxiv.org/abs/2412.03496
Authors: Matthew Ricci, Guy Pelc, Zoe Piran, Noa Moriel, Mor Nitzan
Keywords: controlling cell division, protein waves controlling, waves controlling cell, morphogen dynamics underlying, Spatiotemporal dynamics pervade
Categories: Pattern Formation and Solitons (nlin.PS); Machine Learning (cs.LG)
Comments: 10 pages, 14 appendix pages, 5 figures, 7 appendix figures

Abstract:Spatiotemporal dynamics pervade the natural sciences, from the morphogen dynamics underlying patterning in animal pigmentation to the protein waves controlling cell division. A central challenge lies in understanding how controllable parameters induce qualitative changes in system behavior called bifurcations. This endeavor is made particularly difficult in realistic settings where governing partial differential equations (PDEs) are unknown and data is limited and noisy. To address this challenge, we propose TRENDy (Temporal Regression of Effective Nonlinear Dynamics), an equation-free approach to learning low-dimensional, predictive models of spatiotemporal dynamics. Following classical work in spatial coarse-graining, TRENDy first maps input data to a low-dimensional space of effective dynamics via a cascade of multiscale filtering operations. Our key insight is the recognition that these effective dynamics can be fit by a neural ordinary differential equation (NODE) having the same parameter space as the input PDE. The preceding filtering operations strongly regularize the phase space of the NODE, making TRENDy significantly more robust to noise compared to existing methods. We train TRENDy to predict the effective dynamics of synthetic and real data representing dynamics from across the physical and life sciences. We then demonstrate how our framework can automatically locate both Turing and Hopf bifurcations in unseen regions of parameter space. We finally apply our method to the analysis of spatial patterning of the ocellated lizard through development. We found that TRENDy’s effective state not only accurately predicts spatial changes over time but also identifies distinct pattern features unique to different anatomical regions, highlighting the potential influence of surface geometry on reaction-diffusion mechanisms and their role in driving spatially varying pattern dynamics.

[LG-42] Beyond algorithm hyperparameters: on preprocessing hyperparameters and associated pitfalls in machine learning applications

Link: https://arxiv.org/abs/2412.03491
Authors: Christina Sauer, Anne-Laure Boulesteix, Luzia Hanßum, Farina Hodiamont, Claudia Bausewein, Theresa Ullmann
Keywords: supervised machine learning, applied research areas, Adequately generating, machine learning, research areas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Adequately generating and evaluating prediction models based on supervised machine learning (ML) is often challenging, especially for less experienced users in applied research areas. Special attention is required in settings where the model generation process involves hyperparameter tuning, i.e. data-driven optimization of different types of hyperparameters to improve the predictive performance of the resulting model. Discussions about tuning typically focus on the hyperparameters of the ML algorithm (e.g., the minimum number of observations in each terminal node for a tree-based algorithm). In this context, it is often neglected that hyperparameters also exist for the preprocessing steps that are applied to the data before it is provided to the algorithm (e.g., how to handle missing feature values in the data). As a consequence, users experimenting with different preprocessing options to improve model performance may be unaware that this constitutes a form of hyperparameter tuning - albeit informal and unsystematic - and thus may fail to report or account for this optimization. To illuminate this issue, this paper reviews and empirically illustrates different procedures for generating and evaluating prediction models, explicitly addressing the different ways algorithm and preprocessing hyperparameters are typically handled by applied ML users. By highlighting potential pitfalls, especially those that may lead to exaggerated performance claims, this review aims to further improve the quality of predictive modeling in ML applications.

[LG-43] Tight PAC-Bayesian Risk Certificates for Contrastive Learning

Link: https://arxiv.org/abs/2412.03486
Authors: Anna van Elst, Debarghya Ghoshdastidar
Keywords: embed semantically similar, semantically similar pairs, independently drawn samples, contrastive models learn, closer than independently
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Contrastive representation learning is a modern paradigm for learning representations of unlabeled data via augmentations – precisely, contrastive models learn to embed semantically similar pairs of samples (positive pairs) closer than independently drawn samples (negative samples). In spite of its empirical success and widespread use in foundation models, statistical theory for contrastive learning remains less explored. Recent works have developed generalization error bounds for contrastive losses, but the resulting risk certificates are either vacuous (certificates based on Rademacher complexity or f-divergence) or require strong assumptions about samples that are unreasonable in practice. The present paper develops non-vacuous PAC-Bayesian risk certificates for contrastive representation learning, accounting for the practical considerations of the popular SimCLR framework. Notably, we take into account that SimCLR reuses positive pairs of augmented data as negative samples for other data, thereby inducing strong dependence and making classical PAC or PAC-Bayesian bounds inapplicable. We further refine existing bounds on the downstream classification loss by incorporating SimCLR-specific factors, including data augmentation and temperature scaling, and derive risk certificates for the contrastive zero-one risk. The resulting bounds for contrastive loss and downstream prediction are much tighter than those of previous risk certificates, as demonstrated by experiments on CIFAR-10.

[LG-44] Validity and efficiency of the conformal CUSUM procedure

Link: https://arxiv.org/abs/2412.03464
Authors: Vladimir Vovk, Ilia Nouretdinov, Alex Gammerman
Keywords: CUSUM procedure, experimentally and theoretically, paper we study, study the validity, validity and efficiency
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Abstract:In this paper we study the validity and efficiency of a conformal version of the CUSUM procedure for change detection both experimentally and theoretically.

[LG-45] Classical Shadows with Improved Median-of-Means Estimation

Link: https://arxiv.org/abs/2412.03381
Authors: Winston Fu, Dax Enshan Koh, Siong Thye Goh, Jian Feng Kong
Keywords: Nat. Phys., classical shadows protocol, modified estimators, Clifford measurements, classical shadows
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 13 figures

Abstract:The classical shadows protocol, introduced by Huang et al. [Nat. Phys. 16, 1050 (2020)], makes use of the median-of-means (MoM) estimator to efficiently estimate the expectation values of M observables with failure probability \delta using only \mathcal{O}(\log(M/\delta)) measurements. In their analysis, Huang et al. used loose constants in their asymptotic performance bounds for simplicity. However, the specific values of these constants can significantly affect the number of shots used in practical implementations. To address this, we studied a modified MoM estimator proposed by Minsker [PMLR 195, 5925 (2023)] that uses optimal constants and involves a U-statistic over the data set. For efficient estimation, we implemented two types of incomplete U-statistics estimators, the first based on random sampling and the second based on cyclically permuted sampling. We compared the performance of the original and modified estimators when used with the classical shadows protocol with single-qubit Clifford unitaries (Pauli measurements) for an Ising spin chain, and global Clifford unitaries (Clifford measurements) for the Greenberger-Horne-Zeilinger (GHZ) state. While the original estimator outperformed the modified estimators for Pauli measurements, the modified estimators showed improved performance over the original estimator for Clifford measurements. Our findings highlight the importance of tailoring estimators to specific measurement settings to optimize the performance of the classical shadows protocol in practical applications.
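
For readers unfamiliar with the median-of-means (MoM) estimator mentioned in this abstract, the following is a minimal sketch of the basic, unmodified MoM idea only. It is not the authors' implementation and does not cover the U-statistic variant; the function name `median_of_means` and its parameters are hypothetical, chosen for illustration.

```python
import statistics


def median_of_means(samples, num_groups):
    """Basic median-of-means: split the samples into equal-size groups,
    average each group, and return the median of the group averages.

    Illustrative sketch only; trailing samples that do not fill a whole
    group are dropped.
    """
    if num_groups < 1 or num_groups > len(samples):
        raise ValueError("num_groups must be between 1 and len(samples)")
    group_size = len(samples) // num_groups
    group_means = [
        statistics.fmean(samples[i * group_size:(i + 1) * group_size])
        for i in range(num_groups)
    ]
    return statistics.median(group_means)
```

Taking the median of group averages makes the estimate insensitive to a minority of corrupted groups: a single extreme sample can distort at most one group mean.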

[LG-46] Multi-Action Restless Bandits with Weakly Coupled Constraints: Simultaneous Learning and Control

Link: https://arxiv.org/abs/2412.03326
Authors: Jing Fu, Bill Moran, José Niño-Mora
Keywords: Markov decision process, multi-action bandit processes, Markov decision, bandit processes, action spaces
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR)
Comments: 70 pages, 0 figures

Abstract:We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices when taking different actions. The bandit processes of the same group share the same state and action spaces and, given the same action that is taken, the same transition matrix. All the bandit processes across various groups are subject to multiple weakly coupled constraints over their state and action variables. Unlike the past studies that focused on the offline case, we consider the online case without assuming full knowledge of transition matrices and reward functions a priori and propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes in both the timeline and the number of the bandit processes, referred to as the convergence in the time and the magnitude dimensions. Moreover, we prove that the relevant processes converge exponentially fast in the magnitude dimension, leading to exponentially diminishing performance deviation between the proposed online algorithms and offline optimality.

[LG-47] Gaussian Processes for Probabilistic Estimates of Earthquake Ground Shaking: A 1-D Proof-of-Concept NEURIPS2024

Link: https://arxiv.org/abs/2412.03299
Authors: Sam A. Scivier, Tarje Nissen-Meyer, Paula Koelemeijer, Atılım Güneş Baydin
Keywords: velocity models, seismic velocity models, velocity, key input parameters, ground motion
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 8 pages, 2 figures, accepted in the Machine Learning and the Physical Sciences Workshop at NeurIPS 2024

Abstract:Estimates of seismic wave speeds in the Earth (seismic velocity models) are key input parameters to earthquake simulations for ground motion prediction. Owing to the non-uniqueness of the seismic inverse problem, typically many velocity models exist for any given region. The arbitrary choice of which velocity model to use in earthquake simulations impacts ground motion predictions. However, current hazard analysis methods do not account for this source of uncertainty. We present a proof-of-concept ground motion prediction workflow for incorporating uncertainties arising from inconsistencies between existing seismic velocity models. Our analysis is based on the probabilistic fusion of overlapping seismic velocity models using scalable Gaussian process (GP) regression. Specifically, we fit a GP to two synthetic 1-D velocity profiles simultaneously, and show that the predictive uncertainty accounts for the differences between the models. We subsequently draw velocity model samples from the predictive distribution and estimate peak ground displacement using acoustic wave propagation through the velocity models. The resulting distribution of possible ground motion amplitudes is much wider than would be predicted by simulating shaking using only the two input velocity models. This proof-of-concept illustrates the importance of probabilistic methods for physics-based seismic hazard analysis.

[LG-48] Nonparametric Filtering Estimation and Classification using Neural Jump ODEs

Link: https://arxiv.org/abs/2412.03271
Authors: Jakob Heiss, Florian Krach, Thorsten Schmidt, Félix B. Tambe-Ndonfack
Keywords: Neural Jump ODEs, Neural Jump, Jump ODEs model, Jump ODEs, neural ODEs
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR)

Abstract:Neural Jump ODEs model the conditional expectation between observations by neural ODEs and jump at arrival of new observations. They have demonstrated effectiveness for fully data-driven online forecasting in settings with irregular and partial observations, operating under weak regularity assumptions. This work extends the framework to input-output systems, enabling direct applications in online filtering and classification. We establish theoretical convergence guarantees for this approach, providing a robust solution to L^2-optimal filtering. Empirical experiments highlight the model’s superior performance over classical parametric methods, particularly in scenarios with complex underlying distributions. These results emphasise the approach’s potential in time-sensitive domains such as finance and health monitoring, where real-time accuracy is crucial.

[LG-49] LEP-QNN: Loan Eligibility Prediction Using Quantum Neural Networks

Link: https://arxiv.org/abs/2412.03158
Authors: Nouhaila Innan, Alberto Marchisio, Mohamed Bennai, Muhammad Shafique
Keywords: Predicting loan eligibility, high accuracy remains, loan eligibility, Loan Eligibility Prediction, finance sector
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 3 tables

Abstract:Predicting loan eligibility with high accuracy remains a significant challenge in the finance sector. Accurate predictions enable financial institutions to make informed decisions, mitigate risks, and effectively adapt services to meet customer needs. However, the complexity and the high-dimensional nature of financial data have always posed significant challenges to achieving this level of precision. To overcome these issues, we propose a novel approach that employs Quantum Machine Learning (QML) for Loan Eligibility Prediction using Quantum Neural Networks (LEP-QNN). Our innovative approach achieves an accuracy of 98% in predicting loan eligibility from a single, comprehensive dataset. This performance boost is attributed to the strategic implementation of a dropout mechanism within the quantum circuit, aimed at minimizing overfitting and thereby improving the model’s predictive reliability. In addition, our exploration of various optimizers leads to identifying the most efficient setup for our LEP-QNN framework, optimizing its performance. We also rigorously evaluate the resilience of LEP-QNN under different quantum noise scenarios, ensuring its robustness and dependability for quantum computing environments. This research showcases the potential of QML in financial predictions and establishes a foundational guide for advancing QML technologies, marking a step towards developing advanced, quantum-driven financial decision-making tools.

[LG-50] Generalized Diffusion Model with Adjusted Offset Noise

Link: https://arxiv.org/abs/2412.03134
Authors: Takuro Kutsuna
Keywords: drug discovery, image generation, audio synthesis, fundamental tools, tools for modeling
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Diffusion models have become fundamental tools for modeling data distributions in machine learning and have applications in image generation, drug discovery, and audio synthesis. Despite their success, these models face challenges when generating data with extreme brightness values, as evidenced by limitations in widely used frameworks like Stable Diffusion. Offset noise has been proposed as an empirical solution to this issue, yet its theoretical basis remains insufficiently explored. In this paper, we propose a generalized diffusion model that naturally incorporates additional noise within a rigorous probabilistic framework. Our approach modifies both the forward and reverse diffusion processes, enabling inputs to be diffused into Gaussian distributions with arbitrary mean structures. We derive a loss function based on the evidence lower bound, establishing its theoretical equivalence to offset noise with certain adjustments, while broadening its applicability. Experiments on synthetic datasets demonstrate that our model effectively addresses brightness-related challenges and outperforms conventional methods in high-dimensional scenarios.

[LG-51] A Scalable Quantum Neural Network for Approximate SRBB-Based Unitary Synthesis

Link: https://arxiv.org/abs/2412.03083
Authors: Giacomo Belli, Marco Mordacci, Michele Amoretti
Keywords: Recursive Block Basis, Standard Recursive Block, Block Basis, Standard Recursive, Recursive Block
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Journal

Abstract:In this work, scalable quantum neural networks are introduced to approximate unitary evolutions through the Standard Recursive Block Basis (SRBB) and, subsequently, redesigned with a reduced number of CNOTs. This algebraic approach to the problem of unitary synthesis exploits Lie algebras and their topological features to obtain scalable parameterizations of unitary operators. First, the recursive algorithm that builds the SRBB is presented, framed in the original scalability scheme already known to the literature only from a theoretical point of view. Unexpectedly, 2-qubit systems emerge as a special case outside this scheme. Furthermore, an algorithm to reduce the number of CNOTs is proposed, thus deriving a new implementable scaling scheme that requires one single layer of approximation. From the mathematical algorithm, the scalable CNOT-reduced quantum neural network is implemented and its performance is assessed with a variety of different unitary matrices, both sparse and dense, up to 6 qubits via the PennyLane library. The effectiveness of the approximation is measured with different metrics in relation to two optimizers: a gradient-based method and the Nelder-Mead method. The approximate SRBB-based synthesis algorithm with CNOT-reduction is also tested on real hardware and compared with other valid approximation and decomposition methods available in the literature.

[LG-52] Hamiltonian-based neural networks for systems under nonholonomic constraints

Link: https://arxiv.org/abs/2412.03018
Authors: Ignacio Puiggros T., A. Srikantha Phani
Keywords: incorporate physics priors, Hamiltonian neural networks, Hamiltonian, neural network, Hamiltonian neural
Categories: Classical Physics (physics.class-ph); Machine Learning (cs.LG)

Abstract:There has been increasing interest in methodologies that incorporate physics priors into neural network architectures to enhance their modeling capabilities. A family of these methodologies that has gained traction are Hamiltonian neural networks (HNN) and their variations. These architectures explicitly encode Hamiltonian mechanics both in their structure and loss function. Although Hamiltonian systems under nonholonomic constraints are in general not Hamiltonian, it is possible to formulate them in pseudo-Hamiltonian form, equipped with a Lie bracket which is almost Poisson. This opens the possibility of using some principles of HNNs in systems under nonholonomic constraints. The goal of the present work is to develop a modified Hamiltonian neural network architecture capable of modeling Hamiltonian systems under holonomic and nonholonomic constraints. A three-network parallel architecture is proposed to simultaneously learn the Hamiltonian of the system, the constraints, and their associated multipliers. A rolling disk and a ball on a spinning table are considered as canonical examples to assess the performance of the proposed Hamiltonian architecture. The experiments are then repeated with a noisy training set to study modeling performance under more realistic conditions.

[LG-53] Preference-based Pure Exploration

Link: https://arxiv.org/abs/2412.02988
Authors: Apurv Shukla, Debabrota Basu
Keywords: preference-based pure exploration, pure exploration problem, pure exploration, Pareto optimal arms, preference-based pure
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study the preference-based pure exploration problem for bandits with vector-valued rewards. The rewards are ordered using a (given) preference cone \mathcal{C} and our goal is to identify the set of Pareto optimal arms. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity for identifying the most preferred policy with confidence level 1-\delta. Our lower bound elicits the role played by the geometry of the preference cone and punctuates the difference in hardness compared to existing best-arm identification variants of the problem. We further explicate this geometry when rewards follow Gaussian distributions. We then provide a convex relaxation of the lower bound and leverage it to design the Preference-based Track and Stop (PreTS) algorithm that identifies the most preferred policy. Finally, we show that the sample complexity of PreTS is asymptotically tight by deriving a new concentration inequality for vector-valued rewards.

[LG-54] Unified Inductive Logic: From Formal Learning to Statistical Inference to Supervised Learning

Link: https://arxiv.org/abs/2412.02969
Authors: Hanti Lin
Keywords: formal learning theory, unify formal learning, logic is Carnapian, develop a Peircean, Peircean alternative
Categories: Other Statistics (stat.OT); Machine Learning (cs.LG)

Abstract:While the traditional conception of inductive logic is Carnapian, I develop a Peircean alternative and use it to unify formal learning theory, statistics, and a significant part of machine learning: supervised learning. Some crucial standards for evaluating non-deductive inferences have been assumed separately in those areas, but can actually be justified by a unifying principle.

[LG-55] MACAW: A Causal Generative Model for Medical Imaging

Link: https://arxiv.org/abs/2412.02900
Authors: Vibujithan Vigneshwaran, Erik Ohara, Matthias Wilms, Nils Forkert
Keywords: deep learning techniques, techniques show promising, learning techniques show, clinical scenarios, deep learning models
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 27 pages

Abstract:Although deep learning techniques show promising results for many neuroimaging tasks in research settings, they have not yet found widespread use in clinical scenarios. One of the reasons for this problem is that many machine learning models only identify correlations between the input images and the outputs of interest, which can lead to many practical problems, such as encoding of uninformative biases and reduced explainability. Thus, recent research is exploring if integrating a priori causal knowledge into deep learning models is a potential avenue to identify these problems. This work introduces a new causal generative architecture named Masked Causal Flow (MACAW) for neuroimaging applications. Within this context, three main contributions are described. First, a novel approach that integrates complex causal structures into normalizing flows is proposed. Second, counterfactual prediction is performed to identify the changes in effect variables associated with a cause variable. Finally, an explicit Bayesian inference for classification is derived and implemented, providing an inherent uncertainty estimation. The feasibility of the proposed method was first evaluated using synthetic data and then using MRI brain data from more than 23000 participants of the UK biobank study. The evaluation results show that the proposed method can (1) accurately encode causal reasoning and generate counterfactuals highlighting the structural changes in the brain known to be associated with aging, (2) accurately predict a subject’s age from a single 2D MRI slice, and (3) generate new samples assuming other values for subject-specific indicators such as age, sex, and body mass index. The code for a toy dataset is available at the following link: this https URL.

[LG-56] An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits NEURIPS2025

Link: https://arxiv.org/abs/2412.02861
Authors: Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
Keywords: Thompson Sampling algorithm, logistic bandit problems, agent receives binary, Thompson Sampling, receives binary rewards
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages, Accepted to NeurIPS 2025 Workshop on Bayesian Decision-Making and Uncertainty

Abstract:We study the performance of the Thompson Sampling algorithm for logistic bandit problems, where the agent receives binary rewards with probabilities determined by a logistic function \exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle)). We focus on the setting where the action a and parameter \theta lie within the d-dimensional unit ball with the action space encompassing the parameter space. Adopting the information-theoretic framework introduced by Russo & Van Roy (2015), we analyze the information ratio, which is defined as the ratio of the expected squared difference between the optimal and actual rewards to the mutual information between the optimal action and the reward. Improving upon previous results, we establish that the information ratio is bounded by \tfrac{9}{2}d. Notably, we obtain a regret bound in O(d\sqrt{T \log(\beta T/d)}) that depends only logarithmically on the parameter \beta.
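
The reward model in this abstract can be made concrete with a short sketch. The code below only evaluates the logistic reward probability stated in the abstract and draws one binary reward from it; it does not implement Thompson Sampling or the paper's analysis. The names `reward_probability` and `sample_reward` are hypothetical.

```python
import math
import random


def reward_probability(action, theta, beta):
    """Probability of a reward of 1 under the logistic model
    exp(beta * <a, theta>) / (1 + exp(beta * <a, theta>)).

    Evaluated in a numerically stable way for large |beta * <a, theta>|.
    """
    z = beta * sum(a_i * t_i for a_i, t_i in zip(action, theta))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_reward(action, theta, beta, rng=random):
    """Draw one binary reward (0 or 1) for the given action."""
    return 1 if rng.random() < reward_probability(action, theta, beta) else 0
```

A larger \beta makes the reward probability a sharper function of the inner product \langle a, \theta \rangle, which is why the dependence of the regret bound on \beta is of interest.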

[LG-57] Universal Rates of Empirical Risk Minimization NEURIPS2024

Link: https://arxiv.org/abs/2412.02810
Authors: Steve Hanneke, Mingyue Xu
Keywords: classical PAC theory, empirical risk minimization, well-known empirical risk, learning, machine learning algorithms
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: This paper has been accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:The well-known empirical risk minimization (ERM) principle is the basis of many widely used machine learning algorithms, and plays an essential role in the classical PAC theory. A common description of a learning algorithm’s performance is its so-called “learning curve”, that is, the decay of the expected error as a function of the input sample size. As the PAC model fails to explain the behavior of learning curves, recent research has explored an alternative universal learning model and has ultimately revealed a distinction between optimal universal and uniform learning rates (Bousquet et al., 2021). However, a basic understanding of such differences with a particular focus on the ERM principle has yet to be developed. In this paper, we consider the problem of universal learning by ERM in the realizable case and study the possible universal rates. Our main result is a fundamental tetrachotomy: there are only four possible universal learning rates by ERM, namely, the learning curves of any concept class learnable by ERM decay either at e^{-n}, 1/n, \log(n)/n, or arbitrarily slow rates. Moreover, we provide a complete characterization of which concept classes fall into each of these categories, via new complexity structures. We also develop new combinatorial dimensions which supply sharp asymptotically-valid constant factors for these rates, whenever possible.

[LG-58] Harnessing Multiple Correlated Networks for Exact Community Recovery NEURIPS

Link: https://arxiv.org/abs/2412.02796
Authors: Miklós Z. Rácz, Jifan Zhang
Keywords: edge-correlated stochastic block, stochastic block models, multiple correlated networks, exact community recovery, focusing on edge-correlated
Categories: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Probability (math.PR)
Comments: 53 pages, 4 figures. To appear in Advances in Neural Information Processing Systems (NeurIPS) 2024

Abstract:We study the problem of learning latent community structure from multiple correlated networks, focusing on edge-correlated stochastic block models with two balanced communities. Recent work of Gaudio, Rácz, and Sridhar (COLT 2022) determined the precise information-theoretic threshold for exact community recovery using two correlated graphs; in particular, this showcased the subtle interplay between community recovery and graph matching. Here we study the natural setting of more than two graphs. The main challenge lies in understanding how to aggregate information across several graphs when none of the pairwise latent vertex correspondences can be exactly recovered. Our main result derives the precise information-theoretic threshold for exact community recovery using any constant number of correlated graphs, answering a question of Gaudio, Rácz, and Sridhar (COLT 2022). In particular, for every K \geq 3 we uncover and characterize a region of the parameter space where exact community recovery is possible using K correlated graphs, even though (1) this is information-theoretically impossible using any K-1 of them and (2) none of the latent matchings can be exactly recovered.

[LG-59] Methods with Local Steps and Random Reshuffling for Generally Smooth Non-Convex Federated Optimization

Link: https://arxiv.org/abs/2412.02781
Authors: Yury Demidovich, Petr Ostroukhov, Grigory Malinovsky, Samuel Horváth, Martin Takáč, Peter Richtárik, Eduard Gorbunov
Keywords: Non-convex Machine Learning, Non-convex Machine, Machine Learning problems, Machine Learning, Learning problems typically
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Non-convex Machine Learning problems typically do not adhere to the standard smoothness assumption. Based on empirical findings, Zhang et al. (2020b) proposed a more realistic generalized (L_0, L_1)-smoothness assumption, though it remains largely unexplored. Many existing algorithms designed for standard smooth problems need to be revised. However, in the context of Federated Learning, only a few works address this problem but rely on additional limiting assumptions. In this paper, we address this gap in the literature: we propose and analyze new methods with local steps, partial participation of clients, and Random Reshuffling without extra restrictive assumptions beyond generalized smoothness. The proposed methods are based on the proper interplay between clients’ and server’s stepsizes and gradient clipping. Furthermore, we perform the first analysis of these methods under the Polyak-Łojasiewicz condition. Our theory is consistent with the known results for standard smooth problems, and our experimental results support the theoretical insights.
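
As background for this abstract, the generalized smoothness condition it refers to is commonly written as below for a twice-differentiable objective f. The clipped update on the last line is a generic single-node gradient-clipping step, shown only to illustrate how clipping caps the step length; the stepsize \gamma and clipping level c are illustrative symbols, and this is not the federated method proposed in the paper.

```latex
% Generalized (L_0, L_1)-smoothness (Zhang et al., 2020), twice-differentiable f:
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x .
% Setting L_1 = 0 recovers the standard L_0-smoothness assumption.

% Generic clipped gradient step (illustrative, not the paper's algorithm):
x^{k+1} = x^k - \gamma \, \min\!\left\{ 1, \frac{c}{\|\nabla f(x^k)\|} \right\} \nabla f(x^k) .
```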

Information Retrieval

[IR-0] Freshness and Informativity Weighted Cognitive Extent and Its Correlation with Cumulative Citation Count

Link: https://arxiv.org/abs/2412.03557
Authors: Zihe Wang, Jian Wu
Keywords: revisit cognitive extent, Weighted Cognitive Extent, Informative Weighted Cognitive, cognitive extent, revisit cognitive
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Abstract:In this paper, we revisit cognitive extent, originally defined as the number of unique phrases in a quota. We introduce Freshness and Informative Weighted Cognitive Extent (FICE), calculated based on two novel weighting factors, the lifetime ratio and informativity of scientific entities. We model the lifetime of each scientific entity as the time-dependent document frequency, which is fit by the composition of multiple Gaussian profiles. The lifetime ratio is then calculated as the cumulative document frequency at the publication time t_0 divided by the cumulative document frequency over its entire lifetime. The informativity is calculated by normalizing the document frequency across all scientific entities recognized in a title. Using the ACL Anthology, we verified the trend formerly observed in several other domains that the number of unique scientific entities per quota increased gradually at a slower rate. We found that FICE exhibits a strong correlation with the average cumulative citation count within a quota. Our code is available at this https URL.
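
The lifetime ratio described in this abstract can be illustrated with a small sketch. It assumes raw per-year document counts for one entity, whereas the paper fits the document frequency with Gaussian profiles first; the function name `lifetime_ratio` and the dictionary input format are hypothetical.

```python
def lifetime_ratio(doc_freq_by_year, publication_year):
    """Cumulative document frequency up to and including publication_year,
    divided by the cumulative document frequency over the whole lifetime.

    doc_freq_by_year maps a year to the number of documents mentioning the
    entity in that year (raw counts, an assumption of this sketch).
    """
    total = sum(doc_freq_by_year.values())
    if total == 0:
        raise ValueError("entity has no recorded document frequency")
    seen_so_far = sum(
        count for year, count in doc_freq_by_year.items()
        if year <= publication_year
    )
    return seen_so_far / total
```

Under this reading, a ratio near 0 means the entity was still new when the paper was published, and a ratio near 1 means most of its usage had already occurred.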

[IR-1] Beyond Questions: Leveraging ColBERT for Keyphrase Search

Link: https://arxiv.org/abs/2412.03193
Authors: Jorge Gabín, Javier Parapar, Craig Macdonald
Keywords: engines’ users increasingly, users increasingly adopt, search engines’ users, gaining popularity, engines’ users
Categories: Information Retrieval (cs.IR)

Abstract:While question-like queries are gaining popularity and search engines’ users increasingly adopt them, keyphrase search has traditionally been the cornerstone of web search. This query type is also prevalent in specialised search tasks such as academic or professional search, where experts rely on keyphrases to articulate their information needs. However, current dense retrieval models often fail with keyphrase-like queries, primarily because they are mostly trained on question-like ones. This paper introduces a novel model that employs the ColBERT architecture to enhance document ranking for keyphrase queries. For that, given the lack of large keyphrase-based retrieval datasets, we first explore how Large Language Models can convert question-like queries into keyphrase format. Then, using those keyphrases, we train a keyphrase-based ColBERT ranker (ColBERTKP_QD) to improve the performance when working with keyphrase queries. Furthermore, to reduce the training costs associated with training the full ColBERT model, we investigate the feasibility of training only a keyphrase query encoder while keeping the document encoder weights static (ColBERTKP_Q). We assess our proposals’ ranking performance using both automatically generated and manually annotated keyphrases. Our results reveal the potential of the late interaction architecture when working under the keyphrase search scenario.

Attachment Download

Click to download the full list of today's papers
