Today's arXiv Papers | 2025-01-09

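This post lists the latest papers retrieved from arXiv.org on 2025-01-09. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.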

Note: The daily paper data is pulled from arXiv.org and refreshed automatically every morning at around 12:00.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.

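[Quick Read]: This paper addresses the insufficient complexity and diversity of data used for instruction tuning of code large language models (LLMs). Existing methods rely mainly on code snippets, which are usually confined to specific functionality and rigid structures, limiting the complexity and diversity of the generated data. To address this, the paper proposes a synthesis framework based on feature trees. The framework is inspired by Abstract Syntax Trees (AST), but whereas an AST captures only the syntactic structure of code, a feature tree models the semantic relationships between code elements, yielding more nuanced and diverse data. The key to the solution is to build the feature tree from raw data and refine it iteratively to increase the number and diversity of extracted features, which makes it possible to identify more complex patterns and relationships in code. By sampling subtrees with controlled depth and breadth, the framework can precisely adjust the complexity of the generated code, supporting tasks that range from simple function-level operations to complex multi-file scenarios. Finally, the paper fine-tunes widely used base models to create the EpiCoder series, which achieves state-of-the-art performance on multiple benchmarks and shows notable potential for synthesizing highly complex repository-level code data.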

Link: https://arxiv.org/abs/2501.04694
Authors: Yaoxiang Wang,Haoling Li,Xin Zhang,Jie Wu,Xiao Liu,Wenxiang Hu,Zhongxin Guo,Yangyu Huang,Ying Xin,Yujiu Yang,Jinsong Su,Qi Chen,Scarlett Li
Affiliations: Xiamen University; Tsinghua University; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 40 pages, 11 figures

Abstract:Effective instruction tuning is indispensable for optimizing code LLMs, aligning model behavior with user expectations and enhancing model performance in real-world applications. However, most existing methods focus on code snippets, which are limited to specific functionalities and rigid structures, restricting the complexity and diversity of the synthesized data. To address these limitations, we introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST). Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements, enabling the generation of more nuanced and diverse data. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features. This process enables the identification of more complex patterns and relationships within the code. By sampling subtrees with controlled depth and breadth, our framework allows precise adjustments to the complexity of the generated code, supporting a wide range of tasks from simple function-level operations to intricate multi-file scenarios. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels across multiple benchmarks. Notably, empirical evidence indicates that our approach shows significant potential in synthesizing highly complex repository-level code data. Further analysis elucidates the merits of this approach by rigorously assessing data complexity and diversity through software engineering principles and LLM-as-a-judge method.
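
The idea of sampling subtrees with controlled depth and breadth can be illustrated with a minimal sketch. The `FeatureNode` class, the `sample_subtree` function, and the toy feature names below are hypothetical and are not taken from the paper; the sketch only shows how two parameters can bound the size of a sampled tree.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNode:
    """One node of a hypothetical feature tree: a feature name plus child features."""
    name: str
    children: List["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Copy `node`, keeping at most `max_depth` levels and `max_breadth` children per node."""
    if max_depth <= 1 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    picked = rng.sample(node.children, k)  # random choice of k children
    return FeatureNode(
        node.name,
        [sample_subtree(c, max_depth - 1, max_breadth, rng) for c in picked],
    )


def depth(node):
    """Number of levels in the tree rooted at `node`."""
    return 1 + max((depth(c) for c in node.children), default=0)


# Toy tree (assumed feature names, for illustration only).
root = FeatureNode("io", [
    FeatureNode("file", [FeatureNode("read"), FeatureNode("write")]),
    FeatureNode("net", [FeatureNode("http")]),
    FeatureNode("cli"),
])
sub = sample_subtree(root, max_depth=2, max_breadth=2, rng=random.Random(0))
```

Raising `max_depth` or `max_breadth` yields larger feature sets, which is one plausible way a generator prompt could be steered from function-level toward multi-file tasks.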

[NLP-1] URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

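[Quick Read]: This paper addresses the scarcity of high-quality chain-of-thought (CoT) training data in multimodal mathematical reasoning, which prevents existing models from achieving high-precision CoT reasoning and from realizing their reasoning potential at test time. To address this, the paper proposes a three-module synthesis strategy comprising CoT distillation, trajectory-format rewriting, and format unification, producing MMathCoT-1M, a high-quality CoT reasoning instruction fine-tuning dataset for multimodal mathematics. The paper also introduces a data synthesis strategy that automatically generates the process annotation dataset DualMath-1.1M, which focuses on both interpretation and logic, to strengthen test-time scaling. By further training URSA-7B on DualMath-1.1M, the paper moves from CoT reasoning capabilities to robust supervision abilities, and the resulting URSA-RM-7B serves as a verifier that effectively improves URSA-7B's performance at test time.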

Link: https://arxiv.org/abs/2501.04686
Authors: Ruilin Luo,Zhuofan Zheng,Yifan Wang,Yiyao Yu,Xinzhe Ni,Zicheng Lin,Jin Zeng,Yujiu Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 27 pages, 10 tables, 17 figures. The training data has been released. The code and model are currently undergoing internal review. They will be made available soon. Project url: this https URL

Abstract:Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.
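
Using a verifier at test time is often done as best-of-N selection, which can be sketched as follows. Everything here is an assumption for illustration: the per-step scores are hard-coded stand-ins for the output of a process reward model, and aggregating by the minimum step score is one common choice, not necessarily what URSA-RM-7B does.

```python
from typing import Callable, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a solution by its weakest step (assumed aggregation rule)."""
    return min(step_scores)


def best_of_n(solutions: List[str],
              step_scorer: Callable[[str], Sequence[float]]) -> str:
    """Return the candidate solution with the highest aggregated verifier score."""
    return max(solutions, key=lambda s: aggregate_min(step_scorer(s)))


# Hypothetical per-step scores standing in for a process reward model.
TOY_SCORES = {
    "solution A": [0.9, 0.2, 0.8],  # one weak step
    "solution B": [0.7, 0.6, 0.7],  # uniformly acceptable steps
}
best = best_of_n(list(TOY_SCORES), lambda s: TOY_SCORES[s])
```

With the minimum rule, a single doubtful step is enough to rank a candidate below one whose steps are all moderately reliable.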

[NLP-2] Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

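[Quick Read]: This paper addresses the fact that traditional chain-of-thought (CoT) does not explicitly model the underlying reasoning process. The authors propose a new framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional CoT by modeling that reasoning explicitly. The key to the solution is to produce Meta-CoT through process supervision, synthetic data generation, and search algorithms, and to train models to generate Meta-CoT by combining instruction tuning with reinforcement learning. The framework provides a theoretical and practical basis for more powerful and human-like reasoning in large language models (LLMs).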

Link: https://arxiv.org/abs/2501.04682
Authors: Violet Xiang,Charlie Snell,Kanishk Gandhi,Alon Albalak,Anikait Singh,Chase Blagden,Duy Phung,Rafael Rafailov,Nathan Lile,Dakota Mahan,Louis Castricato,Jan-Philipp Franken,Nick Haber,Chelsea Finn
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.
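
A "linearized search trace" can be pictured as a search tree flattened into a single sequence that keeps the dead ends and the backtracking. The sketch below is only an assumed illustration: the dictionary layout, the `linearize` function, and the `try:` / `backtrack to:` markers are hypothetical and do not reflect the trace format used in the paper.

```python
def linearize(node, trace):
    """Depth-first walk that records attempts and backtracking; returns True once solved."""
    trace.append(f"try: {node['step']}")
    if node.get("solved", False):
        trace.append("solved")
        return True
    for child in node.get("children", []):
        if linearize(child, trace):
            return True
        # The child branch failed, so the trace records a return to this node.
        trace.append(f"backtrack to: {node['step']}")
    return False


# Toy search tree (hypothetical steps, for illustration only).
tree = {
    "step": "factor the expression",
    "children": [
        {"step": "guess integer roots"},  # dead end
        {"step": "apply quadratic formula", "solved": True},
    ],
}
trace = []
solved = linearize(tree, trace)
```

The resulting `trace` list keeps the failed branch, which is the kind of signal a plain final-answer CoT would discard.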

[NLP-3] Enhancing Financial VQA in Vision Language Models using Intermediate Structured Representations

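[Quick Read]: This paper addresses the challenges automated models face in accurately extracting information from charts, focusing on the structural features of bar charts (simple, stacked, and grouped). The key to the solution is fine-tuning DEPLOT, a modality conversion model that translates a chart image into a linearized table. The model is fine-tuned on a custom dataset of 50,000 bar charts and evaluated with two metrics, Relative Mapping Similarity (RMS) and Relative Number Set Similarity (RNSS), and the study demonstrates the effectiveness of the fine-tuned DEPLOT model at information extraction. The study also examines how large language models (LLMs) perform at chart reasoning and finds that supplying a structured intermediate table significantly improves LLM reasoning compared with querying the image directly.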

Link: https://arxiv.org/abs/2501.04675
Authors: Archita Srivastava,Abhas Kumar,Rajesh Kumar,Prabhakar Srinivasan
Affiliations: Synechron, Bangalore, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Chart interpretation is crucial for visual data analysis, but accurately extracting information from charts poses significant challenges for automated models. This study investigates the fine-tuning of DEPLOT, a modality conversion module that translates the image of a plot or chart to a linearized table, on a custom dataset of 50,000 bar charts. The dataset comprises simple, stacked, and grouped bar charts, targeting the unique structural features of these visualizations. The finetuned DEPLOT model is evaluated against its base version using a test set of 1,000 images and two metrics: Relative Mapping Similarity (RMS), which measures categorical mapping accuracy, and Relative Number Set Similarity (RNSS), which evaluates numerical interpretation accuracy. To further explore the reasoning capabilities of large language models (LLMs), we curate an additional set of 100 bar chart images paired with question answer sets. Our findings demonstrate that providing a structured intermediate table alongside the image significantly enhances LLM reasoning performance compared to direct image queries.
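
To make the "structured intermediate table" concrete, the sketch below parses a linearized table into records and renders it as text that could accompany a question to an LLM. The separators (`<0x0A>` between rows, `|` between cells), the function names, and the sample data are assumptions for illustration; the exact output format of the fine-tuned model is not given here.

```python
def parse_linearized_table(text, row_sep="<0x0A>", cell_sep="|"):
    """Split a linearized table into a list of row dictionaries keyed by the header row."""
    rows = [
        [cell.strip() for cell in row.split(cell_sep)]
        for row in text.split(row_sep)
        if row.strip()
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, r)) for r in body]


def table_to_prompt(records, question):
    """Render the records as plain text followed by the question (hypothetical prompt layout)."""
    lines = [", ".join(f"{k}={v}" for k, v in rec.items()) for rec in records]
    return "Table:\n" + "\n".join(lines) + f"\nQuestion: {question}"


# Made-up bar chart data in the assumed linearized format.
example = "Quarter | Revenue | Cost <0x0A> Q1 | 120 | 80 <0x0A> Q2 | 150 | 95"
records = parse_linearized_table(example)
prompt = table_to_prompt(records, "Which quarter had the higher revenue?")
```

Values stay as strings here; a real pipeline would likely convert numeric cells before any arithmetic.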

[NLP-4] On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

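[Quick Read]: This paper seeks to uncover why language models (LMs) show a strong bias toward entities associated with Western culture when operating in non-Western languages. It analyzes several contributing factors, including how entities are represented in pre-training data and how linguistic phenomena vary across languages. The key contribution is CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts. Evaluations with CAMeL-2 show that the performance gap between cultures is smaller when LMs are tested in English than in Arabic. The study also finds that LMs struggle in Arabic with high-frequency entities, which can carry multiple word senses, and with entities that have high lexical overlap with non-Arabic languages written in the Arabic script. Finally, the paper shows that frequency-based tokenization contributes to this problem in Arabic, and that it worsens with larger Arabic vocabularies.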

Link: https://arxiv.org/abs/2501.04662
Authors: Tarek Naous,Wei Xu
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: this https URL

[NLP-5] Assessing Language Comprehension in Large Language Models Using Construction Grammar

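[Quick Read]: This paper addresses how to systematically evaluate the true natural language understanding (NLU) ability of large language models (LLMs). Because LLMs are trained on web-scale data, assessing their "understanding" of language is especially difficult. The authors propose an evaluation based on Construction Grammar (CxG). By analyzing the meaning captured by linguistic elements known as constructions (Cxns), CxG provides a theoretical basis for building targeted evaluation sets. These datasets are carefully designed to include examples that are unlikely to appear in pre-training data yet are intuitive and easy for humans to understand, allowing a more accurate and reliable assessment of LLM language understanding. Experiments show that although LLMs demonstrate some knowledge of constructional information, even the latest models, including GPT-o1, struggle with the abstract meanings these constructions convey when test sentences differ substantially from their pre-training data. The authors argue that such cases more faithfully expose key limitations in LLMs' semantic capabilities.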

Link: https://arxiv.org/abs/2501.04661
Authors: Wesley Scivetti,Melissa Torgbi,Austin Blodgett,Mollie Shichman,Taylor Hudson,Claire Bonial,Harish Tayyar Madabushi
Affiliations: Georgetown University; University of Bath; Army Research Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models, despite their significant capabilities, are known to fail in surprising and unpredictable ways. Evaluating their true ‘understanding’ of language is particularly challenging due to the extensive web-scale data they are trained on. Therefore, we construct an evaluation to systematically assess natural language understanding (NLU) in LLMs by leveraging Construction Grammar (CxG), which provides insights into the meaning captured by linguistic elements known as constructions (Cxns). CxG is well-suited for this purpose because it provides a theoretical basis to construct targeted evaluation sets. These datasets are carefully constructed to include examples which are unlikely to appear in pre-training data, yet intuitive and easy for humans to understand, enabling a more targeted and reliable assessment. Our experiments focus on downstream natural language inference and reasoning tasks by comparing LLMs’ understanding of the underlying meanings communicated through 8 unique Cxns with that of humans. The results show that while LLMs demonstrate some knowledge of constructional information, even the latest models including GPT-o1 struggle with abstract meanings conveyed by these Cxns, as demonstrated in cases where test sentences are dissimilar to their pre-training data. We argue that such cases provide a more accurate test of true language understanding, highlighting key limitations in LLMs’ semantic capabilities. We make our novel dataset and associated experimental data including prompts and model responses publicly available.

[NLP-6] Multi-task retriever fine-tuning for domain-specific and efficient RAG NAACL2025

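[Quick Read]: This paper addresses two main problems that arise when deploying Retrieval-Augmented Generation (RAG) applications in the real world. First, the retrieved information is usually domain-specific, and because fine-tuning large language models (LLMs) is computationally expensive, it is more feasible to fine-tune the retriever to improve the quality of the data fed to the LLM. Second, as more applications are deployed in the same system, deploying a separate retriever for each is unaffordable, and these RAG applications typically retrieve different kinds of data. The proposed solution is to instruction fine-tune a small retriever encoder on a variety of domain-specific tasks, so that a single encoder can serve many use cases with low cost, scalability, and speed. The encoder generalizes to out-of-domain settings and also handles an unseen retrieval task in real-world enterprise use cases.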

Link: https://arxiv.org/abs/2501.04652
Authors: Patrice Béchard,Orlando Marquez Ayala
Affiliations: ServiceNow
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures. Submitted to NAACL 2025 Industry Track

Abstract:Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying Large Language Models (LLMs), as it can address typical limitations such as generating hallucinated or outdated information. However, when building real-world RAG applications, practical issues arise. First, the retrieved information is generally domain-specific. Since it is computationally expensive to fine-tune LLMs, it is more feasible to fine-tune the retriever to improve the quality of the data included in the LLM input. Second, as more applications are deployed in the same real-world system, one cannot afford to deploy separate retrievers. Moreover, these RAG applications normally retrieve different kinds of data. Our solution is to instruction fine-tune a small retriever encoder on a variety of domain-specific tasks to allow us to deploy one encoder that can serve many use cases, thereby achieving low-cost, scalability, and speed. We show how this encoder generalizes to out-of-domain settings as well as to an unseen retrieval task on real-world enterprise use cases.

[NLP-7] FlairGPT: Repurposing LLMs for Interior Designs

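[Quick Read]: This paper addresses the complex problem of interior design: generating layouts that are aesthetically pleasing, functional, and aligned with the client's brief. Existing data-driven solutions are typically room- or domain-specific and lack explainability in their design considerations. The key idea is to explore how large language models (LLMs) can assist interior design within a structured workflow. Although LLMs cannot yet generate complete layouts directly, they can be probed systematically to produce a list of objects and the relevant placement constraints. This information is translated into a design layout graph, which is then solved with an off-the-shelf constrained optimization setup to produce the final layout. The method is benchmarked across a variety of design configurations against existing LLM-based methods and human designs, and the results indicate that LLMs, when used in a structured manner, can generate diverse, high-quality layouts suitable for creating large-scale virtual scenes.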

Link: https://arxiv.org/abs/2501.04648
Authors: Gabrielle Littlefair,Niladri Shekhar Dutt,Niloy J. Mitra
Affiliations: University College London; Adobe Research
Categories: Graphics (cs.GR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EUROGRAPHICS 2025

Abstract:Interior design involves the careful selection and arrangement of objects to create an aesthetically pleasing, functional, and harmonized space that aligns with the client’s design brief. This task is particularly challenging, as a successful design must not only incorporate all the necessary objects in a cohesive style, but also ensure they are arranged in a way that maximizes accessibility, while adhering to a variety of affordability and usage considerations. Data-driven solutions have been proposed, but these are typically room- or domain-specific and lack explainability in the design considerations used in producing the final layout. In this paper, we investigate if large language models (LLMs) can be directly utilized for interior design. While we find that LLMs are not yet capable of generating complete layouts, they can be effectively leveraged in a structured manner, inspired by the workflow of interior designers. By systematically probing LLMs, we can reliably generate a list of objects along with relevant constraints that guide their placement. We translate this information into a design layout graph, which is then solved using an off-the-shelf constrained optimization setup to generate the final layouts. We benchmark our algorithm in various design configurations against existing LLM-based methods and human designs, and evaluate the results using a variety of quantitative and qualitative metrics along with user studies. In summary, we demonstrate that LLMs, when used in a structured manner, can effectively generate diverse high-quality layouts, making them a viable solution for creating large-scale virtual scenes. Project webpage at this https URL

[NLP-8] Quantum-inspired Embeddings Projection and Similarity Metrics for Representation Learning

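[Quick Read]: This paper addresses how to compress embedding dimensionality in representation learning while preserving the similarity relationships between vectors. The key to the solution is a quantum-inspired projection head that maps classical embeddings onto quantum states in Hilbert space and reduces embedding dimensionality with a quantum circuit. The paper also introduces a quantum-inspired similarity metric to preserve the similarity relationships between vectors. With the projection head integrated into the BERT language model and evaluated on information retrieval tasks using the TREC 2019 and TREC 2020 Deep Learning benchmarks, the quantum-inspired method achieves performance competitive with the classical method while using 32 times fewer parameters, and when trained from scratch it performs particularly well on smaller datasets.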

Link: https://arxiv.org/abs/2501.04591
Authors: Ivan Kankeu,Stefan Gerd Fritsch,Gunnar Schönhoff,Elie Mounzer,Paul Lukowicz,Maximilian Kiefer-Emmanouilidis
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Quantum Physics (quant-ph)
Comments:

Abstract:Over the last decade, representation learning, which embeds complex information extracted from large amounts of data into dense vector spaces, has emerged as a key technique in machine learning. Among other applications, it has been a key building block for large language models and advanced computer vision systems based on contrastive learning. A core component of representation learning systems is the projection head, which maps the original embeddings into different, often compressed spaces, while preserving the similarity relationship between vectors. In this paper, we propose a quantum-inspired projection head that includes a corresponding quantum-inspired similarity metric. Specifically, we map classical embeddings onto quantum states in Hilbert space and introduce a quantum circuit-based projection head to reduce embedding dimensionality. To evaluate the effectiveness of this approach, we extended the BERT language model by integrating our projection head for embedding compression. We compared the performance of embeddings, which were compressed using our quantum-inspired projection head, with those compressed using a classical projection head on information retrieval tasks using the TREC 2019 and TREC 2020 Deep Learning benchmarks. The results demonstrate that our quantum-inspired method achieves competitive performance relative to the classical method while utilizing 32 times fewer parameters. Furthermore, when trained from scratch, it notably excels, particularly on smaller datasets. This work not only highlights the effectiveness of the quantum-inspired approach but also emphasizes the utility of efficient, ad hoc low-entanglement circuit simulations within neural networks as a powerful quantum-inspired technique. 
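
Mapping a classical embedding onto a quantum state can be illustrated with amplitude encoding, where the vector is normalized to unit length, and a state-overlap similarity. This is a minimal sketch under stated assumptions: the function names are hypothetical, and fidelity (the squared inner product) is a standard overlap measure chosen here for illustration, not necessarily the metric or circuit the paper uses.

```python
import math


def to_state(vec):
    """Amplitude-encode a real vector: rescale it to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("the zero vector cannot be normalized into a state")
    return [x / norm for x in vec]


def fidelity(a, b):
    """Squared overlap between the two encoded states (assumed similarity metric)."""
    sa, sb = to_state(a), to_state(b)
    overlap = sum(x * y for x, y in zip(sa, sb))
    return overlap ** 2


same_direction = fidelity([1.0, 0.0], [2.0, 0.0])  # parallel vectors
orthogonal = fidelity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
diagonal = fidelity([1.0, 0.0], [1.0, 1.0])        # 45 degrees apart
```

Unlike cosine similarity, this measure ignores sign: a vector and its negation encode states with fidelity 1.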

[NLP-9] InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

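[Quick Read]: This paper addresses the limitations of existing graphical user interface (GUI) agents based on multimodal large language models (MLLMs), namely weak multi-step reasoning and reliance on textual annotations. Existing GUI agents often struggle to reason effectively over multiple steps on complex tasks and depend too heavily on textual annotations, which limits their effectiveness in task automation. To address this, the paper proposes InfiGUIAgent, an MLLM-based GUI agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 strengthens fundamental skills such as GUI understanding and grounding; Stage 2 uses synthesized data to integrate hierarchical reasoning and expectation-reflection reasoning, giving the agent native reasoning abilities. InfiGUIAgent achieves competitive performance on several GUI benchmarks, which highlights the value of native reasoning skills for GUI interaction in automation tasks.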

Link: https://arxiv.org/abs/2501.04575
Authors: Yuhang Liu,Pengxiang Li,Zishu Wei,Congkai Xie,Xueyu Hu,Xinchen Xu,Shengyu Zhang,Xiaotian Han,Hongxia Yang,Fei Wu
Affiliations: Zhejiang University; Dalian University of Technology; Reallm Labs; ByteDance Inc; The Hong Kong Polytechnic University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 14 pages, 7 figures, work in progress

Abstract:Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at this https URL.

[NLP-10] Supervision-free Vision-Language Alignment

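[Quick Read]: This paper addresses a challenge vision-language models (VLMs) face when integrating visual and linguistic information: their reliance on large amounts of high-quality image-text training data, which is time-consuming and computationally expensive to curate. The paper proposes SVP (Supervision-free Visual Projection), a framework that improves vision-language alignment without relying on curated data or preference annotation by using self-captioning and a pre-trained grounding model as a feedback mechanism. The key to SVP is eliciting latent information already present in the model. Experiments show significant improvements across captioning, referring, visual question answering, multitasking, hallucination control, and object recall.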

Link: https://arxiv.org/abs/2501.04568
Authors: Giorgio Giannone,Ruoteng Li,Qianli Feng,Evgeny Perevodchikov,Rui Chen,Aleix Martinez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Vision-language models (VLMs) have demonstrated remarkable potential in integrating visual and linguistic information, but their performance is often constrained by the need for extensive, high-quality image-text training data. Curation of these image-text pairs is both time-consuming and computationally expensive. To address this challenge, we introduce SVP (Supervision-free Visual Projection), a novel framework that enhances vision-language alignment without relying on curated data or preference annotation. SVP leverages self-captioning and a pre-trained grounding model as a feedback mechanism to elicit latent information in VLMs. We evaluate our approach across six key areas: captioning, referring, visual question answering, multitasking, hallucination control, and object recall. Results demonstrate significant improvements, including a 14% average improvement in captioning tasks, up to 12% increase in object recall, and substantial reduction in hallucination rates. Notably, a small VLM using SVP achieves hallucination reductions comparable to a model five times larger, while a VLM with initially poor referring capabilities more than doubles its performance, approaching parity with a model twice its size.

[NLP-11] OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

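[Quick Read]: This paper addresses two main obstacles in omnimodal learning: the lack of open omnimodal datasets and the inherent difficulty of real-time emotional speech generation. The authors propose openomni, a two-stage training method. In the first stage, omnimodal alignment, a pre-trained speech model is further trained on text-image tasks so that it generalizes from vision to speech in a (near) zero-shot manner. In the second stage, a lightweight decoder is trained on speech tasks and preference learning to enable real-time emotional speech generation. Experiments show that openomni improves consistently across omnimodal, vision-language, and speech-language evaluations and can produce natural, emotion-rich dialogue and real-time emotional speech.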

Link: https://arxiv.org/abs/2501.04561
Authors: Run Luo,Ting-En Lin,Haonan Zhang,Yuchuan Wu,Xiong Liu,Min Yang,Yongbin Li,Longze Chen,Jiaming Li,Lei Zhang,Yangyi Chen,Hamid Alinejad-Rokny,Fei Huang
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Tongyi Laboratory; UIUC; University of New South Wales
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.

[NLP-12] rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

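[Quick Read]: This paper addresses the difficulty small language models (SLMs) have in competing with large models such as OpenAI o1 on math reasoning. With rStar-Math, it shows that SLMs can substantially improve their math reasoning through "deep thinking" without distillation from stronger models. The solution rests on three innovations. First, a code-augmented chain-of-thought (CoT) data synthesis method uses Monte Carlo Tree Search (MCTS) to generate verified step-by-step reasoning trajectories for training the policy SLM. Second, a new process reward model training method avoids naive step-level score annotation and yields a more effective process preference model (PPM). Third, a self-evolution recipe builds the policy SLM and the PPM from scratch and iteratively evolves them to improve reasoning. With these innovations, rStar-Math achieves large gains on the MATH benchmark and, on the USA Math Olympiad (AIME), reaches a level comparable to top high school math students.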

Link: https://arxiv.org/abs/2501.04519
Authors: Xinyu Guan,Li Lyna Zhang,Yifei Liu,Ning Shang,Youran Sun,Yi Zhu,Fan Yang,Mao Yang
Affiliations: Microsoft Research Asia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising “deep thinking” through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs’ math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at this https URL.
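
The selection step of Monte Carlo Tree Search is commonly implemented with the UCT rule, which trades off a node's average value against how rarely it has been visited. The sketch below shows that generic rule only; the function names, the tuple layout, and the constant `c=1.4` are assumptions, and the abstract does not say which selection rule or reward signal rStar-Math actually uses.

```python
import math


def uct(mean_value, visits, parent_visits, c=1.4):
    """Generic UCT score: exploitation term plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited nodes are explored first
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=1.4):
    """Pick the child name with the highest UCT score.

    `children` is a list of (name, mean_value, visits) tuples (hypothetical layout).
    """
    return max(children, key=lambda ch: uct(ch[1], ch[2], parent_visits, c))[0]


# A never-visited step wins over a visited one.
first = select_child([("step a", 0.5, 10), ("step b", 0.4, 0)], parent_visits=10)
# With equal visit counts, the higher mean value wins.
second = select_child([("step a", 0.9, 5), ("step b", 0.1, 5)], parent_visits=10)
```

In a reasoning setting, `mean_value` would come from rollouts or a reward model scoring each candidate step.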

[NLP-13] Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

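[Quick Read]: This paper addresses how to make effective use of human feedback at the inference stage of generative models. The typical training-time feedback (a preference between two samples) does not transfer directly to inference. The paper introduces a new type of feedback, caption reformulations, and trains models to mimic reformulation feedback based on human annotations. A key property of the method is that it does not require training the image captioning model itself, which substantially reduces computational cost. Specifically, the authors collect a dataset of human reformulations that correct errors in generated captions and apply reformulation models trained on it at the inference stage of existing image captioning models, which improves caption quality, especially when the original captions are of low quality. The method is also applied to non-English image captioning and to style transfer, achieving state-of-the-art performance on German image captioning and English style transfer.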

Link: https://arxiv.org/abs/2501.04513
Authors: Uri Berger,Omri Abend,Lea Frermann,Gabriel Stanovsky
Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem; School of Computing and Information Systems, University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback – caption reformulations – and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.

[NLP-14] Developing a Modular Compiler for a Subset of a C-like Language

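[Quick Read]: This paper addresses the challenges of building a compiler for a high-level language, in particular how to make the compiler modular, memory-efficient, and easy to extend. It presents a development approach for a modular compiler that lets developers add or remove subsets of the language as needed, resulting in a minimal and efficient compiler. The key to the solution is dividing development into small steps, each of which yields a fully functioning compiler for a progressively larger subset of the language. By following industry best practices of modular design, code reusability, and documentation, the compiler achieves functional efficiency, maintainability, and extensibility. The compiler was also tested on a resource-constrained single-board computer, further showing its efficiency and suitability for memory-limited devices.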

Link: https://arxiv.org/abs/2501.04503
Authors: Debasish Dutta,Neeharika Sonowal,Irani Hazarika
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Performance (cs.PF)
Comments:

Abstract:The paper introduces the development of a modular compiler for a subset of a C-like language, which addresses the challenges in constructing a compiler for high-level languages. This modular approach will allow developers to modify a language by adding or removing subsets as required, resulting in a minimal and memory-efficient compiler. The development process is divided into small, incremental steps, where each step yields a fully functioning compiler for an expanding subset of the language. The paper outlines the iterative developmental phase of the compiler, emphasizing progressive enhancements in capabilities and functionality. Adherence to industry best practices of modular design, code reusability, and documentation has enabled the resulting compiler’s functional efficiency, maintainability, and extensibility. The compiler proved to be effective not only in managing the language structure but also in developing optimized code, which demonstrates its practical usability. This was also further assessed using the compiler on a tiny memory-deficient single-board computer, again showing the compiler’s efficiency and suitability for resource-constrained devices.
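
Adding or removing language subsets can be pictured as registering and unregistering per-statement modules with a small dispatcher. The sketch below is a toy under stated assumptions: the `ModularCompiler` class, the `let` / `print` statement syntax, and the stack-machine instructions are all hypothetical and far simpler than a C-like language; they are not the paper's design.

```python
class ModularCompiler:
    """Toy compiler whose supported statements are pluggable handler modules."""

    def __init__(self):
        self._handlers = {}

    def register(self, keyword, handler):
        """Add support for statements that start with `keyword`."""
        self._handlers[keyword] = handler

    def unregister(self, keyword):
        """Remove a language subset; later uses of it become compile errors."""
        self._handlers.pop(keyword, None)

    def compile(self, source):
        """Translate each source line into a list of stack-machine instructions."""
        code = []
        for lineno, line in enumerate(source.splitlines(), 1):
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            handler = self._handlers.get(tokens[0])
            if handler is None:
                raise SyntaxError(f"line {lineno}: unsupported statement {tokens[0]!r}")
            code.extend(handler(tokens[1:]))
        return code


def compile_let(args):
    """`let x 5` stores the literal 5 in variable x."""
    name, value = args
    return [f"PUSH {value}", f"STORE {name}"]


def compile_print(args):
    """`print x` loads variable x and prints it."""
    (name,) = args
    return [f"LOAD {name}", "PRINT"]


compiler = ModularCompiler()
compiler.register("let", compile_let)
compiler.register("print", compile_print)
program = compiler.compile("let x 5\nprint x")
```

Each registration step leaves a working compiler for a slightly larger language, which mirrors the incremental development described in the abstract.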

[NLP-15] PolInterviews – A Dataset of German Politician Public Broadcast Interviews

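[Quick read]: This paper addresses the lack of high-quality, structured datasets for political communication research, particularly in the German political context. To this end, the authors build a novel dataset of public broadcast interviews with high-ranking German politicians. The key to the solution is to source the interview videos from YouTube, transcribe them, process them for speaker identification, and store the result in a tidy, open format. The dataset contains 99 interviews with 33 different politicians across five major interview formats, for a total of 28,146 sentences. It is a valuable resource for studying political communication questions such as agenda-setting, interviewer dynamics, and politicians' self-presentation.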

Link: https://arxiv.org/abs/2501.04484
Authors: Lukas Birkenmaier, Laureen Sieber, Felix Bergstein
Affiliations: GESIS - Leibniz Institute for the Social Sciences, Mannheim, Germany; University of Chemnitz, Chemnitz, Germany; University of Mannheim, Mannheim, Germany
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a novel dataset of public broadcast interviews featuring high-ranking German politicians. The interviews were sourced from YouTube, transcribed, processed for speaker identification, and stored in a tidy and open format. The dataset comprises 99 interviews with 33 different German politicians across five major interview formats, containing a total of 28,146 sentences. As the first of its kind, this dataset offers valuable opportunities for research on various aspects of political communication in the (German) political contexts, such as agenda-setting, interviewer dynamics, or politicians’ self-presentation.

[NLP-16] When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages

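[Quick read]: This paper addresses quality estimation (QE) for machine translation of low-resource language pairs, that is, evaluation without reference translations. QE assigns a quality score (0-100) to a translated output, which makes it a challenging cross-lingual understanding task. The key contributions are a comprehensive evaluation of large language models (LLMs) in zero-shot and few-shot scenarios, and instruction fine-tuning with a novel prompt based on annotation guidelines. The results show that prompt-based approaches are outperformed by encoder-based fine-tuned QE models. The error analysis further reveals tokenization issues as well as errors caused by transliteration and named entities, and argues that LLM pre-training needs refinement for cross-lingual tasks. The data and trained models are released publicly to support further research.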

Link: https://arxiv.org/abs/2501.04473
Authors: Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Shenbin Qian
Affiliations: Institute for People-Centred AI and Centre for Translation Studies, University of Surrey, United Kingdom
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that provides a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We release the data, and models trained publicly for further research.

[NLP-17] Hidden Entity Detection from GitHub Leveraging Large Language Models KDD2024

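[Quick read]: This paper asks how large language models (LLMs) can automatically detect datasets and software in the textual content of GitHub repositories in specialized scenarios where large-scale training data is unavailable. Existing named entity recognition methods mostly depend on extensive training data; this study instead explores the potential of zero-shot learning (ZSL) and few-shot learning (FSL), which rely on the capabilities LLMs acquire during pretraining. The key to the solution is the use of different FSL prompt learning approaches to improve the LLMs' ability to identify dataset and software mentions in repository texts. The study also broadens the scope of traditional named entity recognition to include resources such as repositories and online hubs, which are often represented by URLs, offering a new perspective on automated entity detection.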

Link: https://arxiv.org/abs/2501.04455
Authors: Lu Gan, Martin Blum, Danilo Dessi, Brigitte Mathiak, Ralf Schenkel, Stefan Dietze
Affiliations: GESIS – Leibniz Institute for the Social Sciences, Köln, Germany; Heinrich Heine University Düsseldorf, Germany; University of Trier, Trier, Germany
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: accepted by KDD2024 workshop DL4KG

Abstract:Named entity recognition is an important task when constructing knowledge bases from unstructured data sources. Whereas entity detection methods mostly rely on extensive training data, Large Language Models (LLMs) have paved the way towards approaches that rely on zero-shot learning (ZSL) or few-shot learning (FSL) by taking advantage of the capabilities LLMs acquired during pretraining. Specifically, in very specialized scenarios where large-scale training data is not available, ZSL / FSL opens new opportunities. This paper follows this recent trend and investigates the potential of leveraging Large Language Models (LLMs) in such scenarios to automatically detect datasets and software within textual content from GitHub repositories. While existing methods focused solely on named entities, this study aims to broaden the scope by incorporating resources such as repositories and online hubs where entities are also represented by URLs. The study explores different FSL prompt learning approaches to enhance the LLMs’ ability to identify dataset and software mentions within repository texts. Through analyses of LLM effectiveness and learning strategies, this paper offers insights into the potential of advanced language models for automated entity detection.

[NLP-18] End-to-End Bangla AI for Solving Math Olympiad Problem Benchmark: Leveraging Large Language Model Using Integrated Approach

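[Quick read]: This paper aims to improve the performance of large language models (LLMs) on Bangla AI mathematical challenges, in particular their reasoning precision in a multilingual setting. The key elements of the solution are: 1) assessing a range of LLM configurations to identify the best one; 2) fine-tuning on specific datasets to adapt the model to the task; 3) applying Retrieval-Augmented Generation (RAG), which draws on an external knowledge base to support reasoning; and 4) using customized prompting, dataset augmentation, and iterative reasoning to further improve the model's efficiency on Olympiad-level mathematical problems. Together, these measures improve the model's mathematical reasoning in a multilingual setting.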

Link: https://arxiv.org/abs/2501.04425
Authors: H.M. Shadman Tabib, Jaber Ahmed Deedar
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This work introduces a systematic approach for enhancing large language models (LLMs) to address Bangla AI mathematical challenges. Through the assessment of diverse LLM configurations, fine-tuning with specific datasets, and the implementation of Retrieval-Augmented Generation (RAG), we enhanced the model’s reasoning precision in a multilingual setting. Crucial discoveries indicate that customized prompting, dataset augmentation, and iterative reasoning improve the model’s efficiency regarding Olympiad-level mathematical challenges.

[NLP-19] NSA: Neuro-symbolic ARC Challenge

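[Quick read]: This paper addresses the general reasoning ability evaluated by the Abstraction and Reasoning Corpus (ARC), which is difficult for both machine learning models and combinatorial search methods. It proposes a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search over a domain-specific language. The key idea is that the transformer narrows the search space by proposing promising search directions, so that the combinatorial search can find the actual solution in a short time. The authors pre-train the transformer on synthetically generated data and, at test time, generate additional task-specific training tasks to fine-tune the model. The approach surpasses comparable state-of-the-art methods on the ARC evaluation set by 27% and compares favourably on the ARC train set.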

Link: https://arxiv.org/abs/2501.04424
Authors: Paweł Batorski, Jannik Brinkmann, Paul Swoboda
Affiliations: Heinrich Heine Universität Düsseldorf; University of Mannheim
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in a short time. We pre-train the transformer with synthetically generated data. During test-time we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at this https URL.
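
The division of labour described in this abstract (a learned model proposes, a symbolic search verifies) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the three-primitive grid DSL, the `search` function, and the hard-coded `proposed` list that stands in for the transformer's output are hypothetical and are not taken from the paper's code.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], ...]

# A toy DSL of grid transformations (hypothetical, not the paper's DSL).
DSL: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: tuple(row[::-1] for row in g),  # mirror left-right
    "flip_v": lambda g: g[::-1],                        # mirror top-bottom
    "transpose": lambda g: tuple(zip(*g)),              # swap rows and columns
}


def run(program: Sequence[str], grid: Grid) -> Grid:
    """Apply the named primitives to the grid, left to right."""
    for name in program:
        grid = DSL[name](grid)
    return grid


def search(
    examples: List[Tuple[Grid, Grid]],
    proposed: Sequence[str],
    max_len: int = 3,
) -> Optional[List[str]]:
    """Enumerate programs built only from the proposed primitives, shortest first."""
    for length in range(1, max_len + 1):
        for program in product(proposed, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return list(program)
    return None


# One training pair: the output is the input rotated 90 degrees clockwise.
examples = [(((1, 2), (3, 4)), ((3, 1), (4, 2)))]
# Stand-in for the transformer's proposals: the search never tries "flip_h".
program = search(examples, proposed=["transpose", "flip_v"])
```

Restricting the enumeration to the proposed primitives is what shrinks the search space: with k proposals out of n primitives, programs of length L drop from n^L candidates to k^L.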

[NLP-20] SEO: Stochastic Experience Optimization for Large Language Models

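[Quick read]: This paper asks how to find useful experiences that suit a given large language model (LLM) and improve its performance on specific tasks. Because it is unclear which experiences suit which LLM, methods that find useful experiences automatically struggle to guarantee that the experiences they obtain are effective. The paper proposes Stochastic Experience Optimization (SEO), an iterative approach that finds optimized model-specific experience by updating the experience in natural language, without modifying model parameters. The key component of SEO is a stochastic validation method that checks the direction of each experience update and avoids ineffective updates. Experiments on three tasks with three LLMs show that SEO-optimized experiences yield consistent performance improvements, and that these experiences generalize to out-of-distribution data, improving LLM performance on similar tasks.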

Link: https://arxiv.org/abs/2501.04393
Authors: Jitao Xu, Hongyun Zhou, Lei Shen, Conghui Zhu, Jin Huang, Yitao Duan
Affiliations: NetEase Youdao, Beijing, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Faculty of Computing, Harbin Institute of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) can benefit from useful experiences to improve their performance on specific tasks. However, finding helpful experiences for different LLMs is not obvious, since it is unclear what experiences suit specific LLMs. Previous studies intended to automatically find useful experiences using LLMs, while it is difficult to ensure the effectiveness of the obtained experience. In this paper, we propose Stochastic Experience Optimization (SEO), an iterative approach that finds optimized model-specific experience without modifying model parameters through experience update in natural language. In SEO, we propose a stochastic validation method to ensure the update direction of experience, avoiding unavailing updates. Experimental results on three tasks for three LLMs demonstrate that experiences optimized by SEO can achieve consistently improved performance. Further analysis indicates that SEO-optimized experience can generalize to out-of-distribution data, boosting the performance of LLMs on similar tasks.
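
The abstract does not spell out how the stochastic validation works, so the following is only one plausible reading, sketched under stated assumptions: a candidate experience is kept only if it scores higher than the current one on a randomly sampled validation batch. The function names, the acceptance rule, and the toy `propose` and `score` stand-ins are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def optimize_experience(
    experience: str,
    propose: Callable[[str], str],
    score: Callable[[str, Sequence[int]], float],
    val_ids: List[int],
    steps: int = 10,
    batch_size: int = 4,
    seed: int = 0,
) -> str:
    """Keep a proposed experience only if it scores higher on a random validation batch."""
    rng = random.Random(seed)
    for _ in range(steps):
        candidate = propose(experience)
        batch = rng.sample(val_ids, min(batch_size, len(val_ids)))
        if score(candidate, batch) > score(experience, batch):
            experience = candidate  # validated update
        # otherwise the candidate is discarded and the old experience is kept
    return experience


# Toy stand-ins (hypothetical): a real system would call an LLM in both functions.
def toy_propose(exp: str) -> str:
    return exp + "!"


def toy_score(exp: str, batch: Sequence[int]) -> float:
    return -abs(len(exp) - 5)  # pretends the ideal experience has 5 characters


best = optimize_experience("abc", toy_propose, toy_score, val_ids=list(range(20)))
```

Note that the model parameters never appear in the loop: only the natural-language `experience` string changes, which matches the abstract's claim that no parameters are modified.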

[NLP-21] TimelineKGQA: A Comprehensive Question-Answer Pair Generator for Temporal Knowledge Graphs

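[Quick read]: This paper addresses question answering (QA) over temporal knowledge graphs (TKGs), which is important for understanding facts and relationships that evolve over time. Progress in this area is held back by limited datasets and by the difficulty of generating custom QA pairs. The paper proposes a novel categorization framework based on timeline-context relationships and introduces TimelineKGQA, a universal temporal QA generator applicable to any TKG. The key to the solution is to categorize questions systematically by their timeline-context relationships and to provide an open-source Python package that generates temporal QA pairs automatically.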

Link: https://arxiv.org/abs/2501.04343
Authors: Qiang Sun, Sirui Li, Du Huynh, Mark Reynolds, Wei Liu
Affiliations: The University of Western Australia; Murdoch University
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with TimelineKGQA, a universal temporal QA generator applicable to any TKGs. The code is available at: this https URL as an open source Python package.

[NLP-22] Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting

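[Quick read]: This paper addresses a weakness of Chain-of-Thought (CoT) prompting in large language models (LLMs): the model struggles to reason effectively when the key information required for reasoning is implicit or missing. CoT emphasizes the sequence of reasoning steps but overlooks the early extraction of essential information, so reasoning suffers when that information is not stated explicitly.

The key to the solution is a method called Iterative Summarization Pre-Prompting (ISP^2). It first extracts entities and their descriptions from the input to form candidate key information pairs. It then assesses these pairs with a reliability rating and merges the two lowest-ranked pairs into a new entity description. This process repeats until a single key information pair remains. Finally, that pair is fed to the LLM together with the original question to produce the answer. Experiments show a 7.1% improvement over existing methods, and ISP^2 can be integrated flexibly into a variety of reasoning frameworks.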

Link: https://arxiv.org/abs/2501.04341
Authors: Dong-Hai Zhu, Yu-Jie Xiong, Jia-Chen Zhang, Xi-Jiong Xie, Chun-Ming Xia
Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; School of Information Science and Engineering, Ningbo University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi-step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre-prompting method called Iterative Summarization Pre-Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest-ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre-prompting, offering flexible integration into diverse reasoning frameworks. The code is available at this https URL.
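
The merge loop described in this abstract (rate the pairs, merge the two lowest-rated, repeat until one pair is left) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: in the paper both the rating and the merging are presumably done by an LLM, whereas `toy_rate` and `toy_merge` below are hypothetical stand-ins chosen only so the example runs.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (entity, description)


def iterative_summarize(
    pairs: List[Pair],
    rate: Callable[[Pair], float],
    merge: Callable[[Pair, Pair], Pair],
) -> Pair:
    """Merge the two lowest-rated pairs until a single pair remains."""
    if not pairs:
        raise ValueError("need at least one (entity, description) pair")
    pool = list(pairs)
    while len(pool) > 1:
        pool.sort(key=rate)               # lowest reliability first
        merged = merge(pool[0], pool[1])  # combine the two weakest pairs
        pool = [merged] + pool[2:]
    return pool[0]


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def toy_rate(pair: Pair) -> float:
    return float(len(pair[1]))  # treats a longer description as more reliable


def toy_merge(a: Pair, b: Pair) -> Pair:
    return (f"{a[0]}+{b[0]}", f"{a[1]}; {b[1]}")


pairs = [("Tom", "has 3 apples"), ("Ann", "gives Tom 2 apples"), ("goal", "count")]
key_pair = iterative_summarize(pairs, toy_rate, toy_merge)
```

Each iteration removes exactly one pair from the pool, so n initial pairs need n - 1 merges; the final pair would then be placed in the prompt alongside the original question.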

[NLP-23] Who Does the Giant Number Pile Like Best: Analyzing Fairness in Hiring Contexts

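[Quick read]: This paper examines the fairness of hiring systems based on large language models (LLMs) in generative settings, through two real-world tasks: resume summarization and resume retrieval. Using a synthetic resume dataset and curated job postings, the study analyzes whether model behavior differs across demographic groups and how sensitive it is to demographic perturbations. Race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%. In the retrieval task, all evaluated models show non-uniform selection patterns across demographic groups and high sensitivity to both gender and race perturbations. Notably, the retrieval models are comparably sensitive to non-demographic changes, which suggests that fairness issues may stem in part from general model brittleness. Overall, the results indicate that LLM-based hiring systems, especially at the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world settings. The key to addressing this lies in understanding how model behavior differs across tasks and in making targeted improvements to model robustness and fairness.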

Link: https://arxiv.org/abs/2501.04316
Authors: Preethi Seshadri, Seraphina Goldfarb-Tarrant
Affiliations: UC Irvine; Cohere
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly being deployed in high-stakes applications like hiring, yet their potential for unfair decision-making and outcomes remains understudied, particularly in generative settings. In this work, we examine the fairness of LLM-based hiring systems through two real-world tasks: resume summarization and retrieval. By constructing a synthetic resume dataset and curating job postings, we investigate whether model behavior differs across demographic groups and is sensitive to demographic perturbations. Our findings reveal that race-based differences appear in approximately 10% of generated summaries, while gender-based differences occur in only 1%. In the retrieval setting, all evaluated models display non-uniform selection patterns across demographic groups and exhibit high sensitivity to both gender and race-based perturbations. Surprisingly, retrieval models demonstrate comparable sensitivity to non-demographic changes, suggesting that fairness issues may stem, in part, from general brittleness issues. Overall, our results indicate that LLM-based hiring systems, especially at the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes in real-world contexts.

[NLP-24] LLM4SR: A Survey on Large Language Models for Scientific Research

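[Quick read]: This paper systematically surveys how large language models (LLMs) are changing the scientific research process, focusing on their roles in four key stages: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. By analyzing task-specific methodologies and evaluation benchmarks, it shows the transformative potential of LLMs in scientific research. The key contribution is a comprehensive review of LLM applications across these stages that identifies current challenges and proposes future research directions, with the aim of inspiring and guiding researchers and practitioners who use LLMs to advance scientific inquiry. Related resources are available in the repository the authors provide.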

Link: https://arxiv.org/abs/2501.04306
Authors: Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du
Affiliations: University of Texas at Dallas; Nanyang Technological University
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: this https URL

[NLP-25] Multimodal Graph Constrastive Learning and Prompt for ChartQA

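[Quick read]: This paper addresses the challenges of chart question answering (ChartQA), which arise mainly from the complex distribution of chart elements and the implicit patterns in the underlying data. It proposes a joint multimodal scene graph made up of a visual graph and a textual graph, which capture the structural and semantic information in a chart respectively. To unify representations across modalities, it introduces a multimodal graph contrastive learning method that learns unified representations by maximizing the similarity between nodes that represent the same object across the multimodal graphs. In addition, given the need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, it designs Chain-of-Thought (CoT) prompts to reduce hallucinations. Both methods were evaluated on public benchmarks including ChartQA, OpenCQA, and ChartX, where they improved performance.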

Link: https://arxiv.org/abs/2501.04303
Authors: Yue Dai, Soyeon Caren Han, Wei Liu
Affiliations: The University of Western Australia; The University of Melbourne
Categories: Computation and Language (cs.CL)
Comments:

Abstract:ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. In this chapter, we have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns. Our proposed multimodal scene graph consists of two components: a visual graph and a textual graph, each designed to capture the structural and semantic information within the chart. To unify representations across these different modalities, we introduce a multimodal graph contrastive learning approach that learns unified representations by maximizing similarity between nodes representing the same object across multimodal graphs. The learned graph representations can be seamlessly incorporated into a transformer decoder as a soft prompt. Additionally, given the growing need for Multimodal Large Language Models (MLLMs) in zero-shot scenarios, we have designed Chain-of-Thought (CoT) prompts for MLLMs to reduce hallucinations. We tested both methods on public benchmarks such as ChartQA, OpenCQA, and ChartX, demonstrating improved performance and validating the effectiveness of our proposed methods.
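
The abstract states the contrastive objective only in words (maximize similarity between nodes that represent the same object in the two graphs). One common way to express such an objective is an InfoNCE-style loss; the sketch below uses that form as an assumption, since the abstract does not name the exact loss. The `info_nce` function, the temperature value, and the convention that node i in the visual graph matches node i in the textual graph are all illustrative choices.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(visual: List[Vector], textual: List[Vector], temperature: float = 0.1) -> float:
    """Average InfoNCE loss, assuming visual[i] and textual[i] describe the same object."""
    total = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, t) / temperature for t in textual]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the matching node
    return total / len(visual)


nodes = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(nodes, [[1.0, 0.0], [0.0, 1.0]])  # matching nodes agree
swapped_loss = info_nce(nodes, [[0.0, 1.0], [1.0, 0.0]])  # matching nodes disagree
```

The loss is small when each visual node is closest to its own textual counterpart and large when it is closer to a different node, which is the behaviour the abstract's description calls for.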

[NLP-26] IOLBENCH: Benchmarking LLMs on Linguistic Reasoning

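[Quick read]: This paper addresses the limitations of current deep neural networks on reasoning tasks that require structured, abstract thought, with a focus on linguistic reasoning. Despite remarkable progress in many areas, deep neural networks still fall short on complex linguistic reasoning tasks. The paper introduces IOLBENCH, a new benchmark derived from International Linguistics Olympiad (IOL) problems that covers syntax, morphology, phonology, and semantics, and is designed to test metacognitive linguistic reasoning. The tasks require models to deduce linguistic rules and patterns from minimal examples, challenging their capacity for compositional generalization and rule abstraction. Through extensive benchmarking of leading large language models (LLMs), the paper reveals both the strengths and the persistent limitations of current models in handling linguistic complexity, offering insights for developing models capable of human-like reasoning.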

Link: https://arxiv.org/abs/2501.04249
Authors: Satyam Goyal, Soham Dan
Affiliations: University of Michigan; Microsoft
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite the remarkable advancements and widespread applications of deep neural networks, their ability to perform reasoning tasks remains limited, particularly in domains requiring structured, abstract thought. In this paper, we investigate the linguistic reasoning capabilities of state-of-the-art large language models (LLMs) by introducing IOLBENCH, a novel benchmark derived from International Linguistics Olympiad (IOL) problems. This dataset encompasses diverse problems testing syntax, morphology, phonology, and semantics, all carefully designed to be self-contained and independent of external knowledge. These tasks challenge models to engage in metacognitive linguistic reasoning, requiring the deduction of linguistic rules and patterns from minimal examples. Through extensive benchmarking of leading LLMs, we find that even the most advanced models struggle to handle the intricacies of linguistic complexity, particularly in areas demanding compositional generalization and rule abstraction. Our analysis highlights both the strengths and persistent limitations of current models in linguistic problem-solving, offering valuable insights into their reasoning capabilities. By introducing IOLBENCH, we aim to foster further research into developing models capable of human-like reasoning, with broader implications for the fields of computational linguistics and artificial intelligence.

[NLP-27] Agent Laboratory: Using LLM Agents as Research Assistants

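[Quick read]: This paper addresses the long duration and high cost of scientific discovery. It introduces Agent Laboratory, an autonomous framework based on large language models (LLMs), to accelerate scientific discovery, reduce research costs, and improve research quality. The key to the solution is that the framework accepts a human-provided research idea and autonomously works through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report. Users can provide feedback and guidance at each stage, which significantly improves overall research quality. After deploying the framework with several state-of-the-art LLMs and inviting researchers to evaluate it, the authors report that Agent Laboratory generates machine learning code that achieves state-of-the-art performance and reduces research expenses by 84% compared with previous autonomous research methods, freeing researchers to spend more effort on creative ideation than on low-level coding and writing.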

Link: https://arxiv.org/abs/2501.04227
Authors: Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.

[NLP-28] Multimodal Multihop Source Retrieval for Web Question Answering

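[Quick read]: This paper addresses the challenge of learning and reasoning in multi-modal multi-hop question answering (QA). Specifically, it studies how to find the supporting facts needed to answer a question through multi-source reasoning paths across image and text modalities. The key to the solution is a graph reasoning network based on the semantic structure of sentences, which learns multi-source reasoning paths and uses graph structure to improve multi-modal multi-hop QA. The study argues that graph structure and the adjacency matrix are task-related prior knowledge that can improve retrieval performance. Experiments and visualized analysis indicate that message propagation over graph networks, or the entire graph structure, can replace massive multimodal transformers, and the method achieves a 4.6% gain in retrieval F1 score over the transformer baselines despite being a very light model.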

Link: https://arxiv.org/abs/2501.04173
Authors: Navya Yarrabelly, Saloni Mittal
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2010.03604 by other authors

Abstract:This work deals with the challenge of learning and reasoning over multi-modal multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn multi-source reasoning paths and find the supporting facts across both image and text modalities for answering the question. In this paper, we investigate the importance of graph structure for multi-modal multi-hop question answering. Our analysis is centered on WebQA. We construct a strong baseline model, that finds relevant sources using a pairwise classification task. We establish that, with the proper use of feature representations from pre-trained models, graph structure helps in improving multi-modal multi-hop question answering. We point out that both graph structure and adjacency matrix are task-related prior knowledge, and graph structure can be leveraged to improve the retrieval performance for the task. Experiments and visualized analysis demonstrate that message propagation over graph networks or the entire graph structure can replace massive multimodal transformers with token-wise cross-attention. We demonstrated the applicability of our method and show a performance gain of 4.6% retrieval F1 score over the transformer baselines, despite being a very light model. We further demonstrated the applicability of our model to a large scale retrieval setting.

[NLP-29] Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation

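[Quick read]: This paper addresses a key problem in personalized text generation: how to get large language models (LLMs) to make better use of personalized context so that outputs match the user's expectations. Standard LLM training rarely exposes the model to personalized context, so generated text can be inconsistent with the user's preferences, background knowledge, or writing style. The paper proposes Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG). The key to the framework is that it first generates reasoning paths to train the LLM's reasoning abilities, and then applies Expectation-Maximization Reinforced Self-Training to train the model iteratively on its own high-reward outputs. On the LongLaMP benchmark, REST-PG significantly outperforms state-of-the-art baselines, with an average relative performance gain of 14.5%.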

Link: https://arxiv.org/abs/2501.04167
Authors: Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Tao Chen, Zhuowan Li, Michael Bendersky, Hamed Zamani
Affiliations: University of Massachusetts Amherst; Google DeepMind; University of Michigan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Personalized text generation requires a unique ability of large language models (LLMs) to learn from context that they often do not encounter during their standard training. One way to encourage LLMs to better use personalized context for generating outputs that better align with the user’s expectations is to instruct them to reason over the user’s past preferences, background knowledge, or writing style. To achieve this, we propose Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG), a framework that trains LLMs to reason over personal data during response generation. REST-PG first generates reasoning paths to train the LLM’s reasoning abilities and then employs Expectation-Maximization Reinforced Self-Training to iteratively train the LLM based on its own high-reward outputs. We evaluate REST-PG on the LongLaMP benchmark, consisting of four diverse personalized long-form text generation tasks. Our experiments demonstrate that REST-PG achieves significant improvements over state-of-the-art baselines, with an average relative performance gain of 14.5% on the benchmark.

[NLP-30] MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

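[Quick read]: This paper addresses the weak performance of Vision-Language Models (VLMs) on specialized tasks such as chart and diagram understanding, which stems mainly from a lack of task-specific training data. Existing general-purpose datasets fail to capture the details these tasks need. The proposed solution is MM-Gen, a scalable method that uses stronger models to generate task-specific, high-quality synthetic text for candidate images. MM-Gen follows a three-stage targeted process: partitioning the data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs on MM-Gen data yields significant gains, for example 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared with human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, which shows its effectiveness at improving task-specific VLM performance and at bridging the gap between general-purpose datasets and specialized requirements.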

Link: https://arxiv.org/abs/2501.04155
Authors: Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
Affiliations: Microsoft Research; UCLA; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at this https URL.

[NLP-31] Multilingual Open QA on the MIA Shared Task

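[Quick read]: This paper addresses cross-lingual information retrieval (CLIR), in particular how to retrieve relevant text effectively for low-resource languages without any additional supervision or labelled data. It proposes a simple and effective re-ranking method for improving passage retrieval in open question answering. The method re-scores retrieved passages with a zero-shot multilingual question generation model: it computes the probability of the input question in the target language conditioned on each retrieved passage, which may be in a different language, and re-orders the passages accordingly. The advantage of the method is that it works in a completely zero-shot setting with no training and can be combined with any sparse retrieval method such as BM-25, which removes the need for an expensive labelled corpus and makes it well suited to retrieval for low-resource languages.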

Link: https://arxiv.org/abs/2501.04153
Authors: Navya Yarrabelly, Saloni Mittal, Ketan Todi, Kimihiro Hasegawa
Affiliations: Language Technologies Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Cross-lingual information retrieval (CLIR) [shi2021cross, asai2021one, jiang2020cross] can, for example, find relevant text in any language, such as English (high resource) or Telugu (low resource), even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop useful CLIR models for this constrained, yet important, setting where we do not require any kind of additional supervision or labelled data for the retrieval task and hence can work effectively for low-resource languages. We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot multilingual question generation model, which is a pre-trained language model, to compute the probability of the input question in the target language conditioned on a retrieved passage, which can possibly be in a different language. We evaluate our method in a completely zero-shot setting, and it does not require any training. Thus the main advantage of our method is that it can be used to re-rank results obtained by any sparse retrieval method such as BM-25. This eliminates the need for obtaining the expensive labelled corpus required for retrieval tasks, and hence the method can be used for low-resource languages.
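
The re-ranking rule in this abstract is to order passages by the probability of the question given each passage. The sketch below shows only that ordering step. The scorer is a deliberately crude, hypothetical stand-in (a smoothed unigram model over the passage's own words) so that the example runs without any model download; the paper uses a pre-trained multilingual question generation model instead, and the function names here are assumptions.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def toy_log_prob(question: str, passage: str) -> float:
    """Hypothetical stand-in scorer: add-one smoothed unigram log P(question | passage)."""
    counts = Counter(passage.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return sum(
        math.log((counts[word] + 1) / (total + vocab))
        for word in question.lower().split()
    )


def rerank(
    question: str,
    passages: List[str],
    log_prob: Callable[[str, str], float] = toy_log_prob,
) -> List[Tuple[float, str]]:
    """Sort passages by log P(question | passage), highest first."""
    scored = [(log_prob(question, passage), passage) for passage in passages]
    return sorted(scored, key=lambda item: item[0], reverse=True)


retrieved = [
    "bananas are rich in potassium",
    "paris is the capital of france",
]
ranked = rerank("capital of france", retrieved)
```

Because `rerank` only needs a scoring callable, the first-stage retriever is interchangeable, which is the property the abstract highlights when it mentions sparse methods such as BM-25.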

[NLP-32] “Yeah Right!” – Do LLMs Exhibit Multimodal Feature Transfer?

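[Quick read]: This paper examines the multimodal nature of human communication, in particular the transfer of skills from spoken to written communication. It assesses how well speech+text models, and text models trained with special emphasis on human-to-human conversations, make this multimodal transfer of skill, focusing on their ability to detect covert deceptive communication. The study finds that, with no special prompting, speech+text models have an advantage over unimodal LLMs on this task, and that models trained on human-to-human conversations are also advantaged in this skill. The key takeaway is that multimodal data and conversation-trained models can improve the detection of covert deceptive communication.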

Link: https://arxiv.org/abs/2501.04138
Authors: Benjamin Reichman, Kartik Talamadupula
Affiliations: Georgia Tech; Wand AI; Symbl.AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Human communication is a multifaceted and multimodal skill. Communication requires an understanding of both the surface-level textual content and the connotative intent of a piece of communication. In humans, learning to go beyond the surface level starts by learning communicative intent in speech. Once humans acquire these skills in spoken communication, they transfer those skills to written communication. In this paper, we assess the ability of speech+text models and text models trained with special emphasis on human-to-human conversations to make this multimodal transfer of skill. We specifically test these models on their ability to detect covert deceptive communication. We find that with no special prompting speech+text LLMs have an advantage over unimodal LLMs in performing this task. Likewise, we find that human-to-human conversation-trained LLMs are also advantaged in this skill.

[NLP-33] More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives

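【Quick read】: This paper addresses the observation that when large language models (LLMs) perform in-context learning (ICL), performance plateaus and can even decline as the number of demonstrations grows from a few to many. It attributes this trend to two causes: a suboptimal negative log-likelihood (NLL) optimization objective and incremental data noise. To address them, the paper proposes DR-ICL (Differentiated Learning and advantage-based Reweighting objectives). Globally, it optimizes the NLL objective through differentiated learning so that many-shot performance exceeds zero-shot levels; locally, it dynamically adjusts the weights of many-shot demonstrations using cumulative advantages inspired by reinforcement learning, which improves generalization. The paper also builds a multi-task dataset, MICLB (Many-Shot ICL Benchmark), for evaluating many-shot ICL strategies across tasks. Experiments show that LLMs enhanced with DR-ICL improve significantly in many-shot settings on both in-domain and out-of-domain tasks.

Link: https://arxiv.org/abs/2501.04070
Authors: Xiaoqing Zhang,Ang Lv,Yuhan Liu,Flood Sung,Wei Liu,Shuo Shang,Xiuying Chen,Rui Yan
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; MoonshotAI; Xiaomi AI Lab; University of Electronic Science and Technology of China; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 8 figures, 11 tables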

Abstract:Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as the number of ICL demonstrations increases from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DR-ICL, a novel optimization method that enhances model performance through Differentiated Learning and advantage-based Reweighting objectives. Globally, DR-ICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby improving generalization. This approach allows the model to handle varying numbers of shots effectively, mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (MICLB), a large-scale benchmark covering shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for fine-tuning purposes. MICLB facilitates the evaluation of many-shot ICL strategies across seven prominent NLP tasks and 50 distinct datasets. Experimental results demonstrate that LLMs enhanced with DR-ICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and benchmark dataset hoping to facilitate further research in many-shot ICL.

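[NLP-34] The Power of Negative Zero: Datatype Customization for Quantized Large Language Models

【Quick read】: This paper addresses the high memory requirements that large language models (LLMs) face at deployment, specifically the fact that under low-precision (3-bit and 4-bit) quantization the floating-point (FP) datatype contains both positive and negative zero, which limits its representation capability. To address this, the paper proposes an extended FP datatype called Redundant Zero Remapping (RaZeR). RaZeR remaps the FP encoding of negative zero to a set of pre-defined special values, making full use of the FP quantization encodings and fitting LLM numerical distributions more closely. The key point is that, with carefully selected special values, RaZeR outperforms conventional asymmetric integer (INT) quantization while remaining computationally efficient. RaZeR also integrates with quantization algorithms for both weights and the KV cache, including advanced methods that use clipping and transformations, and improves model accuracy. The paper further implements a fast GEMV (general matrix-vector multiplication) kernel that converts 4-bit RaZeR values to FP16 through novel bit-level manipulation, which speeds up computation and LLM decoding throughput on modern GPUs.

Link: https://arxiv.org/abs/2501.04052
Authors: Yuzong Chen,Xilai Dai,Chi-chih Chang,Yash Akhauri,Mohamed S. Abdelfattah
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: under submission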

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks, quickly becoming one of the most prevalent AI workloads. Yet the substantial memory requirement of LLMs significantly hinders their deployment for end users. Post-training quantization (PTQ) serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of LLMs. Although the traditional integer (INT) datatype has received widespread adoption in PTQ methods, floating-point (FP) quantization has emerged as a viable alternative thanks to its effectiveness in fitting LLM numerical distributions. However, the FP datatype in sign-magnitude binary representation contains both positive and negative zero, which constrains its representation capability, particularly under low precision (3 and 4 bits). In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR), which remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit LLM numerical distributions. Through careful selection of special values, RaZeR outperforms conventional asymmetric INT quantization while achieving high computational efficiency. We demonstrate that RaZeR can be seamlessly integrated with quantization algorithms for both weights and KV-cache, including advanced methods with clipping and transformations, and consistently achieve better model accuracy. Additionally, we implement a fast GEMV kernel with fused dequantization that efficiently converts the 4-bit RaZeR value to FP16 through novel bit-level manipulation. On modern GPUs, our evaluation shows that RaZeR improves the GEMV speed by up to 7.56× compared to the FP16 implementation, while achieving up to 2.72× speedup in the LLM decoding throughput.
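To make the redundant-zero idea concrete, here is a small sketch of what remapping the negative-zero encoding buys in a 4-bit format. It is illustrative only and rests on stated assumptions: the magnitudes follow the common E2M1 4-bit layout, the special value `5.0` and the example weights are hypothetical, and per-group scaling, clipping, and the paper's procedure for selecting special values are all omitted.

```python
# Assumption: magnitudes of the common 4-bit E2M1 sign-magnitude layout.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fp4_grid(special_value=None):
    """Return the sorted distinct values a 4-bit sign-magnitude FP code can take.

    +0 and -0 decode to the same number, so 16 encodings give only 15
    distinct levels. If `special_value` is given, the -0 encoding is
    treated as decoding to it instead.
    """
    grid = set(FP4_MAGNITUDES) | {-m for m in FP4_MAGNITUDES}
    if special_value is not None:
        grid.add(special_value)
    return sorted(grid)


def quantize(x, grid):
    """Round x to the nearest representable value."""
    return min(grid, key=lambda g: abs(g - x))


def mse(values, grid):
    """Mean squared quantization error of `values` on `grid`."""
    return sum((v - quantize(v, grid)) ** 2 for v in values) / len(values)


weights = [-5.2, -2.4, -0.7, 0.1, 0.9, 2.6, 4.9, 5.1]  # hypothetical weights
baseline = fp4_grid()
remapped = fp4_grid(special_value=5.0)  # hypothetical special value

print(len(baseline), len(remapped))
print(mse(weights, baseline), mse(weights, remapped))
```

Since the remapped grid is a superset of the baseline grid, nearest-value rounding can never get worse; how much it improves depends on whether the special value lands where the weights actually cluster.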

[NLP-35] A Survey on Large Language Models with some Insights on their Capabilities and Limitations

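【Quick read】: This paper surveys how the capabilities of large language models (LLMs) extend across natural language processing tasks and the mechanisms behind them, with emphasis on Transformer-based models such as GPT and LLaMA. Its central questions are how LLMs generalize across tasks, how they exhibit planning and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. The key to its approach is an analysis of the foundational components, scaling mechanisms, and architectural strategies of LLMs, in particular the effect of exponential growth in data and compute on model performance. The paper also discusses CoT (Chain of Thought) and PoT (Plan of Thought) abilities in LLMs and how pre-training data influences their emergence. Through LLM-modulo frameworks that integrate external systems, it further studies the potential of LLMs on complex, dynamic tasks. These analyses aim to advance the discussion of LLM capabilities and limitations and to promote responsible development and application in complex environments.

Link: https://arxiv.org/abs/2501.04040
Authors: Andrea Matarazzo,Riccardo Torlone
Affiliations: Expedia Group; Roma Tre University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 174 pages, to be submitted to a journal in a shorter version. arXiv admin note: text overlap with arXiv:2303.18223 , arXiv:2303.17564 , arXiv:2301.00234 , arXiv:2303.08774 , arXiv:2402.02315 , arXiv:2210.03493 , arXiv:2402.01817 , arXiv:2407.21783 , arXiv:2208.05051 by other authors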

Abstract:The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors, such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, exhibit planning, and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.

[NLP-36] Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation

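【Quick read】: This paper tackles the challenge of decoding speech from non-invasive brain signals such as electroencephalography (EEG), in particular noisy data, limited datasets, and poor performance on complex tasks such as speech perception. The key to the solution is using variational autoencoders (VAEs) for EEG data augmentation to improve data quality, and applying a sequence-to-sequence deep learning architecture that performed well on electromyography (EMG) tasks to EEG speech decoding. The study also adapts this architecture for word classification. Experiments show that VAEs can generate artificial EEG data for augmentation, and that the sequence-to-sequence model is more promising at generating sentences than the classification model, although both tasks remain challenging. These findings lay the groundwork for future research on EEG speech perception decoding and may extend to speech production tasks such as silent or imagined speech.

Link: https://arxiv.org/abs/2501.04359
Authors: Terrance Yu-Hao Chen,Yulin Chen,Pontus Soederhaell,Sadrishya Agrawal,Kateryna Shapovalenko
Affiliations: Carnegie Mellon University
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 19 pages, 15 figures, 2 tables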

Abstract:Decoding speech from non-invasive brain signals, such as electroencephalography (EEG), has the potential to advance brain-computer interfaces (BCIs), with applications in silent communication and assistive technologies for individuals with speech impairments. However, EEG-based speech decoding faces major challenges, such as noisy data, limited datasets, and poor performance on complex tasks like speech perception. This study attempts to address these challenges by employing variational autoencoders (VAEs) for EEG data augmentation to improve data quality and applying a state-of-the-art (SOTA) sequence-to-sequence deep learning architecture, originally successful in electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we adapt this architecture for word classification tasks. Using the Brennan dataset, which contains EEG recordings of subjects listening to narrated speech, we preprocess the data and evaluate both classification and sequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments show that VAEs have the potential to reconstruct artificial EEG data for augmentation. Meanwhile, our sequence-to-sequence model achieves more promising performance in generating sentences compared to our classification model, though both remain challenging tasks. These findings lay the groundwork for future research on EEG speech perception decoding, with possible extensions to speech production tasks such as silent or imagined speech.

[NLP-37] Circuit Complexity Bounds for Visual Autoregressive Model

【Quick read】: This paper studies the expressive power of the Visual AutoRegressive (VAR) model and establishes a bound on its circuit complexity. Although VAR models perform strongly in image generation and outperform earlier techniques such as Diffusion Transformers, the limits of their expressive power had not been analyzed rigorously. The paper's key result is a proof that the VAR model is equivalent to a simulation by a uniform TC^0 threshold circuit with hidden dimension d ≤ O(n) and poly(n) precision, which is the first rigorous statement of the limitations of VAR models' expressive power. The finding provides a theoretical basis for understanding the inherent constraints of VAR models and guidance for developing more efficient and more expressive architectures.

Link: https://arxiv.org/abs/2501.04299
Authors: Yekun Ke,Xiaoyu Li,Yingyu Liang,Zhenmei Shi,Zhao Song
Affiliations: The University of Hong Kong; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Understanding the expressive ability of a specific model is essential for grasping its capacity limitations. Recently, several studies have established circuit complexity bounds for Transformer architecture. Besides, the Visual AutoRegressive (VAR) model has risen to be a prominent method in the field of image generation, outperforming previous techniques, such as Diffusion Transformers, in generating high-quality images. We investigate the circuit complexity of the VAR model and establish a bound in this study. Our primary result demonstrates that the VAR model is equivalent to a simulation by a uniform TC^0 threshold circuit with hidden dimension d ≤ O(n) and poly(n) precision. This is the first study to rigorously highlight the limitations in the expressive power of VAR models despite their impressive performance. We believe our findings will offer valuable insights into the inherent constraints of these models and guide the development of more efficient and expressive architectures in the future.

Computer Vision

[CV-0] Planarian Neural Networks: Evolutionary Patterns from Basic Bilateria Shaping Modern Artificial Neural Network Architectures

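【Quick read】: This paper aims to improve the prediction accuracy of artificial neural networks (ANNs) on image classification tasks. The key to the solution is a new network design that borrows from the architecture of biological nervous systems, specifically that of planarians. The planarian nervous system consists of a brain and two nerve cords, and the authors believe this structure offers useful insights for improving ANN performance. Using ResNet as the base model, experiments on the CIFAR-10 and CIFAR-100 datasets show that the planarian-inspired network achieves higher prediction accuracy than the baseline models on image classification. The results suggest that biologically inspired architectures have significant potential for improving ANN performance.

Link: https://arxiv.org/abs/2501.04700
Authors: Ziyuan Huang,Mark Newman,Maria Vaida,Srikar Bellur,Roozbeh Sadeghian,Andrew Siu,Hui Wang,Kevin Huggins
Affiliations: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures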

Abstract:This study examined the viability of enhancing the prediction accuracy of artificial neural networks (ANNs) in image classification tasks by developing ANNs with evolution patterns similar to those of biological neural networks. ResNet is a widely used family of neural networks with both deep and wide variants; therefore, it was selected as the base model for our investigation. The aim of this study is to improve the image classification performance of ANNs via a novel approach inspired by the biological nervous system architecture of planarians, which comprises a brain and two nerve cords. We believe that the unique neural architecture of planarians offers valuable insights into the performance enhancement of ANNs. The proposed planarian neural architecture-based neural network was evaluated on the CIFAR-10 and CIFAR-100 datasets. Our results indicate that the proposed method exhibits higher prediction accuracy than the baseline neural network models in image classification tasks. These findings demonstrate the significant potential of biologically inspired neural network architectures in improving the performance of ANNs in a wide range of applications.

[CV-1] EditAR: Unified Conditional Generation with Autoregressive Models

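【Quick read】: This paper addresses a gap in controllable image generation and editing: existing methods, such as diffusion-based ones, perform very well on specific tasks but are hard to turn into a single unified model. The authors propose EditAR, a unified framework based on an autoregressive model that handles a range of conditional image generation tasks, including image editing, depth-to-image, edge-to-image, and segmentation-to-image. Its core idea is a unified tokenized representation, with the edited image tokens predicted autoregressively. To improve text-to-image alignment, the authors further propose distilling knowledge from foundation models into the autoregressive modeling process. Experiments show that EditAR performs well on several benchmark tasks and is competitive with existing task-specific methods.

Link: https://arxiv.org/abs/2501.04699
Authors: Jiteng Mu,Nuno Vasconcelos,Xiaolong Wang
Affiliations: UC San Diego; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL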

Abstract:Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, and segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. To enhance the text-to-image alignment, we further propose to distill the knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance to various state-of-the-art task-specific methods. Project page: this https URL

[CV-2] ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning

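【Quick read】: This paper addresses two key challenges in Multi-Concept Video Customization (MCVC): 1) the identity decoupling problem, where existing customization methods inevitably mix attributes when handling multiple concepts; and 2) the scarcity of high-quality video-entity pairs, which are essential for training a model that represents and decouples multiple concepts well. To address them, the paper proposes the ConceptMaster framework. Its core idea is a strategy for learning decoupled multi-concept embeddings that are injected into the diffusion model in a standalone manner, which preserves the quality of customized videos with multiple identities and maintains concept fidelity even for highly similar visual concepts. The paper also designs a data construction pipeline that systematically collects precise multi-concept video-entity data to overcome the data scarcity. Evaluated along three dimensions (concept fidelity, identity decoupling ability, and video generation quality), ConceptMaster significantly outperforms existing methods across a variety of concept composition scenarios, opening a path to personalized and semantically accurate multi-concept videos.

Link: https://arxiv.org/abs/2501.04698
Authors: Yuzhou Huang,Ziyang Yuan,Quande Liu,Qiulin Wang,Xintao Wang,Ruimao Zhang,Pengfei Wan,Di Zhang,Kun Gai
Affiliations: Sun Yat-sen University; Kuaishou Technology; The Chinese University of Hong Kong, Shenzhen; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL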

Abstract:Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mixes attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.

[CV-3] Grokking at the Edge of Numerical Stability

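【Quick read】: This paper studies "grokking" in deep learning, the sudden generalization that appears after prolonged overfitting. The phenomenon challenges our understanding of deep learning, especially the cause of the delayed generalization and its dependence on regularization. The paper argues that, without regularization, grokking tasks push models to the edge of numerical stability and cause floating-point errors in the Softmax function, which it calls Softmax Collapse (SC). SC prevents grokking, and mitigating SC enables grokking without regularization. The paper further finds that after overfitting, the gradients align strongly with what it calls the naïve loss minimization (NLM) direction. This gradient component does not change the model's predictions but lowers the loss by scaling the logits, which eventually leads to SC and halts further learning. To address this, the paper makes two key contributions: StableMax, a new activation function that prevents SC and enables grokking without regularization; and ⊥Grad, a training algorithm that promotes fast generalization on grokking tasks by preventing NLM altogether. These contributions offer new insight into grokking's delayed generalization, its reliance on regularization, and the effectiveness of existing grokking-inducing methods.

Link: https://arxiv.org/abs/2501.04697
Authors: Lucas Prieto,Melih Barsbey,Pedro A.M. Mediano,Tolga Birdal
Affiliations: Department of Computing, Imperial College London
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: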

Abstract:Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model’s predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and ⊥Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at this https URL.
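The floating-point effect described in this abstract can be reproduced in a few lines. The sketch below is an illustration under assumptions, not the paper's experiment: it uses Python's 64-bit floats and a hypothetical two-class logit vector, and it only scales the logits without changing which class is predicted. Once the scale is large enough, the smaller exponential is absorbed when added to 1.0, the softmax probability of the correct class rounds to exactly 1.0, and the computed loss becomes exactly zero, so no learning signal remains from that sample.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy computed through a max-shifted softmax in 64-bit floats."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return -math.log(exps[target] / total)


# Hypothetical logits: class 0 is correct and already predicted.
base_logits = [1.0, 0.0]

for scale in (1.0, 10.0, 40.0):
    scaled = [scale * z for z in base_logits]
    # Scaling never changes the argmax, only the magnitude of the logits.
    print(scale, cross_entropy(scaled, target=0))

# The exact loss at scale 40 is log(1 + exp(-40)), which is small but positive.
exact_loss_at_40 = math.log1p(math.exp(-40.0))
print(exact_loss_at_40)
```

In lower-precision formats the same absorption happens at much smaller logit scales, which is why this matters in practice for training runs.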

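[CV-4] Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation

【Quick read】: This paper addresses the weak performance of existing methods for zero-shot, open-vocabulary semantic segmentation (OVSS) on highly domain-specific datasets. Current open-vocabulary approaches do well on standard segmentation benchmarks, but on specialized domain tasks they still trail supervised methods. The proposed Seg-TTO framework closes this gap with segmentation-specific test-time optimization. Its key elements are: 1) in the textual modality, learning multiple embeddings per category to capture the diverse concepts within an image; and 2) in the visual modality, computing pixel-level losses followed by embedding aggregation operations that preserve spatial structure. As a plug-and-play module, Seg-TTO is integrated with three state-of-the-art OVSS methods and shows performance gains across 22 challenging OVSS tasks, setting a new state of the art.

Link: https://arxiv.org/abs/2501.04696
Authors: Ulindu De Silva,Didula Samaraweera,Sasini Wanigathunga,Kavindu Kariyawasam,Kanchana Ranasinghe,Muzammal Naseer,Ranga Rodrigo
Affiliations: University of Moratuwa; Stony Brook University; Khalifa University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: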

Abstract:We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open-vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of supervised counterparts on highly domain-specific datasets. We focus on segmentation-specific test-time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding aggregation operations specific to preserving spatial structure. Our resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains. Seg-TTO demonstrates clear performance improvements across these tasks, establishing a new state of the art. Code: this https URL.

[CV-5] Re-ranking the Context for Multimodal Retrieval Augmented Generation

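【Quick read】: This paper addresses a challenge in the retrieval phase of multimodal retrieval-augmented generation (RAG) systems: selecting entries from the knowledge base that are actually relevant to the user query, so that irrelevant information does not interfere. The key to the solution is a more advanced relevancy score (RS) measure, based on a metric the authors designed in earlier work for evaluating RAG performance, which is used to select more relevant entries during retrieval. Compared with conventional retrieval based on embeddings (such as CLIP-based embeddings) and cosine similarity, the new approach adaptively selects up to k entries rather than a fixed number, which markedly improves retrieval accuracy and the quality of generated responses. Evaluation on the COCO dataset shows significant improvements in both relevant-context selection and response accuracy.

Link: https://arxiv.org/abs/2501.04695
Authors: Matin Mortaheb,Mohammad A. Amir Khojastepour,Srimat T. Chakradhar,Sennur Ulukus
Affiliations: University of Maryland; NEC Laboratories America
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments: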

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge to generate a response within a context with improved accuracy and reduced hallucinations. However, multi-modal RAG systems face unique challenges: (i) the retrieval process may select entries (e.g., images, documents) that are irrelevant to the user query, and (ii) vision-language models or multi-modal language models like GPT-4o may hallucinate when processing these entries to generate RAG output. In this paper, we aim to address the first challenge, i.e., improving the selection of relevant context from the knowledge base in the retrieval phase of multi-modal RAG. Specifically, we leverage the relevancy score (RS) measure designed in our previous work for evaluating RAG performance to select more relevant entries in the retrieval process. Retrieval based on embeddings, say CLIP-based embeddings, and cosine similarity usually performs poorly, particularly for multi-modal data. We show that by using a more advanced relevancy measure, one can enhance the retrieval process by selecting more relevant pieces from the knowledge base and eliminate the irrelevant pieces from the context by adaptively selecting up to k entries instead of a fixed number of entries. Our evaluation using the COCO dataset demonstrates significant enhancement in selecting relevant context and in the accuracy of the generated response.
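The difference between fixed top-k retrieval and adaptive "up to k" selection can be sketched as below. This is only an illustration of the selection step: the abstract does not say how the cut-off is decided, so the `threshold` parameter is an assumption, and the entry names and relevancy scores are hypothetical values standing in for the output of the learned RS measure.

```python
def select_fixed(scored_entries, k):
    """Fixed top-k: always return k entries, relevant or not."""
    ranked = sorted(scored_entries, key=lambda item: item[1], reverse=True)
    return [entry for entry, _ in ranked[:k]]


def select_up_to_k(scored_entries, k, threshold):
    """Adaptive selection: at most k entries, and only those scoring >= threshold.

    Assumption: a simple score threshold decides how many entries survive.
    """
    ranked = sorted(scored_entries, key=lambda item: item[1], reverse=True)
    return [entry for entry, score in ranked[:k] if score >= threshold]


# Hypothetical (entry, relevancy score) pairs for one query.
scored = [
    ("img_dog_park.jpg", 0.91),
    ("img_cat_sofa.jpg", 0.12),
    ("img_dog_beach.jpg", 0.78),
    ("img_city_night.jpg", 0.05),
]

fixed_context = select_fixed(scored, k=3)
adaptive_context = select_up_to_k(scored, k=3, threshold=0.5)
print(fixed_context)
print(adaptive_context)
```

With a fixed k the low-scoring third entry still enters the context; the adaptive variant passes a shorter context when fewer entries are relevant.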

[CV-6] SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

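【Quick read】: This paper addresses single-image 3D object reconstruction. Existing methods fall into two groups: regression-based modeling and generative modeling. Regression methods infer visible surfaces efficiently but struggle with occluded regions; generative methods handle uncertain regions better by modeling distributions, but they are computationally expensive and their outputs are often misaligned with visible surfaces. The paper proposes SPAR3D, a two-stage approach that aims to combine the strengths of both. The first stage uses a lightweight point diffusion model to generate sparse 3D point clouds with fast sampling; the second stage combines the generated point cloud with the input image to produce a highly detailed mesh. This two-stage design allows probabilistic modeling of the ill-posed single-image 3D task while keeping computational efficiency and output fidelity high. Using point clouds as the intermediate representation also supports interactive user edits. Experiments show that SPAR3D outperforms existing methods on multiple datasets, at an inference speed of 0.7 seconds.

Link: https://arxiv.org/abs/2501.04689
Authors: Zixuan Huang,Mark Boss,Aaryaman Vasishta,James M. Rehg,Varun Jampani
Affiliations: Stability AI; UIUC
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: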

Abstract:We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: this https URL

[CV-7] DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests

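【Quick read】: This paper addresses the difficulties that large vision-language models (LVLMs) face on complex visual reasoning tasks, in particular the over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning that stem from the modality gap between textual and visual data. Existing benchmarks usually rely on schematic or synthetic images and on imprecise machine-generated explanations, so they do not accurately reflect a model's reasoning ability in complex real-world scenes.

The key to the solution is a new benchmark, DrivingVQA, derived from driving theory tests. It contains 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded in entities relevant to the reasoning process. Using this dataset, the paper studies in depth how well LVLMs reason about complex visual scenes. Experiments show that both open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning in zero-shot settings. The paper further explores training strategies that exploit the relevant entities and finds that reasoning over cropped image regions tied to those entities improves performance by up to 7%.

Link: https://arxiv.org/abs/2501.04671
Authors: Charles Corbière,Simon Roburin,Syrielle Montariol,Antoine Bosselut,Alexandre Alahi
Affiliations: Institution1; Institution2
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: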

Abstract:Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs’ ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.

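[CV-8] Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

【Quick read】: This paper addresses systematic shortcomings in the visual matching ability of current multimodal large language models (MLLMs). Existing multimodal models are strong at visual perception, reasoning, and vision-language understanding, yet they still show clear deficiencies on visual matching tasks. The authors build a Multimodal Visual Matching (MMVM) benchmark to evaluate more than 30 different MLLMs fairly. The benchmark is built from 15 open-source datasets and Internet videos with manual annotation. The authors also design an automatic annotation pipeline that produces the MMVM SFT dataset, containing 220K visual matching samples, and propose the CoLVA model, which improves visual matching through two new technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. CoLVA reaches 51.06% overall accuracy on the MMVM benchmark, exceeding GPT-4o and the baseline by 8.41% and 23.58% respectively, which supports the effectiveness of the MMVM SFT dataset and the new designs.

Link: https://arxiv.org/abs/2501.04670
Authors: Yikang Zhou,Tao Zhang,Shilin Xu,Shihao Chen,Qianyu Zhou,Yunhai Tong,Shunping Ji,Jiangning Zhang,Xiangtai Li,Lu Qi
Affiliations: Wuhan University; Bytedance Seed; Peking University; Zhejiang University; STJU (unknown)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL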

Abstract:Recent advancements in multimodal models have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, studies on visual matching ability are missing, even though finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in strong current MLLMs such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning, and an instruction augmentation strategy. CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at this https URL.

[CV-9] Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling

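[Quick read]: This paper targets two main challenges in virtual try-on: first, paired (human, garment) training data is scarce; second, generating textures that exactly match the target garment is difficult and often yields distorted text and faded textures. It proposes two remedies. First, a garment extraction model generates (human, synthetic garment) pairs from a single image of a clothed person, which augments the training data for virtual try-on. Second, an Error-Aware Refinement-based Schrödinger Bridge (EARSB) uses a weakly supervised error classifier to localize erroneous regions and uses the classifier's confidence heatmap to adjust the Schrödinger Bridge's noise schedule, correcting the output of a base virtual try-on model. Experiments show that the synthetic data augmentation improves existing models, while EARSB improves overall image quality.

Link: https://arxiv.org/abs/2501.04666
Authors: Nannan Li,Kevin J. Shih,Bryan A. Plummer
Affiliations: Boston University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: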

Abstract:Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match that of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work explores ways to tackle these issues through both synthetic data as well as model refinement. We introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on. We also propose an Error-Aware Refinement-based Schrödinger Bridge (EARSB) that surgically targets localized generation errors for correcting the output of a base virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, subsequently augmenting the Schrödinger Bridge’s noise schedule with its confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB improves the overall image quality. In user studies, our model is preferred by the users in an average of 59% of cases.

[CV-10] Discrete Wavelet Transform-Based Capsule Network for Hyperspectral Image Classification

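[Quick read]: This paper addresses the high computational cost of hyperspectral image (HSI) classification, in particular the cost incurred when capsule networks (CapsNets) extract spectral-spatial information through fully connected capsule layers. It proposes DWT-CapsNet, based on the Discrete Wavelet Transform (DWT). The key points are: 1) a tailored attention mechanism is combined with a DWT-based downsampling layer in the feature extractor, reducing the information loss of conventional downsampling; 2) a new multi-scale routing algorithm prunes a large proportion of the connections in the CapsNet; 3) a capsule pyramid fusion mechanism aggregates spectral-spatial relationships at multiple levels of granularity, followed by self-attention within a partially and locally connected architecture to emphasize meaningful relationships. Experiments show state-of-the-art classification accuracy at lower computational demand, making the method suitable for practical HSI classification.

Link: https://arxiv.org/abs/2501.04643
Authors: Zhiqiang Gao,Jiaqi Wang,Hangchi Shen,Zhihao Dou,Xiangbo Zhang,Kaizhu Huang
Affiliations: WKU (Western Kentucky University); Hainan Normal University; Southwest University; Duke University; Duke Kunshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages; 9 figures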

Abstract: Hyperspectral image (HSI) classification is a crucial technique for remote sensing to build large-scale earth monitoring systems. HSI contains much more information than traditional visual images for identifying the categories of land covers. One recent feasible solution for HSI is to leverage CapsNets for capturing spectral-spatial information. However, these methods incur high computational costs due to the full connection architecture between stacked capsule layers. To solve this problem, a DWT-CapsNet is proposed to identify partial but important connections in CapsNet for an effective and efficient HSI classification. Specifically, we integrate a tailored attention mechanism into a Discrete Wavelet Transform (DWT)-based downsampling layer, alleviating the information loss problem of conventional downsampling operations in feature extractors. Moreover, we propose a novel multi-scale routing algorithm that prunes a large proportion of connections in CapsNet. A capsule pyramid fusion mechanism is designed to aggregate the spectral-spatial relationships at multiple levels of granularity, and then a self-attention mechanism is further applied in a partially and locally connected architecture to emphasize the meaningful relationships. As shown in the experimental results, our method achieves state-of-the-art accuracy while keeping lower computational demand in terms of running time, FLOPs, and the number of parameters, rendering it an appealing choice for practical implementation in HSI classification.
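
To see why a DWT-based downsampling layer loses less information than pooling or striding, it helps to look at a single-level 2D Haar transform: it halves the spatial resolution but keeps the discarded detail in three extra sub-bands, so the input can be rebuilt exactly. The sketch below is a plain NumPy illustration of that property, not the paper's layer; the wavelet choice (Haar) and the band names are assumptions, and band naming conventions differ between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-pass band: the downsampled image
    lh = (a + b - c - d) / 2.0            # detail band: top minus bottom rows
    hl = (a - b + c - d) / 2.0            # detail band: left minus right columns
    hh = (a - b - c + d) / 2.0            # detail band: diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
```

A network can concatenate the four half-resolution bands along the channel axis, so the spatial size shrinks while the information stays available to later layers.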

[CV-11] Disentangled Clothed Avatar Generation with Layered Representation

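[Quick read]: This paper tackles the generation of clothed avatars with disentangled components such as body, hair, and clothes. Existing methods generate diverse digital avatars successfully but still struggle to produce avatars whose components are disentangled. The proposed LayerAvatar is the first feed-forward diffusion-based method for this task. Its key innovation is a layered UV feature plane representation, in which components are placed in different layers of a Gaussian-based UV feature plane with corresponding semantic labels. The representation supports high-resolution, real-time rendering and expressive animation, including controllable gestures and facial expressions. The authors train a single-stage diffusion model on this representation and add constraint terms to handle the severe occlusion of the innermost human body layer. Experiments show strong results in generating disentangled clothed avatars, and the paper further explores component transfer as an application.

Link: https://arxiv.org/abs/2501.04631
Authors: Weitian Zhang,Sijing Wu,Manwen Liao,Yichao Yan
Affiliations: Shanghai Jiao Tong University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL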

Abstract: Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have achieved success in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution and real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion problem of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its applications in component transfer. The project page is available at: this https URL

[CV-12] FatesGS: Fast and Accurate Sparse-View Surface Reconstruction using Gaussian Splatting with Depth-Feature Consistency AAAI2025

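[Quick read]: This paper addresses 3D surface reconstruction from sparse views. Existing Gaussian Splatting-based methods work well with dense views, but with sparse views they overfit the training views, produce noisy floaters, and leave reconstructed surfaces incomplete. The proposed sparse-view reconstruction framework uses monocular depth ranking information to supervise the consistency of the depth distribution within local patches and applies a smoothness loss to improve the continuity of that distribution. It then optimizes the absolute position of depth through multi-view projection features to obtain finer surfaces. On DTU and BlendedMVS the method outperforms existing approaches with a 60x to 200x speedup, achieving fast, fine-grained mesh reconstruction without costly pre-training.

Link: https://arxiv.org/abs/2501.04628
Authors: Han Huang,Yulun Wu,Chao Deng,Ge Gao,Ming Gu,Yu-Shen Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025. Project page: this https URL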

Abstract:Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multi-view reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observed that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstruction surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.
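
Monocular depth estimators are usually reliable about which pixel is farther but not about absolute scale, which is why ranking supervision is attractive. The sketch below shows one common way to write such a constraint, a pairwise hinge loss on sampled pixel pairs inside a patch. It is an assumption-laden illustration: the function `depth_ranking_loss`, the pair sampling, and the hinge form are hypothetical and may differ from the loss used in the paper.

```python
import numpy as np

def depth_ranking_loss(rendered, mono, num_pairs=256, seed=0):
    """Penalise pixel pairs whose rendered depth order contradicts the monocular prior."""
    rng = np.random.default_rng(seed)
    r, m = rendered.ravel(), mono.ravel()
    i = rng.integers(0, r.size, num_pairs)
    j = rng.integers(0, r.size, num_pairs)
    order = np.sign(m[i] - m[j])          # +1 if the prior says pixel i is farther
    valid = order != 0                    # skip ties and self-pairs
    if not valid.any():
        return 0.0
    # Zero when the rendered order agrees with the prior, positive otherwise.
    violation = np.maximum(0.0, -order[valid] * (r[i][valid] - r[j][valid]))
    return float(violation.mean())

rng = np.random.default_rng(1)
mono_patch = rng.normal(size=(16, 16))                               # hypothetical monocular depth patch
consistent = depth_ranking_loss(3.0 * mono_patch + 2.0, mono_patch)  # same order, different scale
flipped = depth_ranking_loss(-mono_patch, mono_patch)                # reversed order
```

Because only the order matters, a rendered depth that differs from the prior by scale and shift incurs no penalty, while a reversed ordering does.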

[CV-13] Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion

[Quick read]: This paper addresses the inter-frame inconsistency that arises when diffusion-based text-to-image (T2I) generation is applied to video editing. Existing methods rely on temporal layer fine-tuning or inference-time temporal propagation, which suffer from high training cost or limited temporal coherence. The paper proposes a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion. The framework has three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks), which capture frame-specific features and enforce smooth inter-frame transitions through temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks), which use bilateral filters to reduce noise and artifacts and improve spatial coherence; (3) a Token-based Semantic Consistency Module (TSC Module), which maintains semantic alignment using shared prompt tokens and frame-specific tokens. The method improves perceptual quality, text-image alignment, and temporal coherence, as shown on the MSR-VTT dataset, and offers a practical solution for text-to-video (T2V) editing.

Link: https://arxiv.org/abs/2501.04606
Authors: Yangfan He,Sida Li,Kun Li,Jianhui Wang,Binxu Li,Tianyu Shi,Jun Yin,Miao Zhang,Xueqian Wang
Affiliations: University of Minnesota; Peking University; Xiamen University Malaysia; University of Electronic Science and Technology of China; Stanford University; University of Toronto; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame-independence of T2I generation often results in poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions via temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; and (3) Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment using shared prompt tokens and frame-specific tokens. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves enhanced fidelity and frame-to-frame coherence, offering a practical solution for T2V editing.

[CV-14] FrontierNet: Learning Visual Cues to Explore

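[Quick read]: This paper addresses the heavy reliance of existing exploration methods for autonomous robots in unknown environments, such as frontier-based methods, on 3D map operations. These methods are limited by map quality and often ignore valuable information in visual cues. The paper proposes an efficient autonomous exploration system based only on 2D visual cues, built around a learning-based model called FrontierNet. From posed RGB images enhanced by monocular depth priors, FrontierNet (i) detects frontiers and (ii) predicts their information gain. The approach offers an alternative to existing 3D-map-dependent exploration systems and, in extensive simulations and real-world experiments, improves early-stage exploration efficiency by 16%.

Link: https://arxiv.org/abs/2501.04597
Authors: Boyang Sun,Hanzhi Chen,Stefan Leutenegger,Cesar Cadena,Marc Pollefeys,Hermann Blum
Affiliations: ETH Zurich; Technical University of Munich; Microsoft; Uni Bonn
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: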

Abstract: Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for tasks such as mapping, object discovery, and environmental assessment. Existing methods, such as frontier-based methods, rely heavily on 3D map operations, which are limited by map quality and often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose an image-only frontier-based exploration system, with FrontierNet as a core component developed in this work. FrontierNet is a learning-based model that (i) detects frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent exploration systems, achieving a 16% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments.

[CV-15] Identity-Preserving Video Dubbing Using Motion Warping

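[Quick read]: This paper addresses insufficient identity preservation in video dubbing. Existing methods generate accurate mouth shapes from audio but often fail to preserve the distinctive textural and structural details of the person in the reference video, mainly because they do not capture the subtle interplay between audio cues and the visual attributes of the reference identity. The proposed IPTalker framework is built around a transformer-based alignment mechanism that dynamically captures and models the correspondence between audio features and reference images, enabling precise, identity-aware audio-visual integration. A motion warping strategy then refines the results, and a dedicated refinement process reduces occlusion artifacts and better preserves fine-grained textures such as mouth details and skin features. Experiments show that IPTalker outperforms existing methods in realism, lip synchronization, and identity preservation, and the authors report a new state of the art for high-quality, identity-consistent video dubbing.

Link: https://arxiv.org/abs/2501.04586
Authors: Runzhen Liu,Qinjie Lin,Yunfei Liu,Lijian Lin,Ye Zhu,Yu Li,Chuhua Xian,Fa-Ting Hong
Affiliations: Department of Computer Science and Engineering, South China University of Technology; Department of Computer Science, Northwestern University; Vistring Lab, IDEA; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review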

Abstract: Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of the reference identity. As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between driving audio and reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model the correspondence between audio features and reference images, thereby enabling precise, identity-aware audio-visual integration. Building on this alignment, a motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement process then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features. Extensive qualitative and quantitative evaluations demonstrate that IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention, establishing a new state of the art for high-quality, identity-consistent video dubbing.

[CV-16] Boosting Salient Object Detection with Knowledge Distillated from Large Foundation Models

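[Quick read]: This paper addresses the time and cost of manually annotating pseudo labels in Salient Object Detection (SOD). Traditional methods depend on precise pixel-level annotation, which is inefficient in practice. The paper proposes a low-cost, high-precision annotation method that uses large foundation models to generate pseudo labels. Specifically, the authors adopt a weakly supervised approach in which textual prompts guide the large models to produce pseudo labels. Because large models do not focus well on the salient regions of an image, the authors manually annotate a subset of the text and use it to fine-tune the model. Based on this method, the paper introduces a new dataset, BDS-TR, which exceeds the existing DUTS-TR dataset in scale and scene diversity and provides a more comprehensive foundation for SOD research. The paper also proposes an edge decoder based on dynamic upsampling, which focuses on object edges while gradually recovering the resolution of image features. Experiments show that the method significantly outperforms state-of-the-art approaches on several benchmark datasets and even surpasses some fully supervised SOD methods.

Link: https://arxiv.org/abs/2501.04582
Authors: Miaoyang He,Shuyong Gao,Tsui Qin Mok,Weifeng Ge,Wengqiang Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: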

Abstract:Salient Object Detection (SOD) aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated pseudo labels with precise pixel-level accuracy, which is time-consuming. We developed a low-cost, high-precision annotation method by leveraging large foundation models to address the challenges. Specifically, we use a weakly supervised approach to guide large models in generating pseudo-labels through textual prompts. Since large models do not effectively focus on the salient regions of images, we manually annotate a subset of text to fine-tune the model. Based on this approach, which enables precise and rapid generation of pseudo-labels, we introduce a new dataset, BDS-TR. Compared to the previous DUTS-TR dataset, BDS-TR is more prominent in scale and encompasses a wider variety of categories and scenes. This expansion will enhance our model’s applicability across a broader range of scenarios and provide a more comprehensive foundational dataset for future SOD research. Additionally, we present an edge decoder based on dynamic upsampling, which focuses on object edges while gradually recovering image feature resolution. Comprehensive experiments on five benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches and also surpasses several existing fully-supervised SOD methods. The code and results will be made available.

[CV-17] Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision AAAI2025

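[Quick read]: This paper addresses the limited adaptability and generalization of image compression models: the decoded bitstream usually serves only human or only machine needs and fails to preserve information for unseen visual tasks. The key to the solution is supervision from multimodal pre-training models combined with adaptive multi-objective optimization, so that a single bitstream supports both human visual perception and machine vision. The method is called Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, the paper introduces Contrastive Language-Image Pre-training (CLIP) models as a training constraint to improve generalization, and applies global-to-instance-wise CLIP supervision to obtain hierarchical semantics, which helps the model serve tasks that rely on information of different granularity. It also proposes a conditional decoding strategy that decodes according to human or machine preferences, so the same bitstream can be decoded into versions suited to different needs. UG-ICM is trained in a self-supervised manner without reliance on any specific downstream model or task. Experiments show marked improvements on a variety of unseen machine analytics tasks while producing perceptually satisfying images.

Link: https://arxiv.org/abs/2501.04579
Authors: Kangsheng Yin,Quan Liu,Xuelin Shen,Yulin He,Wenhan Yang,Shiqi Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, 10 figures, published at AAAI 2025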

Abstract:The image compression model has long struggled with adaptability and generalization, as the decoded bitstream typically serves only human or machine needs and fails to preserve information for unseen visual tasks. Therefore, this paper innovatively introduces supervision obtained from multimodal pre-training models and incorporates adaptive multi-objective optimization tailored to support both human visual perception and machine vision simultaneously with a single bitstream, denoted as Unified and Generalized Image Coding for Machine (UG-ICM). Specifically, to get rid of the reliance between compression models with downstream task supervision, we introduce Contrastive Language-Image Pre-training (CLIP) models into the training constraint for improved generalization. Global-to-instance-wise CLIP supervision is applied to help obtain hierarchical semantics that make models more generalizable for the tasks relying on the information of different granularity. Furthermore, for supporting both human and machine visions with only a unifying bitstream, we incorporate a conditional decoding strategy that takes as conditions human or machine preferences, enabling the bitstream to be decoded into different versions for corresponding preferences. As such, our proposed UG-ICM is fully trained in a self-supervised manner, i.e., without awareness of any specific downstream models and tasks. The extensive experiments have shown that the proposed UG-ICM is capable of achieving remarkable improvements in various unseen machine analytics tasks, while simultaneously providing perceptually satisfying images.

[CV-18] Learnable Scaled Gradient Descent for Guaranteed Robust Tensor PCA

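[Quick read]: This paper addresses the low computational efficiency of Robust Tensor Principal Component Analysis (RTPCA). When processing multi-dimensional data, existing methods rely on the computationally expensive Tensor Nuclear Norm (TNN), which limits their scalability in practice. The paper explores, for the first time, an efficient Scaled Gradient Descent (SGD) approach within the tensor Singular Value Decomposition (t-SVD) framework and proposes the RTPCA-SGD method. The theoretical analysis shows that, with appropriate parameter selection, RTPCA-SGD converges linearly to the true low-rank tensor at a constant rate that does not depend on the condition number. The paper also proposes a learnable self-supervised deep unfolding model to improve practical applicability. Experiments on synthetic and real-world datasets show superior performance, with better computational efficiency than the conventional RTPCA-TNN method.

Link: https://arxiv.org/abs/2501.04565
Authors: Lanlan Feng,Ce Zhu,Yipeng Liu,Saiprasad Ravishankar,Longxiu Huang
Affiliations: University of Electronic Science and Technology of China (UESTC); Michigan State University (MSU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: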

Abstract:Robust tensor principal component analysis (RTPCA) aims to separate the low-rank and sparse components from multi-dimensional data, making it an essential technique in the signal processing and computer vision fields. Recently emerging tensor singular value decomposition (t-SVD) has gained considerable attention for its ability to better capture the low-rank structure of tensors compared to traditional matrix SVD. However, existing methods often rely on the computationally expensive tensor nuclear norm (TNN), which limits their scalability for real-world tensors. To address this issue, we explore an efficient scaled gradient descent (SGD) approach within the t-SVD framework for the first time, and propose the RTPCA-SGD method. Theoretically, we rigorously establish the recovery guarantees of RTPCA-SGD under mild assumptions, demonstrating that with appropriate parameter selection, it achieves linear convergence to the true low-rank tensor at a constant rate, independent of the condition number. To enhance its practical applicability, we further propose a learnable self-supervised deep unfolding model, which enables effective parameter learning. Numerical experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed methods while maintaining competitive computational efficiency, especially consuming less time than RTPCA-TNN.
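
The idea behind scaled gradient descent is easiest to see in the matrix case: each factor's gradient is preconditioned by the inverse Gram matrix of the other factor, which removes the dependence of the step on the singular values. The sketch below is only that matrix analogue, with no sparse component and no t-SVD, started near the solution; the function name, step size, and test problem are assumptions for illustration and do not reproduce RTPCA-SGD.

```python
import numpy as np

def scaled_gd(Y, L, R, step=0.5, iters=100):
    """Fit Y ~ L @ R.T by gradient descent with Gram-matrix preconditioning."""
    for _ in range(iters):
        residual = L @ R.T - Y
        # Plain gradients are residual @ R and residual.T @ L;
        # multiplying by the inverse Gram matrices is the "scaling".
        L_next = L - step * residual @ R @ np.linalg.inv(R.T @ R)
        R_next = R - step * residual.T @ L @ np.linalg.inv(L.T @ L)
        L, R = L_next, R_next
    return L, R

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(30, 2)))
V, _ = np.linalg.qr(rng.normal(size=(20, 2)))
sigma = np.array([100.0, 1.0])                 # rank 2, condition number 100
Y = (U * sigma) @ V.T
# Start close to the true balanced factors, as local convergence results assume.
L0 = U * np.sqrt(sigma) + 1e-3 * rng.normal(size=(30, 2))
R0 = V * np.sqrt(sigma) + 1e-3 * rng.normal(size=(20, 2))
L_hat, R_hat = scaled_gd(Y, L0, R0)
rel_err = np.linalg.norm(L_hat @ R_hat.T - Y) / np.linalg.norm(Y)
```

The same fixed step size works even though the two singular values differ by a factor of 100, whereas plain gradient descent would need a step tuned to the largest singular value and would then progress slowly along the smallest one.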

[CV-19] Combining YOLO and Visual Rhythm for Vehicle Counting

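[Quick read]: This paper addresses the high computational complexity of video-based vehicle detection and counting. Traditional methods usually involve two steps, initial detection and subsequent tracking, applied to every video frame, which adds considerable computational load. The paper proposes a more efficient method that drops the tracking step and detects vehicles only in key video frames. It combines YOLO (You Only Look Once) for vehicle detection with Visual Rhythm, a technique that builds time-spatial images so that processing can focus on frames containing useful information. On real videos the method achieves a mean counting accuracy of about 99.15% and runs three times faster than tracking-based methods.

Link: https://arxiv.org/abs/2501.04534
Authors: Victor Nascimento Ribeiro,Nina S. T. Hirata
Affiliations: University of São Paulo - USP; Institute of Mathematics and Statistics
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2023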

Abstract:Video-based vehicle detection and counting play a critical role in managing transport infrastructure. Traditional image-based counting methods usually involve two main steps: initial detection and subsequent tracking, which are applied to all video frames, leading to a significant increase in computational complexity. To address this issue, this work presents an alternative and more efficient method for vehicle detection and counting. The proposed approach eliminates the need for a tracking step and focuses solely on detecting vehicles in key video frames, thereby increasing its efficiency. To achieve this, we developed a system that combines YOLO, for vehicle detection, with Visual Rhythm, a way to create time-spatial images that allows us to focus on frames that contain useful information. Additionally, this method can be used for counting in any application involving unidirectional moving targets to be detected and identified. Experimental analysis using real videos shows that the proposed method achieves mean counting accuracy around 99.15% over a set of videos, with a processing speed three times faster than tracking based approaches.
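
A time-spatial image of the kind the abstract describes can be built by copying the pixels on a fixed counting line from every frame and stacking them over time; vehicles crossing the line then show up as marks whose rows identify the frames worth passing to the detector. The sketch below illustrates this on a synthetic grayscale video. The function name, the choice of a horizontal line, and the toy "vehicle" are assumptions, and no YOLO model is involved.

```python
import numpy as np

def visual_rhythm(frames, line_row):
    """Stack the pixels on one horizontal line from every frame.

    frames: (T, H, W) grayscale video. Returns a (T, W) time-spatial image.
    """
    return frames[:, line_row, :].copy()

# Synthetic video: a bright block, 3 rows tall and 4 columns wide, moving down one row per frame.
T, H, W = 30, 20, 16
frames = np.zeros((T, H, W))
for t in range(T):
    top = t - 5                                   # block enters the frame at t = 5
    rows = slice(max(top, 0), max(min(top + 3, H), 0))
    frames[t, rows, 6:10] = 1.0

rhythm = visual_rhythm(frames, line_row=10)
# Frames in which something crosses the counting line; only these need detection.
key_frames = np.flatnonzero(rhythm.max(axis=1) > 0)
print(key_frames.tolist())  # [13, 14, 15]
```

In this toy video only 3 of 30 frames touch the counting line, which is where the saving over running detection and tracking on every frame comes from.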

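[CV-20] Towards Fair Class-wise Robustness: Class Optimal Distribution Adversarial Training

[Quick read]: This paper addresses the "robust fairness" problem that arises when adversarial training is used to improve the robustness of deep neural networks: robustness differs significantly across classes. Existing work mainly mitigates this with class-wise reweighted methods, but these lack rigorous theoretical analysis and explore the weight space only to a limited extent, relying on heuristic algorithms or intuition to compute the weights. In addition, because the weights and the model parameters are optimized separately, these methods cannot guarantee a consistent optimization direction, which can lead to suboptimal weight assignments and a suboptimal model.

To address these problems, the paper proposes a new min-max training framework, Class Optimal Distribution Adversarial Training (CODAT). The framework uses distributionally robust optimization to fully explore the class-wise weight space, so that the optimal weights can be identified with theoretical guarantees. The paper derives a closed-form optimal solution to the inner maximization problem and obtains a deterministic equivalent objective function, which provides a theoretical basis for jointly optimizing the weights and the model parameters. It also proposes a fairness elasticity coefficient for evaluating an algorithm with respect to both robustness and robust fairness. Experiments show that the method effectively improves robust fairness and outperforms state-of-the-art approaches on several datasets.

Link: https://arxiv.org/abs/2501.04527
Authors: Hongxin Zhi,Hongtao Yu,Shaome Li,Xiuming Zhao,Yiteng Wu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: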

Abstract:Adversarial training has proven to be a highly effective method for improving the robustness of deep neural networks against adversarial attacks. Nonetheless, it has been observed to exhibit a limitation in terms of robust fairness, characterized by a significant disparity in robustness across different classes. Recent efforts to mitigate this problem have turned to class-wise reweighted methods. However, these methods suffer from a lack of rigorous theoretical analysis and are limited in their exploration of the weight space, as they mainly rely on existing heuristic algorithms or intuition to compute weights. In addition, these methods fail to guarantee the consistency of the optimization direction due to the decoupled optimization of weights and the model parameters. They potentially lead to suboptimal weight assignments and consequently, a suboptimal model. To address these problems, this paper proposes a novel min-max training framework, Class Optimal Distribution Adversarial Training (CODAT), which employs distributionally robust optimization to fully explore the class-wise weight space, thus enabling the identification of the optimal weight with theoretical guarantees. Furthermore, we derive a closed-form optimal solution to the internal maximization and then get a deterministic equivalent objective function, which provides a theoretical basis for the joint optimization of weights and model parameters. Meanwhile, we propose a fairness elasticity coefficient for the evaluation of the algorithm with regard to both robustness and robust fairness. Experimental results on various datasets show that the proposed method can effectively improve the robust fairness of the model and outperform the state-of-the-art approaches.
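
The abstract does not state which ambiguity set CODAT uses or what its closed form looks like. To illustrate how an inner maximization over class weights can collapse into a deterministic objective, the sketch below uses the textbook KL-regularised case: maximising sum_k w_k * loss_k - tau * KL(w || uniform) over the probability simplex gives softmax weights and a log-mean-exp objective. The function names and the parameter `tau` are assumptions, and this is not claimed to be the paper's derivation.

```python
import numpy as np

def kl_dro_weights(class_losses, tau=1.0):
    """Worst-case class weights: w_k proportional to exp(loss_k / tau)."""
    z = np.asarray(class_losses, dtype=float) / tau
    z -= z.max()                                  # numerical stability
    w = np.exp(z)
    return w / w.sum()

def kl_dro_objective(class_losses, tau=1.0):
    """Closed-form value of the inner maximisation: tau * log(mean(exp(loss / tau)))."""
    losses = np.asarray(class_losses, dtype=float)
    m = losses.max()
    return m + tau * np.log(np.mean(np.exp((losses - m) / tau)))

losses = np.array([0.2, 1.5, 0.7])                # hypothetical per-class adversarial losses
weights = kl_dro_weights(losses, tau=0.5)
# Evaluate the penalised objective directly at the closed-form weights.
explicit = weights @ losses - 0.5 * np.sum(weights * np.log(weights * len(losses)))
closed_form = kl_dro_objective(losses, tau=0.5)
```

Since the closed-form objective is a differentiable function of the per-class losses, the weights never have to be updated in a separate step, which is the kind of joint optimization of weights and model parameters the abstract argues for.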

[CV-21] MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration

[Quick read]: This paper addresses the computational inefficiency of Transformer networks in image restoration, which stems from the quadratic computational complexity of Softmax-attention and limits their use on high-resolution images. The authors propose a new Transformer variant based on the Taylor expansion, named MB-TaylorFormer V2. The model approximates Softmax-attention with a Taylor expansion and uses a norm-preserving mapping to approximate the remainder of the first-order expansion, reducing the computational complexity to linear. It also introduces a multi-branch architecture with multi-scale patch embedding, which brings four advantages: 1) receptive fields of various sizes; 2) multi-level semantic information; 3) flexible receptive field shapes; 4) faster training and inference. Experiments show that MB-TaylorFormer V2 achieves state-of-the-art performance on multiple image restoration tasks, including image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead.

Link: https://arxiv.org/abs/2501.04486
Authors: Zhi Jin,Yuwei Qiu,Kaihao Zhang,Hongdong Li,Wenhan Luo
Affiliations: School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University; Guangdong Provincial Key Laboratory of Fire Science and Technology; School of Computer Science and Technology, Shenzhen Campus of Harbin Institute of Technology; School of Engineering and Computer Science, Australian National University; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of Taylor formula expansion-based Transformer (for short MB-TaylorFormer V2) has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at this https URL.
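
The reason a first-order Taylor expansion yields linear complexity is that replacing exp(q·k) with 1 + q·k makes the attention weights linear in q, so the sums over keys can be computed once and reused for every query. The sketch below checks this on random data by comparing the explicit N x N form with the reordered form. It is a simplified illustration: unit-normalising q and k so that 1 + q·k stays non-negative is an assumption here, and the remainder approximation and multi-branch design of MB-TaylorFormer V2 are not modelled.

```python
import numpy as np

def taylor_attention_quadratic(q, k, v):
    """First-order Taylor attention computed with the explicit (N, N) weight matrix."""
    weights = 1.0 + q @ k.T                        # approximates exp(q . k)
    return (weights @ v) / weights.sum(axis=1, keepdims=True)

def taylor_attention_linear(q, k, v):
    """Same result without the (N, N) matrix, by summing over keys first."""
    kv = k.T @ v                                   # (d, d_v), shared by all queries
    numer = v.sum(axis=0) + q @ kv                 # sum_j v_j + q_i . sum_j k_j v_j^T
    denom = k.shape[0] + q @ k.sum(axis=0)         # N + q_i . sum_j k_j
    return numer / denom[:, None]

rng = np.random.default_rng(0)
n, d = 64, 8
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
q /= np.linalg.norm(q, axis=1, keepdims=True)      # unit norm keeps 1 + q.k >= 0
k /= np.linalg.norm(k, axis=1, keepdims=True)
v = rng.normal(size=(n, d))
out_quadratic = taylor_attention_quadratic(q, k, v)
out_linear = taylor_attention_linear(q, k, v)
```

The quadratic form costs O(N^2 d) time and O(N^2) memory, while the reordered form costs O(N d^2), which is what matters when N is the number of pixels or patches in a high-resolution image.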

[CV-22] Rethinking High-speed Image Reconstruction Framework with Spike Camera AAAI2025

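[Quick read]: This paper addresses the reconstruction of high-quality images from the spike streams of a spike camera under low-light conditions. Conventional learning-based methods usually train on synthetic datasets, but they handle the noisy spikes produced in low light poorly, and performance degrades further on real-world datasets. The main causes are inadequate noise modelling and the domain gap between synthetic and real datasets, which lead to reconstructed images with unclear textures, excessive noise, and insufficient brightness.

The key to the solution is SpikeCLIP, a new spike-to-image reconstruction framework that goes beyond the conventional training paradigm. Using the CLIP model's strong ability to align text and images, SpikeCLIP takes textual descriptions of the captured scene and unpaired high-quality datasets as supervision. Experiments on the real-world low-light datasets U-CALTECH and U-CIFAR show that SpikeCLIP significantly improves the texture details and luminance balance of the reconstructed images. The reconstructed images also align well with the broader visual features needed for downstream tasks, giving more robust and versatile performance in challenging environments.

Link: https://arxiv.org/abs/2501.04477
Authors: Kang Chen,Yajing Zheng,Tiejun Huang,Zhaofei Yu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025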

Abstract:Spike cameras, as innovative neuromorphic devices, generate continuous spike streams to capture high-speed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality images from the spike input under low-light conditions remains challenging. Conventional learning-based methods often rely on the synthetic dataset as the supervision for training. Still, these approaches falter when dealing with noisy spikes fired under the low-light environment, leading to further performance degradation in the real-world dataset. This phenomenon is primarily due to inadequate noise modelling and the domain gap between synthetic and real datasets, resulting in recovered images with unclear textures, excessive noise, and diminished brightness. To address these challenges, we introduce a novel spike-to-image reconstruction framework SpikeCLIP that goes beyond traditional training paradigms. Leveraging the CLIP model’s powerful capability to align text and images, we incorporate the textual description of the captured scene and unpaired high-quality datasets as the supervision. Our experiments on real-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP significantly enhances texture details and the luminance balance of recovered images. Furthermore, the reconstructed images are well-aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.

[CV-23] A Histologic Dataset of Normal and Atypical Mitotic Figures on Human Breast Cancer (AMi-Br)

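[Quick Read]: This paper targets the study of atypical mitotic figures (AMFs) in histologic tumor sections, whose count has been reported as a possible independent prognostic criterion for breast cancer. AMFs indicate mutations in the genes that regulate the cell cycle and can lead to an aberrant chromosome constitution (aneuploidy) in tumor cells. To facilitate pattern-recognition research on the topic, the authors release AMi-Br, the first publicly available dataset of atypical and normal mitotic figures. The dataset combines two popular mitotic figure datasets (MIDOG 2021 and TUPAC), with every mitotic figure subclassified by a majority vote of three experts. It contains 3,720 mitotic figures: 832 AMFs (22.4%) and 2,888 normal mitotic figures (77.6%). The authors also provide baseline classification experiments that check the consistency of the dataset, using Monte Carlo cross-validation and several strategies against class imbalance. The averaged balanced accuracy reaches up to 0.806 with a patch-level dataset split and up to 0.713 with a patient-level split.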

Link: https://arxiv.org/abs/2501.04467
Authors: Christof A. Bertram, Viktoria Weiss, Taryn A. Donovan, Sweta Banerjee, Thomas Conrad, Jonas Ammeling, Robert Klopfleisch, Christopher Kaltenecker, Marc Aubreville
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments:

Abstract:Assessment of the density of mitotic figures (MFs) in histologic tumor sections is an important prognostic marker for many tumor types, including breast cancer. Recently, it has been reported in multiple works that the quantity of MFs with an atypical morphology (atypical MFs, AMFs) might be an independent prognostic criterion for breast cancer. AMFs are an indicator of mutations in the genes regulating the cell cycle and can lead to aberrant chromosome constitution (aneuploidy) of the tumor cells. To facilitate further research on this topic using pattern recognition, we present the first ever publicly available dataset of atypical and normal MFs (AMi-Br). For this, we utilized two of the most popular MF datasets (MIDOG 2021 and TUPAC) and subclassified all MFs using a three expert majority vote. Our final dataset consists of 3,720 MFs, split into 832 AMFs (22.4%) and 2,888 normal MFs (77.6%) across all 223 tumor cases in the combined set. We provide baseline classification experiments to investigate the consistency of the dataset, using a Monte Carlo cross-validation and different strategies to combat class imbalance. We found an averaged balanced accuracy of up to 0.806 when using a patch-level data set split, and up to 0.713 when using a patient-level split.
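
The two evaluation ideas in this abstract, balanced accuracy and the difference between a patch-level and a patient-level split, can be illustrated with a small sketch. This is not the authors' code: the function names, the 20% test fraction, and the seed are assumptions chosen for illustration, and only the Python standard library is used.

```python
from collections import defaultdict
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare and common classes count equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Return one is-test flag per patch such that no patient spans both sets.

    A patch-level split would instead shuffle patches directly, which lets
    patches from the same patient land in both training and test data.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    return [pid in test_patients for pid in patient_ids]
```

With a class ratio like the one reported (22.4% atypical), a classifier that always predicts "normal" would score well on plain accuracy but only 0.5 on balanced accuracy, which is presumably why the latter is reported. The gap between the patch-level and patient-level scores is consistent with patches from one patient being easier to generalize across than patches from unseen patients.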

[CV-24] A novel Facial Recognition technique with Focusing on Masked Faces

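[Quick Read]: This paper addresses recognizing the same face with and without a mask, which matters for security, access control, and public safety. Traditional face recognition systems lose considerable accuracy on masked faces, so methods that identify individuals reliably under these conditions are needed. The key to the solution is the Masked-Unmasked Face Matching Model (MUFM), which applies transfer learning with the Visual Geometry Group (VGG16) model to extract salient facial features and classifies them with the K-Nearest Neighbors (K-NN) algorithm. The model uses the cosine similarity metric to compare masked and unmasked face images of the same person. The authors present the use of cosine similarity for matching the same individual with and without a mask as the novel contribution, addressing a limitation of traditional systems in this scenario.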

Link: https://arxiv.org/abs/2501.04444
Authors: Dana A Abdullah, Dana Rasul Hamad, Hakem Beitollahi, Ismail Y Maolood, Abdulhady Abas Abdullah, Aso Khaleel Ameen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recognizing the same faces with and without masks is important for ensuring consistent identification in security, access control, and public safety. This capability is crucial in scenarios like law enforcement, healthcare, and surveillance, where accurate recognition must be maintained despite facial occlusion. This research focuses on the challenge of recognizing the same faces with and without masks by employing cosine similarity as the primary technique. With the increased use of masks, traditional facial recognition systems face significant accuracy issues, making it crucial to develop methods that can reliably identify individuals in masked conditions. For that reason, this study proposed Masked-Unmasked Face Matching Model (MUFM). This model employs transfer learning using the Visual Geometry Group (VGG16) model to extract significant facial features, which are subsequently classified utilizing the K-Nearest Neighbors (K-NN) algorithm. The cosine similarity metric is employed to compare masked and unmasked faces of the same individuals. This approach represents a novel contribution, as the task of recognizing the same individual with and without a mask using cosine similarity has not been previously addressed. By integrating these advanced methodologies, the research demonstrates effective identification of individuals despite the presence of masks, addressing a significant limitation in traditional systems. Using data is another essential part of this work, by collecting and preparing an image dataset from three different sources especially some of those data are real provided a comprehensive power of this research. The image dataset used were already collected in three different datasets of masked and unmasked for the same faces.
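
The matching step described above, comparing feature vectors by cosine similarity and picking the nearest identity, can be sketched as follows. This is a minimal illustration and not the MUFM implementation: the feature vectors are assumed to come from some extractor such as VGG16, and `match_identity` and the gallery layout are hypothetical names introduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_a * norm_b)


def match_identity(query, gallery):
    """1-nearest-neighbour lookup under cosine similarity.

    gallery maps an identity name to the feature vector of an unmasked face;
    query is the feature vector of a masked face (both hypothetical inputs).
    """
    best = max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))
    return best, cosine_similarity(query, gallery[best])
```

Cosine similarity ignores vector magnitude, so two embeddings that point in the same direction match even if a mask lowers the overall activation strength; whether that holds for real VGG16 features is an empirical question the paper's experiments would have to answer.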

[CV-25] RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark

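[Quick Read]: This paper addresses the slow progress of rotated object detection in the Synthetic Aperture Radar (SAR) field, which stems mainly from the lack of a large-scale annotated dataset. Existing weakly supervised models predict object angles with limited accuracy, which limits their usefulness for generating pseudo-rotated boxes. The key to the solution is the Unit Cycle Resolver (UCR), which adds a unit circle constraint loss to improve angle prediction accuracy. UCR improves existing weakly supervised methods and even surpasses fully supervised models on an optical remote sensing benchmark (DOTA-v1.0). With the help of UCR, the authors also annotate and release RSAR, the largest multi-class rotated SAR object detection dataset to date. Experiments show that UCR improves angle prediction accuracy.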

Link: https://arxiv.org/abs/2501.04440
Authors: Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, Xiang Li
Affiliations: VCIP, CS, Nankai University; Shanghai AI Laboratory; NKIARI, Shenzhen Futian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rotated object detection has made significant progress in the optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field are laggard behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, the existing weakly supervised models exhibit limited accuracy in predicting the object’s angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and expose that they share the same shortcomings: these methods overlook the unit cycle constraint inherent in these encodings, easily leading to prediction biases. To address this issue, we propose the Unit Cycle Resolver, which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at: this https URL.
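
The idea of constraining a (cosine, sine) angle encoding to the unit circle can be sketched as below. The abstract does not give the exact loss, so the penalty form `|c^2 + s^2 - 1|`, the L1 encoding error, and the `weight` parameter are all assumptions made for illustration, not the formulation used in UCR.

```python
import math


def decode_angle(c, s):
    """Recover an angle in (-pi, pi] from a predicted (cos, sin) pair."""
    return math.atan2(s, c)


def unit_circle_penalty(c, s):
    """Assumed penalty: zero exactly when (c, s) lies on the unit circle."""
    return abs(c * c + s * s - 1.0)


def angle_loss(pred, target_angle, weight=1.0):
    """Encoding error plus the unit-circle penalty (hypothetical combination)."""
    c, s = pred
    encoding_error = abs(c - math.cos(target_angle)) + abs(s - math.sin(target_angle))
    return encoding_error + weight * unit_circle_penalty(c, s)
```

Without such a term, a network is free to output pairs like (2, 0) or (0.1, 0.1) that decode to a valid angle through `atan2` but do not correspond to any true cosine and sine pair, which is one plausible reading of the prediction bias the abstract mentions.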

[CV-26] iFADIT: Invertible Face Anonymization via Disentangled Identity Transform

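[Quick Read]: This paper addresses the shortcomings of traditional face anonymization methods, such as blurring and pixelation, in image quality and security. These methods largely remove identifying features but significantly degrade image quality and are vulnerable to deep reconstruction attacks. The paper proposes a new framework named iFADIT (Invertible Face Anonymization via Disentangled Identity Transformation). Its core is a disentanglement architecture coupled with a flow-based model: the former separates identity information from non-identifying attributes, and the latter transforms the decoupled identity into an anonymized version in an invertible manner controlled by a secret key. The anonymized face is then reconstructed with a pre-trained StyleGAN, which preserves high image quality and realistic facial details. A dedicated secret-key mechanism and a dual-phase training strategy are designed to ensure invertibility, security, and diversity. Experiments show that the method outperforms competing methods in anonymity, reversibility, security, diversity, and interpretability.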

Link: https://arxiv.org/abs/2501.04390
Authors: Lin Yuan, Kai Liang, Xiong Li, Tao Wu, Nannan Wang, Xinbo Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face anonymization aims to conceal the visual identity of a face to safeguard the individual’s privacy. Traditional methods like blurring and pixelation can largely remove identifying features, but these techniques significantly degrade image quality and are vulnerable to deep reconstruction attacks. Generative models have emerged as a promising solution for anonymizing faces while preserving a natural appearance. However, many still face limitations in visual quality and often overlook the potential to recover the original face from the anonymized version, which can be valuable in specific contexts such as image forensics. This paper proposes a novel framework named iFADIT, an acronym for Invertible Face Anonymization via Disentangled Identity Transformation. The framework features a disentanglement architecture coupled with a secure flow-based model: the former decouples identity information from non-identifying attributes, while the latter transforms the decoupled identity into an anonymized version in an invertible manner controlled by a secret key. The anonymized face can then be reconstructed based on a pre-trained StyleGAN that ensures high image quality and realistic facial details. Recovery of the original face (aka de-anonymization) is possible upon the availability of the matching secret, by inverting the anonymization process based on the same set of model parameters. Furthermore, a dedicated secret-key mechanism along with a dual-phase training strategy is devised to ensure the desired properties of face anonymization. Qualitative and quantitative experiments demonstrate the superiority of the proposed approach in anonymity, reversibility, security, diversity, and interpretability over competing methods.

[CV-27] On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

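[Quick Read]: This paper addresses the computational inefficiency of Visual Autoregressive (VAR) models for image generation. The state-of-the-art VAR algorithm from [Tian et al., NeurIPS 2024] takes O(n^4) time. Using fine-grained complexity analysis, the paper studies the computational limits and efficiency criteria of VAR models, and its key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. In particular, it establishes a critical threshold on the norm of the input matrices used in VAR attention: above this threshold, assuming the Strong Exponential Time Hypothesis (SETH), no sub-quartic time algorithm exists. To support the theory, the paper presents efficient constructions based on low-rank approximations that satisfy the derived criteria. The work initiates the theoretical study of the computational efficiency of VAR models and offers insight for scalable and efficient image generation in VAR frameworks.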

Link: https://arxiv.org/abs/2501.04377
Authors: Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Affiliations: The University of Hong Kong; University of Wisconsin-Madison; Tsinghua University; The Simons Institute for the Theory of Computing at UC Berkeley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Visual Autoregressive (VAR) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine “next-scale prediction” paradigm. However, the state-of-the-art algorithm of VAR models in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes O(n^4) time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of VAR Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. Specifically, we establish a critical threshold for the norm of input matrices used in VAR attention mechanisms. Above this threshold, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, a sub-quartic time algorithm for VAR models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the VAR model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in VAR frameworks.

[CV-28] Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing

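[Quick Read]: This paper addresses the generalization problem in deepfake detection. Most previous work attributes the generalization gap to differences among forgery methods, but the paper finds that the problem persists even when forgery-irrelevant factors shift. It identifies two biases that detectors tend to overfit: position bias and content bias. Position bias means the detector leans on specific positions within an image (such as the central region); content bias means the detector may mistakenly use forgery-unrelated information (such as background and hair). To counter these, the paper proposes two interventions in the latent space of transformers: a shuffling branch and a mixing branch. The shuffling branch rearranges the tokens of each image together with their position embeddings, breaking position bias while maintaining local correlation. The mixing branch randomly selects and mixes tokens between two images with the same label in a mini-batch to recombine content information. During training, the outputs of the branches are aligned in both feature space and logit space, with contrastive losses on features and divergence losses on logits, to obtain unbiased feature representations and classifiers. Experiments on widely used evaluation datasets verify the method's effectiveness.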

Link: https://arxiv.org/abs/2501.04376
Authors: Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, Xi Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The generalization problem is broadly recognized as a critical challenge in detecting deepfakes. Most previous work believes that the generalization gap is caused by the differences among various forgery methods. However, our investigation reveals that the generalization issue can still occur when forgery-irrelevant factors shift. In this work, we identify two biases that detectors may also be prone to overfitting: position bias and content bias, as depicted in Fig. 1. For the position bias, we observe that detectors are prone to lazily depending on the specific positions within an image (e.g., central regions even no forgery). As for content bias, we argue that detectors may potentially and mistakenly utilize forgery-unrelated information for detection (e.g., background, and hair). To intervene these biases, we propose two branches for shuffling and mixing with tokens in the latent space of transformers. For the shuffling branch, we rearrange the tokens and corresponding position embedding for each image while maintaining the local correlation. For the mixing branch, we randomly select and mix the tokens in the latent space between two images with the same label within the mini-batch to recombine the content information. During the learning process, we align the outputs of detectors from different branches in both feature space and logit space. Contrastive losses for features and divergence losses for logits are applied to obtain unbiased feature representation and classifiers. We demonstrate and verify the effectiveness of our method through extensive experiments on widely used evaluation datasets.
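
The shuffling and mixing branches can be pictured with a toy sketch on plain Python lists. The abstract does not say how local correlation is maintained or how many tokens are mixed, so the block-wise permutation, the `block_size` and `mix_ratio` parameters, and the function names are assumptions for illustration only.

```python
import random


def shuffle_tokens(tokens, positions, block_size, rng):
    """Permute contiguous blocks of tokens; each token keeps its own position.

    Moving whole blocks (an assumed way to keep local correlation) changes
    where content sits in the sequence without separating neighbours.
    """
    assert len(tokens) == len(positions)
    blocks = [list(range(i, min(i + block_size, len(tokens))))
              for i in range(0, len(tokens), block_size)]
    rng.shuffle(blocks)
    order = [i for block in blocks for i in block]
    return [tokens[i] for i in order], [positions[i] for i in order]


def mix_tokens(tokens_a, tokens_b, mix_ratio, rng):
    """Replace a random subset of A's tokens with B's tokens at the same index.

    A and B are assumed to be two images with the same label, so the mixed
    sequence keeps its label while its content is recombined.
    """
    assert len(tokens_a) == len(tokens_b)
    n_swap = int(len(tokens_a) * mix_ratio)
    swap = set(rng.sample(range(len(tokens_a)), n_swap))
    return [tokens_b[i] if i in swap else tokens_a[i]
            for i in range(len(tokens_a))]
```

In a real detector the tokens would be latent vectors inside a transformer and the position entries would be position embeddings; passing an explicit `random.Random` instance keeps the sketch reproducible.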

[CV-29] Instructive3D: Editing Large Reconstruction Models with Text Instructions WACV2025

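[Quick Read]: This paper addresses the lack of fine-grained control when generating high-quality 3D models with Large Reconstruction Models (LRMs). Existing LRMs can generate a 3D model from a single object image but cannot modify or edit finer details, such as adding standard design patterns or changing color and reflectance, which limits their use in augmented reality, animation, and gaming. The paper proposes Instructive3D, whose key innovation is integrating the generation and fine-grained editing of 3D objects into a single model driven by user text prompts. The core of the solution is an adapter that performs a diffusion process, conditioned on the text prompt, in the triplane latent space representation of the 3D object, which yields geometrically consistent modifications without requiring edited 3D objects to be generated. This improves the versatility and precision of the generated 3D objects and avoids the high computational cost of producing precisely edited image and 3D object pairs.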

Link: https://arxiv.org/abs/2501.04374
Authors: Kunal Kathare, Ankit Dhiman, K Vikas Gowda, Siddharth Aravindan, Shubham Monga, Basavaraja Shanthappa Vandrotti, Lokesh R Boregowda
Affiliations: Samsung R&D Institute India - Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025. First two authors contributed equally

Abstract:Transformer based methods have enabled users to create, modify, and comprehend text and image data. Recently proposed Large Reconstruction Models (LRMs) further extend this by providing the ability to generate high-quality 3D models with the help of a single object image. These models, however, lack the ability to manipulate or edit the finer details, such as adding standard design patterns or changing the color and reflectance of the generated objects, thus lacking fine-grained control that may be very helpful in domains such as augmented reality, animation and gaming. Naively training LRMs for this purpose would require generating precisely edited images and 3D object pairs, which is computationally expensive. In this paper, we propose Instructive3D, a novel LRM based model that integrates generation and fine-grained editing, through user text prompts, of 3D objects into a single model. We accomplish this by adding an adapter that performs a diffusion process conditioned on a text prompt specifying edits in the triplane latent space representation of 3D object models. Our method does not require the generation of edited 3D objects. Additionally, Instructive3D allows us to perform geometrically consistent modifications, as the edits done through user-defined text prompts are applied to the triplane latent representation thus enhancing the versatility and precision of 3D objects generated. We compare the objects generated by Instructive3D and a baseline that first generates the 3D object meshes using a standard LRM model and then edits these 3D objects using text prompts when images are provided from the Objaverse LVIS dataset. We find that Instructive3D produces qualitatively superior 3D objects with the properties specified by the edit prompts.

[CV-30] FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection ICASSP2025

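[Quick Read]: This paper addresses the dimension mismatch in multimodal 3D object detection caused by coarse fusion of 3D points with 2D pixels, with the goal of improving fusion performance. It proposes a multimodal framework named FGU3R with two key components. First, an efficient feature extractor for raw and pseudo points, Pseudo-Raw Convolution (PRConv), modulates multimodal features synchronously and aggregates features from different types of points on key points based on multimodal interaction. Second, a Cross-Attention Adaptive Fusion (CAAF) mechanism fuses homogeneous 3D Region of Interest (RoI) features adaptively and at fine granularity through a cross-attention variant. Together, the two components achieve fine-grained fusion on a unified 3D representation. Experiments on the KITTI and nuScenes datasets validate the method.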

Link: https://arxiv.org/abs/2501.04373
Authors: Guoxin Zhang, Ziying Song, Lin Liu, Zhonghong Ou
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; School of Computer Science and Technology, Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025

Abstract:Multimodal 3D object detection has garnered considerable interest in autonomous driving. However, multimodal detectors suffer from dimension mismatches that derive from fusing 3D points with 2D pixels coarsely, which leads to sub-optimal fusion performance. In this paper, we propose a multimodal framework FGU3R to tackle the issue mentioned above via unified 3D representation and fine-grained fusion, which consists of two important components. First, we propose an efficient feature extractor for raw and pseudo points, termed Pseudo-Raw Convolution (PRConv), which modulates multimodal features synchronously and aggregates the features from different types of points on key points based on multimodal interaction. Second, a Cross-Attention Adaptive Fusion (CAAF) is designed to fuse homogeneous 3D RoI (Region of Interest) features adaptively via a cross-attention variant in a fine-grained manner. Together they make fine-grained fusion on unified 3D representation. The experiments conducted on the KITTI and nuScenes show the effectiveness of our proposed method.

[CV-31] DeFusion: An Effective Decoupling Fusion Network for Multi-Modal Pregnancy Prediction

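[Quick Read]: This paper addresses how to integrate two modalities, temporal embryo images and parental fertility table indicators, to improve pregnancy prediction in in vitro fertilization embryo transfer (IVF-ET). Current machine learning models cannot make full use of the complementary information between the two modalities. The proposed solution is a Decoupling Fusion Network called DeFusion, whose key element is a decoupling fusion module that separates the information from the different modalities into related and unrelated information, enabling a finer fusion. The model fuses temporal embryo images with a spatial-temporal position encoding and extracts fertility table indicator information with a table transformer. Experiments show that the model outperforms existing methods on IVF-ET pregnancy prediction and generalizes well to an eye disease prediction dataset.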

Link: https://arxiv.org/abs/2501.04353
Authors: Xueqiang Ouyang, Jia Wei, Wenjie Huo, Xiaocong Wang, Rui Li, Jianlong Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Temporal embryo images and parental fertility table indicators are both valuable for pregnancy prediction in in vitro fertilization embryo transfer (IVF-ET). However, current machine learning models cannot make full use of the complementary information between the two modalities to improve pregnancy prediction performance. In this paper, we propose a Decoupling Fusion Network called DeFusion to effectively integrate the multi-modal information for IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion module that decouples the information from the different modalities into related and unrelated information, thereby achieving a more delicate fusion. And we fuse temporal embryo images with a spatial-temporal position encoding, and extract fertility table indicator information with a table transformer. To evaluate the effectiveness of our model, we use a new dataset including 4046 cases collected from Southern Medical University. The experiments show that our model outperforms state-of-the-art methods. Meanwhile, the performance on the eye disease prediction dataset reflects the model’s good generalization. Our code and dataset are available at this https URL.

[CV-32] Online Gaussian Test-Time Adaptation of Vision-Language Models

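[Quick Read]: This paper addresses online test-time adaptation (OTTA) of vision-language models (VLMs), where existing methods rely on dataset-specific hyperparameters, limiting their adaptability to unseen tasks. It proposes Online Gaussian Adaptation (OGA), which models the likelihoods of visual features with Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework, with hyperparameters fixed across all datasets. Experiments show that OGA outperforms state-of-the-art methods on most datasets and runs. The paper also shows that combining OTTA with popular few-shot techniques, a practical setting overlooked by prior research, is highly beneficial. Finally, it argues that current OTTA evaluation protocols are inadequate and recommends more runs and additional quantitative metrics, such as the proposed Expected Tail Accuracy (ETA).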

Link: https://arxiv.org/abs/2501.04352
Authors: Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer
Affiliations: UCLouvain, Belgium; UMons, Belgium
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyper-parameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at this https URL .
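
The abstract defines Expected Tail Accuracy as the average accuracy over the worst 10% of runs, which is simple enough to sketch directly. The function name, the rounding rule for the tail size, and the choice to keep at least one run are assumptions; the paper may handle these details differently.

```python
def expected_tail_accuracy(run_accuracies, tail_fraction=0.10):
    """Average accuracy over the worst tail_fraction of runs.

    The tail size is rounded to the nearest integer and is at least 1, so
    the metric stays defined for small numbers of runs (an assumed choice).
    """
    if not run_accuracies:
        raise ValueError("at least one run is required")
    n_tail = max(1, round(len(run_accuracies) * tail_fraction))
    worst = sorted(run_accuracies)[:n_tail]
    return sum(worst) / n_tail
```

With only three runs the tail is a single run, so the metric reduces to the minimum; this is one way to see why the abstract argues that averaging over at most three runs hides the variability that ETA is meant to expose.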

[CV-33] Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

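[Quick Read]: This paper addresses long-form video understanding, where limited context windows make it hard to analyze key moments that are temporally dispersed yet spatially concentrated. The key to the solution is VideoMindPalace, a framework inspired by the “Mind Palace” that organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping. This structure lets large language models (LLMs) parse it as natural language and provide insights grounded in spatio-temporal and 3D context. The paper also proposes the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, VideoMindPalace shows notable gains in spatio-temporal coherence and human-aligned reasoning, advancing the long-form video analysis capabilities of vision language models (VLMs).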

Link: https://arxiv.org/abs/2501.04336
Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu
Affiliations: University of Wisconsin-Madison; Meta; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the “Mind Palace”, which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

[CV-34] An Efficient Adaptive Compression Method for Human Perception and Machine Vision Tasks

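[Quick Read]: This paper addresses the weak performance of existing neural image compression (NIC) and neural video compression (NVC) methods on machine vision tasks. Existing compression methods are optimized mainly for human visual perception, but with the rapid development of artificial intelligence many images and videos will be consumed by machine vision tasks, where these methods are not competitive.

The key to the solution is an efficient adaptive compression (EAC) method with two core modules: 1) an adaptive compression mechanism that adaptively selects several subsets of the latent features to balance the optimization for multiple machine vision tasks (such as segmentation and detection) and human vision; 2) a task-specific adapter that uses a parameter-efficient delta-tuning strategy to stimulate the downstream analytical networks for specific machine vision tasks. With these two modules, EAC optimizes bit-rate cost while improving machine vision performance. The method integrates with existing NIC methods (such as Ballé2018 and Cheng2020) and NVC methods (such as DVC and FVC), and its effectiveness is validated on several benchmark datasets (VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS).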

Link: https://arxiv.org/abs/2501.04329
Authors: Lei Liu, Zhenghao Chen, Zhihao Hu, Dong Xu
Affiliations: School of Computer Science and Engineering, Beihang University; School of Information and Physical Sciences, University of Newcastle; Department of Computer Science, University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While most existing neural image compression (NIC) and neural video compression (NVC) methodologies have achieved remarkable success, their optimization is primarily focused on human visual perception. However, with the rapid development of artificial intelligence, many images and videos will be used for various machine vision tasks. Consequently, such existing compression methodologies cannot achieve competitive performance in machine vision. In this work, we introduce an efficient adaptive compression (EAC) method tailored for both human perception and multiple machine vision tasks. Our method involves two key modules: 1), an adaptive compression mechanism, that adaptively selects several subsets from latent features to balance the optimizations for multiple machine vision tasks (e.g., segmentation, and detection) and human vision. 2), a task-specific adapter, that uses the parameter-efficient delta-tuning strategy to stimulate the comprehensive downstream analytical networks for specific machine vision tasks. By using the above two modules, we can optimize the bit-rate costs and improve machine vision performance. In general, our proposed EAC can seamlessly integrate with existing NIC (i.e., Ballé2018, and Cheng2020) and NVC (i.e., DVC, and FVC) methods. Extensive evaluation on various benchmark datasets (i.e., VOC2007, ILSVRC2012, VOC2012, COCO, UCF101, and DAVIS) shows that our method enhances performance for multiple machine vision tasks while maintaining the quality of human vision.

[CV-35] Edit as You See: Image-guided Video Editing via Masked Motion Modeling

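[Quick Read]: This paper addresses image-guided video editing, in which the user only indicates a target object in the initial frame and provides a reference RGB image, without relying on text prompts. Existing research concentrates on text-guided video editing, and image-guided editing is comparatively under-explored. The paper proposes an Image-guided Video Editing Diffusion model (IVEDiff), built on top of image editing models and equipped with learnable motion modules that maintain the temporal consistency of the edited video. The key elements are: 1) a masked motion modeling fine-tuning strategy, inspired by self-supervised learning, that strengthens the motion modules' ability to capture inter-frame motion dynamics while preserving the base image editing model's ability to model intra-frame semantic correlations; 2) an optical-flow-guided motion reference network that ensures accurate propagation of information between edited video frames and reduces the misleading effect of invalid information. With these, IVEDiff generates temporally smooth, high-quality edited videos and handles a variety of editing objects robustly.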

Link: https://arxiv.org/abs/2501.04325
Authors: Zhi-Lin Huang, Yixuan Liu, Chujun Qin, Zhongdao Wang, Dong Zhou, Dong Li, Emad Barsoum
Affiliations: AMD; Peking University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in diffusion models have significantly facilitated text-guided video editing. However, there is a relative scarcity of research on image-guided video editing, a method that empowers users to edit videos by merely indicating a target object in the initial frame and providing an RGB image as reference, without relying on the text prompts. In this paper, we propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff for the image-guided video editing. IVEDiff is built on top of image editing models, and is equipped with learnable motion modules to maintain the temporal consistency of edited video. Inspired by self-supervised learning concepts, we introduce a masked motion modeling fine-tuning strategy that empowers the motion module’s capabilities for capturing inter-frame motion dynamics, while preserving the capabilities for intra-frame semantic correlations modeling of the base image editing model. Moreover, an optical-flow-guided motion reference network is proposed to ensure the accurate propagation of information between edited video frames, alleviating the misleading effects of invalid information. We also construct a benchmark to facilitate further research. The comprehensive experiments demonstrate that our method is able to generate temporally smooth edited videos while robustly dealing with various editing objects with high quality.

[CV-36] Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

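[Quick Read]: This paper addresses the challenge of running multimodal vision language models (VLMs) on edge devices, in particular how to strengthen multimodal capability while preserving linguistic capability, without extensive training or sacrificing model performance. It proposes a framework named Efficient Vision Language Models with Elastic Visual Experts (Eve). The key idea is to incorporate adaptable visual experts at multiple stages of training, balancing the preservation of linguistic abilities against the augmentation of multimodal capabilities. With only 1.8B parameters, Eve shows significant improvements on both multimodal and language tasks. Among configurations below 3B parameters, Eve performs strongly on language benchmarks and reaches a state-of-the-art 68.87% on VLM benchmarks, and its multimodal accuracy exceeds that of the larger 7B LLaVA-1.5 model.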

Link: https://arxiv.org/abs/2501.04322
Authors: Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms in language benchmarks and achieves state-of-the-art results of 68.87% in VLM Benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.

[CV-37] DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models KR

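[Quick Read]: This paper addresses the loss of image quality and text-image alignment when text-to-image diffusion models are quantized to low bit widths. Existing quantization methods, which compress weights and activations into lower-bit formats, struggle to preserve both high-quality generation and accurate text-image alignment, especially below 8 bits. Analyzing the problem from a distributional perspective, the paper finds that activation outliers play a crucial role in image quality and that distinctive patterns in cross-attention scores significantly affect text-image alignment. It proposes Distribution-aware Group Quantization (DGQ), which identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality and applies prompt-specific logarithmic quantization scales to maintain text-image alignment. DGQ performs well on datasets such as MS-COCO and PartiPrompts, and the authors report the first successful low-bit quantization of text-to-image diffusion models without additional fine-tuning of weight quantization parameters.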

链接: https://arxiv.org/abs/2501.04304
作者: Hyogon Ryu,NaHyeon Park,Hyunjung Shim
机构: Korea Advanced Institute of Science and Technology (KAIST) (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit (< 8 bits) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters.
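
A minimal sketch of why a logarithmic scale can suit cross-attention scores, which are softmax outputs in (0, 1] and tend to cluster near zero. It shows generic log2 quantization only, not DGQ's prompt-specific scales; the function names `uniform_quantize` and `log2_quantize`, the 4-bit width, and the sample scores are assumptions made for illustration.

```python
import math

def uniform_quantize(p, bits=4):
    """Uniform quantization of a value in [0, 1] onto 2**bits - 1 equal steps."""
    levels = 2 ** bits - 1
    return round(p * levels) / levels

def log2_quantize(p, bits=4):
    """Log2 quantization: store round(-log2(p)) as the code, decode as 2**-code."""
    max_code = 2 ** bits - 1
    if p <= 0.0:
        return 0.0
    code = min(max_code, max(0, round(-math.log2(p))))
    return 2.0 ** (-code)

# Hypothetical attention scores, mostly small, as softmax outputs often are.
scores = [0.6, 0.25, 0.1, 0.03, 0.015, 0.005]
for p in scores:
    print(p, uniform_quantize(p), log2_quantize(p))
```

With 4 bits, the uniform grid collapses the smallest scores to zero, while the logarithmic grid still separates them, at the cost of coarser resolution near 1.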

[CV-38] H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在自动驾驶视频理解中的泛化能力不足问题。由于自动驾驶场景中的视频通常包含复杂的时空运动，现有的MLLMs在处理这些动态场景时表现受限。为解决这一问题，论文提出了一种新颖的分层Mamba适应框架（Hierarchical Mamba Adaptation, H-MBA）。该框架的关键在于其包含两个模块：上下文Mamba（Context Mamba, C-Mamba）和查询Mamba（Query Mamba, Q-Mamba）。C-Mamba通过多种结构状态空间模型有效捕捉不同时间分辨率下的多粒度视频上下文，而Q-Mamba则将当前帧灵活转换为可学习的查询，并选择性地将多粒度视频上下文整合到查询中。通过这种自适应整合多尺度时间分辨率的视频上下文，H-MBA显著提升了自动驾驶视频理解任务的性能，例如在风险物体检测任务中，其mIoU（平均交并比）比现有最优方法提升了5.5%。

链接: https://arxiv.org/abs/2501.04302
作者: Siran Chen,Yuxiao Luo,Yue Ma,Yu Qiao,Yali Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 4 figures

点击查看摘要

Abstract:With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical to interactively analyze what will happen in the procedure of autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of the existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, including Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context for different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into the learnable query, and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving, e.g., for risk object detection, it outperforms the previous SOTA method with a 5.5% mIoU improvement.

[CV-39] TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning

【速读】：该论文试图解决在多任务学习（MTL）场景中，传统全微调（full fine-tuning）方法因模型规模增大而导致的计算资源消耗过高的问题。特别是在多任务学习中，训练复杂度随任务数量增加而显著上升，使得全微调在实际应用中变得不切实际。为此，论文提出了一种名为TADFormer（Task-Adaptive Dynamic transFormer）的参数高效微调（PEFT）框架，旨在通过动态考虑任务特定的输入上下文，实现细粒度的任务感知特征适应。TADFormer的关键创新在于引入了参数高效的任务适应提示（parameter-efficient prompting）和动态任务过滤器（Dynamic Task Filter, DTF），以捕捉基于输入上下文的任务信息。实验结果表明，TADFormer在密集场景理解任务中实现了更高的准确性，同时将可训练参数数量减少了最多8.4倍，且在参数效率和准确性上优于现有的PEFT方法。

链接: https://arxiv.org/abs/2501.04293
作者: Seungmin Baek,Soyul Lee,Hayeon Jo,Hyesong Choi,Dongbo Min
机构: Ewha Womans University (梨花女子大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The transfer learning paradigm has driven substantial advancements in various vision tasks. However, as state-of-the-art models continue to grow, classical full fine-tuning often becomes computationally impractical, particularly in the multi-task learning (MTL) setup, where training complexity increases in proportion to the number of tasks. Consequently, recent studies have explored Parameter-Efficient Fine-Tuning (PEFT) for MTL architectures. Despite some progress, these approaches still exhibit limitations in capturing fine-grained, task-specific features that are crucial to MTL. In this paper, we introduce Task-Adaptive Dynamic transFormer, termed TADFormer, a novel PEFT framework that performs task-aware feature adaptation in a fine-grained manner by dynamically considering task-specific input contexts. TADFormer proposes parameter-efficient prompting for task adaptation and the Dynamic Task Filter (DTF) to capture task information conditioned on input contexts. Experiments on the PASCAL-Context benchmark demonstrate that the proposed method achieves higher accuracy in dense scene understanding tasks, while reducing the number of trainable parameters by up to 8.4 times when compared to full fine-tuning of MTL models. TADFormer also demonstrates superior parameter efficiency and accuracy compared to recent PEFT methods.

[CV-40] ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning

【速读】：该论文试图解决压缩感知 MRI（Compressed Sensing MRI）在加速 MRI 采集过程中，如何通过减少 k 空间测量并算法重建缺失数据的问题。传统方法通常依赖于强先验或学习到的统计模型，但往往忽略了临床可用的元数据（metadata），如患者人口统计信息、成像参数和切片特定信息。这些元数据包含有关解剖结构和采集协议的有意义线索，可能进一步约束重建问题。

解决方案的关键在于提出了一种名为 ContextMRI 的文本条件扩散模型（text-conditioned diffusion model），该模型将细粒度的元数据整合到重建过程中。具体而言，模型直接在最小处理的复值 MRI 图像上训练像素空间扩散模型（pixel-space diffusion model），并在推理过程中将元数据转换为结构化文本提示，通过 CLIP 文本嵌入（CLIP text embeddings）输入模型。通过将先验条件与元数据结合，该方法实现了更精确的重建，并在多个数据集、加速因子和欠采样模式中表现出一致的性能提升。实验表明，提高元数据的保真度（如切片位置、对比度、患者年龄、性别和病理信息）能够系统地提升重建性能。该研究揭示了利用临床上下文解决逆问题的潜力，并为元数据驱动的 MRI 重建开辟了新的方向。

链接: https://arxiv.org/abs/2501.04284
作者: Hyungjin Chung,Dohun Lee,Zihui Wu,Byung-Hoon Kim,Katherine L. Bouman,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 pages, 9 figures

点击查看摘要

Abstract:Compressed sensing MRI seeks to accelerate MRI acquisition processes by sampling fewer k-space measurements and then reconstructing the missing data algorithmically. The success of these approaches often relies on strong priors or learned statistical models. While recent diffusion model-based priors have shown great potential, previous methods typically ignore clinically available metadata (e.g. patient demographics, imaging parameters, slice-specific information). In practice, metadata contains meaningful cues about the anatomy and acquisition protocol, suggesting it could further constrain the reconstruction problem. In this work, we propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process. We train a pixel-space diffusion model directly on minimally processed, complex-valued MRI images. During inference, metadata is converted into a structured text prompt and fed to the model via CLIP text embeddings. By conditioning the prior on metadata, we unlock more accurate reconstructions and show consistent gains across multiple datasets, acceleration factors, and undersampling patterns. Our experiments demonstrate that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance. This work highlights the untapped potential of leveraging clinical context for inverse problems and opens a new direction for metadata-driven MRI reconstruction.
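
A minimal sketch of the step in which metadata is converted into a structured text prompt before being embedded. The field names, their order, and the `key: value` layout are hypothetical and are not taken from the paper; the CLIP embedding step is deliberately left out.

```python
def metadata_to_prompt(meta):
    """Build a structured prompt from whichever metadata fields are present.

    The field list below is an assumption for illustration only.
    """
    order = ["anatomy", "contrast", "slice", "age", "sex", "pathology"]
    parts = [f"{key}: {meta[key]}" for key in order if meta.get(key) is not None]
    return ", ".join(parts)

# Coarse metadata versus more detailed metadata for the same hypothetical scan.
coarse = {"anatomy": "brain", "contrast": "T2"}
detailed = {"anatomy": "brain", "contrast": "T2", "slice": 7, "age": 54, "sex": "F"}
print(metadata_to_prompt(coarse))
print(metadata_to_prompt(detailed))
```

Skipping missing fields keeps the prompt well-formed when only part of the metadata is available, which matches the abstract's comparison of reconstructions under increasingly detailed metadata.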

[CV-41] Enhancing Scene Classification in Cloudy Image Scenarios: A Collaborative Transfer Method with Information Regulation Mechanism using Optical Cloud-Covered and SAR Remote Sensing Images

【速读】：该论文试图解决在遥感场景分类中，由于云层污染导致的光学信息丢失和特征分布变化问题，这些问题影响了从源域（无云光学数据）到目标域（含云光学数据和合成孔径雷达数据）的模型迁移的可靠性和稳定性。解决方案的关键在于提出了一种多模态数据协同的迁移方法，该方法包括两个核心部分：(1) 基于知识蒸馏（knowledge distillation）的协同迁移策略，能够有效实现跨异构数据的先验知识迁移；(2) 信息调节机制（Information Regulation Mechanism, IRM），用于解决迁移过程中的模态不平衡问题。IRM通过辅助模型衡量每种模态的贡献差异，并在目标模型学习过程中自动平衡样本级别的模态信息利用。实验结果表明，该方法在云覆盖场景下的性能优于其他解决方案，并验证了IRM的重要性和局限性。

链接: https://arxiv.org/abs/2501.04283
作者: Yuze Wang,Rong Xiao,Haifeng Li,Mariana Belgiu,Chao Tao
机构: School of Geosciences and Info-Physics, Central South University, Changsha 410083, China(中南大学地球科学与信息物理学院); Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, Enschede, 7500 AE, The Netherlands(特温特大学地理信息科学与地球观测学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In remote sensing scene classification, leveraging transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using synthetic aperture radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloud-free optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities at the sample level during the target model learning process. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Codes are available at this https URL

[CV-42] Open set label noise learning with robust sample selection and margin-guided module

【速读】：该论文试图解决在深度神经网络（DNNs）训练过程中遇到的开放集标签噪声（open set label noise）问题。开放集标签噪声指的是训练数据中存在一些样本的真实类别标签位于已知标签空间之外，而传统的标签噪声处理方法主要针对封闭集标签噪声（closed set label noise），即噪声样本的真实类别标签仍在已知标签空间内。为了解决这一问题，论文提出了一种基于鲁棒样本选择和边界引导模块（Robust Sample Selection and Margin-Guided Module, RSS-MGM）的方法。该方法的关键在于：首先，通过结合小损失选择或高置信度样本选择，鲁棒样本选择模块能够获取更多的干净样本；其次，设计了边界函数来区分开放集和封闭集标签噪声；最后，针对不同类型的样本选择不同的处理方法，以充分利用数据的先验信息并优化整个模型。实验结果表明，该方法在处理开放集标签噪声时优于现有的多种标签噪声学习方法。

链接: https://arxiv.org/abs/2501.04269
作者: Yuandi Zhao,Qianxi Xia,Yang Sun,Zhijie Wen,Liyan Ma,Shihui Ying
机构: Civil Aviation University of China (中国民航大学); Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, the remarkable success of deep neural networks (DNNs) in computer vision is largely due to large-scale, high-quality labeled datasets. Training directly on real-world datasets with label noise may result in overfitting. Traditional methods are limited to dealing with closed set label noise, where noisy training data has true class labels within the known label space. However, there are some real-world datasets containing open set label noise, which means that some samples belong to an unknown class outside the known label space. To address the open set label noise problem, we introduce a method based on Robust Sample Selection and Margin-Guided Module (RSS-MGM). Firstly, unlike the prior clean sample selection approach, which only selects a limited number of clean samples, a robust sample selection module combines small loss selection or high-confidence sample selection to obtain more clean samples. Secondly, to efficiently distinguish open set label noise and closed set ones, margin functions are designed to filter open-set data and closed set data. Thirdly, different processing methods are selected for different types of samples in order to fully utilize the data’s prior information and optimize the whole model. Furthermore, extensive experimental results with noisy labeled data from benchmark datasets and real-world datasets, such as CIFAR-100N-C, CIFAR80N-O, WebFG-469, and Food101N, indicate that our approach outperforms many state-of-the-art label noise learning methods. In particular, it can more accurately divide open set label noise samples and closed set ones.
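
A minimal sketch of the selection idea described above: a sample counts as clean if it has a small loss or a high label confidence, and the remaining noisy samples are split into closed-set and open-set groups. The margin rule here (the gap between the top two predicted probabilities) is a hypothetical stand-in, since the abstract does not specify the paper's margin functions; all thresholds and values are assumptions.

```python
def select_clean(losses, confidences, keep_ratio=0.5, conf_threshold=0.9):
    """Return indices treated as clean: small-loss OR high-confidence samples."""
    n_keep = int(len(losses) * keep_ratio)
    by_loss = sorted(range(len(losses)), key=lambda i: losses[i])
    small_loss = set(by_loss[:n_keep])
    high_conf = {i for i, c in enumerate(confidences) if c >= conf_threshold}
    return sorted(small_loss | high_conf)

def split_noisy(probs, noisy_indices, margin=0.5):
    """Hypothetical margin rule: a decisive top-1 prediction suggests closed-set
    noise (another known class); a flat prediction suggests open-set noise."""
    closed, open_ = [], []
    for i in noisy_indices:
        top = sorted(probs[i], reverse=True)
        (closed if top[0] - top[1] >= margin else open_).append(i)
    return closed, open_

# Hypothetical per-sample values; confidences are assumed to come from a
# separate source (e.g. averaged predictions), so they need not mirror losses.
losses = [0.1, 2.3, 0.4, 1.9, 0.2, 3.0]
confidences = [0.92, 0.05, 0.70, 0.95, 0.85, 0.10]
clean = select_clean(losses, confidences)
noisy = [i for i in range(len(losses)) if i not in clean]
probs = {1: [0.05, 0.90, 0.05], 5: [0.10, 0.45, 0.45]}
print(clean, split_noisy(probs, noisy))
```

Taking the union of the two criteria is what lets the selection keep more clean samples than a small-loss cut alone.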

[CV-43] Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation

【速读】：该论文试图解决机器人操作中的零样本泛化（zero-shot generalization）问题，即在不同的机器人、任务和环境中的泛化能力仍然是一个重大挑战。解决方案的关键在于提出了Robotic Programmer (RoboPro)，这是一个机器人基础模型（robotic foundation model），能够感知视觉信息并遵循自由形式的指令，通过生成策略代码（policy code）以零样本方式执行机器人操作。为了应对收集机器人任务运行时代码数据的低效和高成本问题，论文提出了Video2Code方法，利用现成的视觉-语言模型（vision-language model）和代码领域的大语言模型（code-domain large language model），从大量野外视频中合成可执行代码。实验结果表明，RoboPro在模拟器和真实环境中的机器人操作任务上均达到了最先进的零样本性能，特别是在RLBench上的零样本成功率比GPT-4o高出11.6%，甚至可与强监督训练基线相媲美。此外，RoboPro对API格式和技能集的变化表现出较强的鲁棒性。

链接: https://arxiv.org/abs/2501.04268
作者: Senwei Xie,Hongyu Wang,Zhanqi Xiao,Ruiping Wang,Xilin Chen
机构: Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS)(中国科学院计算技术研究所人工智能安全重点实验室); University of Chinese Academy of Sciences(中国科学院大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot generalization across various robots, tasks and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address low efficiency and high cost in collecting runtime code data for robotic tasks, we devise Video2Code to synthesize executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.

[CV-44] Continual Self-supervised Learning Considering Medical Domain Knowledge in Chest CT Images ICASSP2025

【速读】：该论文试图解决在胸部CT图像（chest CT images）中进行连续自监督学习（continual self-supervised learning, CSSL）时，如何有效捕捉先前学习知识与新信息之间的关系，并减少预训练过程中数据干扰的问题。解决方案的关键在于将增强的DER机制引入CSSL，通过保持DER重放缓冲区（rehearsal buffer）中数据的多样性和代表性，降低数据干扰风险，从而使模型能够学习到更丰富且鲁棒的特征表示。此外，论文还结合了混合策略（mixup strategy）和特征蒸馏（feature distillation），进一步提升了模型学习有意义表示的能力。通过在两种不同成像条件下获取的胸部CT图像上进行验证，该方法展示了优于现有技术的性能。

链接: https://arxiv.org/abs/2501.04217
作者: Ren Tasai,Guang Li,Ren Togo,Minghui Tang,Takaaki Yoshimura,Hiroyuki Sugimori,Kenji Hirata,Takahiro Ogawa,Kohsuke Kudo,Miki Haseyama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:We propose a novel continual self-supervised learning method (CSSL) considering medical domain knowledge in chest CT images. Our approach addresses the challenge of sequential learning by effectively capturing the relationship between previously learned knowledge and new information at different stages. By incorporating an enhanced DER into CSSL and maintaining both diversity and representativeness within the rehearsal buffer of DER, the risk of data interference during pretraining is reduced, enabling the model to learn richer and more robust feature representations. In addition, we incorporate a mixup strategy and feature distillation to further enhance the model’s ability to learn meaningful representations. We validate our method using chest CT images obtained under two different imaging conditions, demonstrating superior performance compared to state-of-the-art methods.

[CV-45] UPAQ: A Framework for Real-Time and Energy-Efficient 3D Object Detection in Autonomous Vehicles

【速读】：该论文旨在解决自动驾驶车辆（AVs）中3D物体检测器在资源受限的嵌入式平台上运行时的效率问题。传统的3D物体检测器虽然比2D检测器提供更全面的预测，但其内存占用和计算资源消耗较大。为此，论文提出了一种名为UPAQ的新框架，通过半结构化模式剪枝（semi-structured pattern pruning）和量化（quantization）技术，显著提升了基于LiDAR点云和相机的3D物体检测器的效率。实验结果表明，UPAQ在Jetson Orin Nano嵌入式平台上，相比现有的模型压缩框架，在Pointpillar和SMOKE模型上分别实现了最高5.62倍和5.13倍的模型压缩率，1.97倍和1.86倍的推理速度提升，以及2.07倍和1.87倍的能耗降低。

链接: https://arxiv.org/abs/2501.04213
作者: Abhishek Balasubramaniam,Febin P Sunny,Sudeep Pasricha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To enhance perception in autonomous vehicles (AVs), recent efforts are concentrating on 3D object detectors, which deliver more comprehensive predictions than traditional 2D object detectors, at the cost of increased memory footprint and computational resource usage. We present a novel framework called UPAQ, which leverages semi-structured pattern pruning and quantization to improve the efficiency of LiDAR point-cloud and camera-based 3D object detectors on resource-constrained embedded AV platforms. Experimental results on the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to 5.62x and 5.13x model compression rates, up to 1.97x and 1.86x boost in inference speed, and up to 2.07x and 1.87x reduction in energy consumption compared to state-of-the-art model compression frameworks, on the Pointpillar and SMOKE models respectively.
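
A minimal sketch of how pattern pruning and quantization can be combined on a single 3x3 convolution kernel. The four candidate patterns, the largest-L1-mass selection rule, and the symmetric per-kernel int8 scheme are generic assumptions chosen for illustration; they are not UPAQ's actual patterns or quantizer.

```python
# Candidate sparsity patterns: indices (row-major, 0..8) of the 4 weights kept.
PATTERNS = [
    (0, 1, 3, 4),
    (1, 2, 4, 5),
    (3, 4, 6, 7),
    (4, 5, 7, 8),
]

def pattern_prune(kernel):
    """kernel: flat list of 9 weights. Keep the pattern with the largest L1 mass."""
    best = max(PATTERNS, key=lambda p: sum(abs(kernel[i]) for i in p))
    return [w if i in best else 0.0 for i, w in enumerate(kernel)]

def quantize_int8(weights):
    """Symmetric int8 quantization; returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

# Hypothetical kernel weights.
kernel = [0.02, -0.50, 0.40, 0.01, 0.90, -0.30, 0.00, 0.05, 0.10]
pruned = pattern_prune(kernel)
codes, scale = quantize_int8(pruned)
print(pruned)
print(codes, scale)
```

Because every kernel keeps weights at one of a few fixed position sets, the sparsity is regular enough for hardware to exploit, unlike unstructured pruning.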

[CV-46] Recognition-Oriented Low-Light Image Enhancement based on Global and Pixelwise Optimization

【速读】：该论文旨在解决在低光照条件下图像识别模型性能不佳的问题。尽管深度学习领域取得了显著进展，但在低光照环境下进行图像识别仍然具有挑战性。现有的低光照图像增强方法主要关注提升人眼视觉的可见性，而非专门针对识别模型的性能优化。为此，论文提出了一种新颖的低光照图像增强方法，该方法包含两个关键模块：全局增强模块（Global Enhance Module）和像素级调整模块（Pixelwise Adjustment Module）。全局增强模块用于调整输入图像的整体亮度和色彩平衡，而像素级调整模块则在像素级别上细化图像特征。这些模块经过训练，能够有效增强输入图像，从而提升下游识别模型的性能。值得注意的是，该方法可以作为前端滤波器应用，无需重新训练下游识别模型即可提升低光照条件下的识别性能。实验结果表明，该方法显著提高了预训练识别模型在低光照条件下的性能，验证了其有效性。

链接: https://arxiv.org/abs/2501.04210
作者: Seitaro Ono,Yuka Ogino,Takahiro Toizumi,Atsushi Ito,Masato Tsukada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: accepted to VISAPP2025

点击查看摘要

Abstract:In this paper, we propose a novel low-light image enhancement method aimed at improving the performance of recognition models. Despite recent advances in deep learning, the recognition of images under low-light conditions remains a challenge. Although existing low-light image enhancement methods have been developed to improve image visibility for human vision, they do not specifically focus on enhancing recognition model performance. Our proposed low-light image enhancement method consists of two key modules: the Global Enhance Module, which adjusts the overall brightness and color balance of the input image, and the Pixelwise Adjustment Module, which refines image features at the pixel level. These modules are trained to enhance input images so as to effectively improve downstream recognition model performance. Notably, the proposed method can be applied as a frontend filter to improve low-light recognition performance without requiring retraining of downstream recognition models. Experimental results demonstrate that our method improves the performance of pretrained recognition models under low-light conditions, confirming its effectiveness.

[CV-47] LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition ICASSP2025

【速读】：该论文试图解决视觉语音识别（VSR，Visual Speech Recognition）模型在现实场景中因数据集局限性导致的鲁棒性问题。现有数据集主要包含稳定的视频记录，唇部运动变化有限，导致模型在实际应用中面对变化时表现不佳。为解决这一问题，论文提出了一种名为LipGen的新框架，其关键解决方案包括两个方面：首先，通过利用语音驱动的合成视觉数据来增强模型的鲁棒性，从而缓解现有数据集的局限性；其次，引入了一个辅助任务，结合了视素分类（viseme classification）和注意力机制，以更高效地整合时间信息，使模型能够聚焦于语音的相关片段，从而提升其判别能力。该方法在“Lip Reading in the Wild”（LRW）数据集上表现出优于当前最先进技术的性能，尤其在具有挑战性的条件下优势更为显著。

链接: https://arxiv.org/abs/2501.04204
作者: Bowen Hao,Dongliang Zhou,Xiaojie Li,Xingyu Zhang,Liang Xie,Jianlong Wu,Erwei Yin
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); National Institute of Defense Technology Innovation, Academy of Military Sciences, Beijing, China (国防科技创新研究院, 军事科学院, 北京); Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China (天津人工智能创新中心, 天津)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This paper has been accepted for presentation at ICASSP 2025

点击查看摘要

Abstract:Visual speech recognition (VSR), commonly known as lip reading, has garnered significant attention due to its wide-ranging practical applications. The advent of deep learning techniques and advancements in hardware capabilities have significantly enhanced the performance of lip reading models. Despite these advancements, existing datasets predominantly feature stable video recordings with limited variability in lip movements. This limitation results in models that are highly sensitive to variations encountered in real-world scenarios. To address this issue, we propose a novel framework, LipGen, which aims to improve model robustness by leveraging speech-driven synthetic visual data, thereby mitigating the constraints of current datasets. Additionally, we introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms. This approach facilitates the efficient integration of temporal information, directing the model’s focus toward the relevant segments of speech, thereby enhancing discriminative capabilities. Our method demonstrates superior performance compared to the current state-of-the-art on the lip reading in the wild (LRW) dataset and exhibits even more pronounced advantages under challenging conditions.

[CV-48] Generative Dataset Distillation Based on Self-knowledge Distillation ICASSP2025

【速读】：该论文试图解决数据集蒸馏（Dataset Distillation）过程中预测logits对齐精度不足的问题。数据集蒸馏是一种通过将大规模数据集压缩为更小、更高效的版本来降低模型训练成本和复杂度的技术。论文提出了一种新颖的生成式数据集蒸馏方法，通过集成自知识蒸馏（self-knowledge distillation）来实现合成数据与原始数据之间更精确的分布匹配，从而捕捉数据的整体结构和关系。解决方案的关键在于引入了一个标准化步骤，即在执行分布匹配之前对logits进行标准化处理，以确保logits范围的一致性。实验结果表明，该方法在蒸馏性能上优于现有的最先进方法。

链接: https://arxiv.org/abs/2501.04202
作者: Longzhen Li,Guang Li,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Dataset distillation is an effective technique for reducing the cost and complexity of model training while maintaining performance by compressing large datasets into smaller, more efficient versions. In this paper, we present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data, thereby capturing the overall structure and relationships within the data. To further improve the accuracy of alignment, we introduce a standardization step on the logits before performing distribution matching, ensuring consistency in the range of logits. Through extensive experiments, we demonstrate that our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
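
A minimal sketch of why standardizing logits before matching can help: two logit vectors with the same ranking but very different ranges give very different softmax distributions, and z-score standardization removes that range mismatch. The KL divergence used here is an assumed stand-in for the paper's distribution-matching objective, and the logit values are hypothetical.

```python
import math

def standardize(logits, eps=1e-8):
    """Z-score standardization: zero mean and unit variance across classes."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var) + eps
    return [(x - mean) / std for x in logits]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

original = [2.0, 1.0, 0.0]
synthetic = [20.0, 10.0, 0.0]  # same ranking, 10x larger range

raw = kl_divergence(softmax(original), softmax(synthetic))
std = kl_divergence(softmax(standardize(original)), softmax(standardize(synthetic)))
print(raw, std)
```

After standardization the two vectors coincide up to rounding, so the divergence reflects differences in the shape of the logits rather than in their scale.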

[CV-49] MedicalNarratives: Connecting Medical Vision and Language with Localized Narratives

【速读】：该论文旨在解决医学领域中缺乏足够规模的数据集来同时训练语义（semantic）和密集（dense）任务的问题。现有的数据集通常无法同时支持这两种任务的训练，导致模型在医学图像和文本理解方面的性能受限。为此，作者提出了MedicalNarratives数据集，该数据集从医学教学视频中收集，类似于“Think-Aloud”研究中的数据，并受到Localized Narratives的启发，通过同步记录教师的语音和鼠标光标移动来生成图像-文本对。MedicalNarratives包含470万对图像-文本数据，其中100万样本带有密集注释（如轨迹和边界框）。通过该数据集，作者训练了基于CLIP架构的GenMedClip模型，并在涵盖12个医学领域的新构建的医学影像基准测试中展示了其优于现有最先进模型的性能。

链接: https://arxiv.org/abs/2501.04184
作者: Wisdom O. Ikezogwo,Kevin Zhang,Mehmet Saygin Seyfioglu,Fatemeh Ghezloo,Linda Shapiro,Ranjay Krishna
机构: University of Washington(华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose MedicalNarratives, a dataset curated from medical pedagogical videos similar in nature to data collected in Think-Aloud studies and inspired by Localized Narratives, which collects grounded image-text data by curating instructors’ speech and mouse cursor movements synchronized in time. MedicalNarratives enables pretraining of both semantic and dense objectives, alleviating the need to train medical semantic and dense tasks disparately due to the lack of reasonably sized datasets. Our dataset contains 4.7M image-text pairs from videos and articles, with 1M samples containing dense annotations in the form of traces and bounding boxes. To evaluate the utility of MedicalNarratives, we train GenMedClip based on the CLIP architecture using our dataset spanning 12 medical domains and demonstrate that it outperforms previous state-of-the-art models on a newly constructed medical imaging benchmark that comprehensively evaluates performance across all modalities. Data, demo, code and models available at this https URL

[CV-50] Benchmarking Large and Small MLLMs

【速读】：该论文试图解决的问题是大型多模态语言模型（Large Multimodal Language Models, MLLMs）在部署过程中面临的挑战，包括推理速度慢、计算成本高以及难以在设备端应用等问题。同时，论文还探讨了小型多模态语言模型（Small MLLMs）在特定场景下的性能表现及其与大型模型的能力边界。解决方案的关键在于通过系统性和全面的评估，对小型和大型MLLMs进行基准测试，涵盖通用能力（如物体识别、时序推理和多模态理解）以及实际应用领域（如工业和汽车领域）。评估结果表明，小型MLLMs在特定场景下可以达到与大型模型相当的性能，但在需要深度推理或细致理解的复杂任务中表现明显不足。此外，论文还识别了小型和大型MLLMs的共同失败案例，揭示了当前最先进模型在某些领域中的局限性。这些发现旨在为研究社区提供指导，推动MLLMs的质量边界，提升其在不同应用中的可用性和有效性。

链接: https://arxiv.org/abs/2501.04150
作者: Xuelu Feng,Yunsheng Li,Dongdong Chen,Mei Gao,Mengchen Liu,Junsong Yuan,Chunming Qiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.

[CV-51] Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

【速读】：该论文旨在解决当前细粒度三维生成（fine-grained 3D generation）方法在细节丰富性和创造性方面的局限性。现有方法要么缺乏精细的细节，要么只能模仿现有物体，无法生成全新的、具有物种特异性细节的三维物体。论文通过将二维细粒度理解提升到三维空间，利用多视角扩散（multi-view diffusion）技术，并将部件潜在变量（part latents）建模为连续分布，从而实现了通过插值和采样生成全新的、合理的部件。此外，自监督特征一致性损失（self-supervised feature consistency loss）确保了这些未见部件的稳定生成。该方法的创新之处在于首次实现了能够生成超越现有样本的、具有物种特异性细节的全新三维物体。尽管论文以鸟类为例展示了该方法，但其框架可广泛应用于其他领域。

链接: https://arxiv.org/abs/2501.04144
作者: Kam Woh Ng,Jing Yang,Jia Wei Sii,Jiankang Deng,Chee Seng Chan,Yi-Zhe Song,Tao Xiang,Xiatian Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 20 pages

点击查看摘要

Abstract:In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects – we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at this https URL.
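
A minimal sketch of what treating part latents as continuous distributions allows: a new part latent can be obtained either by interpolating between two known ones or by sampling around one. The two-dimensional latents, the per-dimension Gaussian, and the names `interpolate` and `sample_latent` are assumptions for illustration, not the paper's parameterization.

```python
import random

def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two part latents; alpha=0 gives z_a, alpha=1 gives z_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

def sample_latent(mean, std, rng):
    """Draw a new part latent from a per-dimension Gaussian (assumed form)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

# Hypothetical latents for the same part (e.g. a wing) of two species.
wing_a = [0.0, 2.0]
wing_b = [1.0, 4.0]
print(interpolate(wing_a, wing_b, 0.5))

rng = random.Random(0)  # seeded so the draw is repeatable
print(sample_latent(wing_a, [0.1, 0.1], rng))
```

With discrete part codes, neither operation is defined; a continuous latent space is what makes in-between or freshly sampled parts possible.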

[CV-52] Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

【速读】：该论文旨在解决从第一人称视角（egocentric）视频中准确识别关键步骤（keystep recognition）的挑战。由于第一人称视频具有动态背景、频繁运动和遮挡等特点，传统的识别方法难以有效处理这些复杂情况。论文提出了一种灵活的图学习框架（graph-learning framework），通过构建图结构来捕捉视频片段之间的长期依赖关系，并在训练过程中利用第一人称视频与第三人称视角（exocentric）视频的对齐信息，以提升推理效果。具体而言，每个第一人称视频片段被视为图中的一个节点，若可用，第三人称视频片段也被作为额外节点加入图中。通过定义节点之间的连接策略，将关键步骤识别任务转化为图上的节点分类问题。实验结果表明，该框架在Ego-Exo4D数据集上的准确率显著优于现有方法，且所构建的图结构稀疏且计算高效。此外，论文还探讨了多模态特征（如叙述、深度和物体类别标签）在异构图中的贡献及其对关键步骤识别性能的影响。

链接: https://arxiv.org/abs/2501.04121
作者: Julia Lee Romero,Kyle Min,Subarna Tripathi,Morteza Karimzadeh
机构: University of Colorado Boulder(科罗拉多大学博尔德分校); Intel Labs(英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Egocentric videos capture scenes from a wearer’s viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos, and leverages alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as an additional node. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and computationally efficient. We also present a study on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph and discuss their corresponding contribution to the keystep recognition performance.
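
A minimal sketch of the graph construction described above: each clip is a node, temporal edges link nearby clips within a view, and alignment edges link an egocentric clip to the exocentric clips at the same index. This is only one possible connection strategy; the assumption that exocentric views are time-aligned with the same number of clips, and the name `build_graph`, are hypothetical.

```python
def build_graph(num_clips, num_exo_views=0, temporal_window=1):
    """Nodes are (view, clip_index) pairs; view is 'ego' or 'exo0', 'exo1', ...

    Edges: temporal edges within each view, plus ego-exo alignment edges
    between clips that share a clip index.
    """
    views = ["ego"] + [f"exo{v}" for v in range(num_exo_views)]
    nodes = [(view, t) for view in views for t in range(num_clips)]
    edges = set()
    for view in views:
        for t in range(num_clips):
            for dt in range(1, temporal_window + 1):
                if t + dt < num_clips:
                    edges.add(((view, t), (view, t + dt)))
    for v in range(num_exo_views):
        for t in range(num_clips):
            edges.add((("ego", t), (f"exo{v}", t)))
    return nodes, sorted(edges)

# Training graph with two exocentric views; inference graph with ego clips only.
train_nodes, train_edges = build_graph(4, num_exo_views=2)
test_nodes, test_edges = build_graph(4)
print(len(train_nodes), len(train_edges))
print(len(test_nodes), len(test_edges))
```

The edge count grows linearly with the number of clips, consistent with the abstract's remark that the constructed graphs are sparse. Keystep recognition then becomes classification of the `ego` nodes.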

[CV-53] NeRFs are Mirror Detectors: Using Structural Similarity for Multi-View Mirror Scene Reconstruction with 3D Surface Primitives

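[Quick read]: This paper tackles the difficulty that neural radiance fields (NeRF) have with scenes containing mirror reflections. Mirrors introduce severe inconsistencies into the scene representation, and existing methods either reconstruct only single reflective objects or rely on additional user-provided annotations, which limits practical use. The key idea of the proposed method, NeRF-MD, is to treat the NeRF itself as a mirror detector and to reconstruct mirror scenes in two training stages. First, a standard NeRF is trained with a depth reprojection loss to obtain an initial estimate of the scene geometry, and mirrors are identified by detecting regions of photometric inconsistency. In the second stage, the radiance field and the mirror geometry are jointly optimized to improve reconstruction quality. The method detects mirrors and reconstructs a consistent scene without prior annotations, and is reported to compare favorably with baseline and mirror-aware methods.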

Link: https://arxiv.org/abs/2501.04074
Authors: Leif Van Holland, Michael Weinmann, Jan U. Müller, Patrick Stotko, Reinhard Klein
Affiliations: University of Bonn; Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While neural radiance fields (NeRF) led to a breakthrough in photorealistic novel view synthesis, handling mirroring surfaces still denotes a particular challenge as they introduce severe inconsistencies in the scene representation. Previous attempts either focus on reconstructing single reflective objects or rely on strong supervision guidance in terms of additional user-provided annotations of visible image regions of the mirrors, thereby limiting the practical usability. In contrast, in this paper, we present NeRF-MD, a method which shows that NeRFs can be considered as mirror detectors and which is capable of reconstructing neural radiance fields of scenes containing mirroring surfaces without the need for prior annotations. To this end, we first compute an initial estimate of the scene geometry by training a standard NeRF using a depth reprojection loss. Our key insight lies in the fact that parts of the scene corresponding to a mirroring surface will still exhibit a significant photometric inconsistency, whereas the remaining parts are already reconstructed in a plausible manner. This allows us to detect mirror surfaces by fitting geometric primitives to such inconsistent regions in this initial stage of the training. Using this information, we then jointly optimize the radiance field and mirror geometry in a second training stage to refine their quality. We demonstrate the capability of our method to allow the faithful detection of mirrors in the scene as well as the reconstruction of a single consistent scene representation, and demonstrate its potential in comparison to baseline and mirror-aware approaches.

[CV-54] RadGPT: Constructing 3D Image-Text Tumor Datasets

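[Quick read]: This paper addresses the time-consuming and complex task radiologists face in writing tumor-related reports for large volumes of CT scans. The team proposes RadGPT, an Anatomy-Aware Vision-Language AI Agent that generates detailed tumor reports from CT scans. RadGPT first segments tumors (both benign cysts and malignant tumors) and the surrounding anatomical structures, then converts this information into structured and narrative reports. The reports cover tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. In evaluation on unseen hospital data, RadGPT shows high sensitivity/specificity for small-tumor (2 cm) detection (80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors), and sensitivity of 89% to 97% for large tumors. RadGPT also generated reports for 17 public datasets; after radiologist review and refinement, these form the first publicly available image-text 3D medical dataset, with over 1.8 million text tokens and 2.7 million images.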

Link: https://arxiv.org/abs/2501.04678
Authors: Pedro R. A. S. Bassi, Mehmet Can Yavuz, Kang Wang, Xiaoxi Chen, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Yang Yang, Alan Yuille, Zongwei Zhou
Affiliations: Johns Hopkins University; University of Bologna; Italian Institute of Technology; University of California, San Francisco; University of Illinois Urbana-Champaign; École Polytechnique Fédérale de Lausanne
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT can produce accurate reports, with high sensitivity/specificity for small tumor (2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation. RadGPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports’ accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 8,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and three pancreatic sub-segments annotated per-voxel; (2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present individual analyses of multiple tumors–rare in human-made reports. Importantly, 948 of the reports are for early-stage tumors.
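
For readers unfamiliar with the "sensitivity/specificity" notation used above (for example 80/73%), the following minimal sketch shows how the two figures are derived from detection counts. The function name and the counts in the usage note are hypothetical and chosen only for illustration; they are not data from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from confusion-matrix counts.

    tp: scans with a tumor that were flagged      (true positives)
    fn: scans with a tumor that were missed       (false negatives)
    tn: tumor-free scans correctly left unflagged (true negatives)
    fp: tumor-free scans that were flagged        (false positives)
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)  # fraction of real tumors that were found
    specificity = tn / (tn + fp)  # fraction of healthy scans left alone
    return sensitivity, specificity
```

With illustrative counts of 80 detected and 20 missed tumors, and 73 correct and 27 false alarms among tumor-free scans, `sensitivity_specificity(80, 20, 73, 27)` gives a sensitivity of 0.80 and a specificity of 0.73, which would be written "80/73%".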

[CV-55] HyFusion: Enhanced Reception Field Transformer for Hyperspectral Image Fusion

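[Quick read]: This paper addresses hyperspectral image (HSI) fusion: reconstructing a high-resolution hyperspectral image (HR-HSI) from a high-resolution multispectral image (HR-MSI) and a low-resolution hyperspectral image (LR-HSI). The central difficulty is that existing methods have limited receptive fields and underuse features, which hurts reconstruction quality. The proposed HyFusion framework makes three contributions. First, the HR-MSI and LR-HSI inputs are concatenated into a quasi-fused draft that preserves complementary spatial and spectral detail. Second, the Enhanced Reception Field Block (ERFB) combines shifting-window attention with dense connections to enlarge the receptive field, capture long-range dependencies, and reuse features to reduce information loss. Third, the Dual-Coupled Network (DCN) dynamically extracts high-frequency spectral and spatial features from the LR-HSI and HR-MSI for efficient cross-domain fusion. Together these improve data utilization, and HyFusion reaches state-of-the-art performance on HR-MSI/LR-HSI fusion.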

Link: https://arxiv.org/abs/2501.04665
Authors: Chia-Ming Lee, Yu-Fan Lin, Yu-Hao Ho, Li-Wei Kang, Chih-Chung Hsu
Affiliations: National Cheng Kung University; National Taiwan Normal University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IGARSS 2025

Abstract:Hyperspectral image (HSI) fusion addresses the challenge of reconstructing High-Resolution HSIs (HR-HSIs) from High-Resolution Multispectral images (HR-MSIs) and Low-Resolution HSIs (LR-HSIs), a critical task given the high costs and hardware limitations associated with acquiring high-quality HSIs. While existing methods leverage spatial and spectral relationships, they often suffer from limited receptive fields and insufficient feature utilization, leading to suboptimal performance. Furthermore, the scarcity of high-quality HSI data highlights the importance of efficient data utilization to maximize reconstruction quality. To address these issues, we propose HyFusion, a novel framework designed to enhance the receptive field and enable effective feature map reusing, thereby maximizing data utilization. First, HR-MSI and LR-HSI inputs are concatenated to form a quasi-fused draft, preserving complementary spatial and spectral details. Next, the Enhanced Reception Field Block (ERFB) is introduced, combining shifting-window attention and dense connections to expand the receptive field, effectively capturing long-range dependencies and reusing features to reduce information loss, thereby boosting data efficiency. Finally, the Dual-Coupled Network (DCN) dynamically extracts high-frequency spectral and spatial features from LR-HSI and HR-MSI, ensuring efficient cross-domain fusion. Extensive experiments demonstrate that HyFusion achieves state-of-the-art performance in HR-MSI/LR-HSI fusion, significantly improving reconstruction quality while maintaining a compact model size and computational efficiency. By integrating enhanced receptive fields and feature map reusing, HyFusion provides a practical and effective solution for HSI fusion in resource-constrained scenarios, setting a new benchmark in hyperspectral imaging. Our code will be publicly available.

[CV-56] Comprehensive Examination of Unrolled Networks for Linear Inverse Problems

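[Quick read]: This paper addresses the complexity and computational cost of designing unrolled networks. Unrolled networks perform well on computer vision and computational imaging tasks, but adapting them to a new application confronts the designer with many decisions, such as choosing the optimization algorithm, defining the loss function, and setting the number of convolutional layers. Evaluating each choice requires time-consuming simulations to train and tune the network, so exploring options and finding the best configuration is slow and compute-intensive. The paper's goals are to unify some of the ideas and methodologies used in unrolled networks, thereby reducing the number of design choices a user must make, and to report a comprehensive ablation study on the impact of each design decision, with practical recommendations based on the findings. The key to the approach is a systematic methodology backed by empirical study, which simplifies the design process and makes network design more efficient.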

Link: https://arxiv.org/abs/2501.04608
Authors: Eric Chen, Xi Chen, Arian Maleki, Shirin Jalali
Affiliations: Columbia; Rutgers
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages, 10 figures. Project Page: this https URL

Abstract:Unrolled networks have become prevalent in various computer vision and imaging tasks. Although they have demonstrated remarkable efficacy in solving specific computer vision and computational imaging tasks, their adaptation to other applications presents considerable challenges. This is primarily due to the multitude of design decisions that practitioners working on new applications must navigate, each potentially affecting the network’s overall performance. These decisions include selecting the optimization algorithm, defining the loss function, and determining the number of convolutional layers, among others. Compounding the issue, evaluating each design choice requires time-consuming simulations to train, fine-tune the neural network, and optimize for its performance. As a result, the process of exploring multiple options and identifying the optimal configuration becomes time-consuming and computationally demanding. The main objectives of this paper are (1) to unify some ideas and methodologies used in unrolled networks to reduce the number of design choices a user has to make, and (2) to report a comprehensive ablation study to discuss the impact of each of the choices involved in designing unrolled networks and present practical recommendations based on our findings. We anticipate that this study will help scientists and engineers design unrolled networks for their applications and diagnose problems within their networks efficiently.

[CV-57] SplineFormer: An Explainable Transformer-Based Approach for Autonomous Endovascular Navigation

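[Quick read]: This paper addresses how to accurately predict the changing shape of a guidewire as it navigates through blood vessels in minimally invasive procedures. Traditional segmentation methods fall short on real-time shape prediction and cope poorly with highly dynamic environments. The authors propose SplineFormer, a transformer-based architecture designed to predict the guidewire's continuous, smooth shape in an explainable way. The key is to use the transformer to capture the guidewire's intricate bending and twisting and to represent it as a spline, which improves accuracy and smoothness. With SplineFormer integrated into an end-to-end robot navigation system, experiments show that the model can perform endovascular navigation autonomously, reaching a 50% success rate when cannulating the brachiocephalic artery on a real robot.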

Link: https://arxiv.org/abs/2501.04515
Authors: Tudor Jianu, Shayan Doust, Mengyun Li, Baoru Huang, Tuong Do, Hoan Nguyen, Karl Bates, Tung D. Ta, Sebastiano Fichera, Pierre Berthet-Rayne, Anh Nguyen
Affiliations: Department of Computer Science, University of Liverpool, UK; University of Information Technology, Vietnam; Faculty of Health and Life Sciences, University of Liverpool, UK; University of Tokyo, Japan; Department of Mechanical, Materials and Aerospace Engineering, University of Liverpool, UK; Honorary Fellow, University of Liverpool, UK; 3IA Cote d’Azur, Sophia Antipolis, France
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages

Abstract:Endovascular navigation is a crucial aspect of minimally invasive procedures, where precise control of curvilinear instruments like guidewires is critical for successful interventions. A key challenge in this task is accurately predicting the evolving shape of the guidewire as it navigates through the vasculature, which presents complex deformations due to interactions with the vessel walls. Traditional segmentation methods often fail to provide accurate real-time shape predictions, limiting their effectiveness in highly dynamic environments. To address this, we propose SplineFormer, a new transformer-based architecture, designed specifically to predict the continuous, smooth shape of the guidewire in an explainable way. By leveraging the transformer’s ability, our network effectively captures the intricate bending and twisting of the guidewire, representing it as a spline for greater accuracy and smoothness. We integrate our SplineFormer into an end-to-end robot navigation system by leveraging the condensed information. The experimental results demonstrate that our SplineFormer is able to perform endovascular navigation autonomously and achieves a 50% success rate when cannulating the brachiocephalic artery on the real robot.

[CV-58] The Role of Machine Learning in Congenital Heart Disease Diagnosis: Datasets, Algorithms, and Insights

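[Quick read]: This paper addresses the early detection and management of congenital heart disease (CHD). Although many risk factors for its onset have been identified, understanding of its genesis and management across diverse populations remains limited. Through a systematic review and meta-analysis, the paper surveys 432 references published between 2018 and 2024, examines 74 scholarly works in detail, and discusses how machine learning is applied to CHD recognition. The central theme is using machine learning algorithms and patient data to build data-driven solutions for early detection and recognition. The paper also surveys the datasets that machine learning experts use for CHD recognition and identifies the key challenges and opportunities in applying machine learning techniques.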

Link: https://arxiv.org/abs/2501.04493
Authors: Khalil Khan, Farhan Ullah, Ikram Syed, Irfan Ullah
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Congenital heart disease is among the most common fetal abnormalities and birth defects. Despite identifying numerous risk factors influencing its onset, a comprehensive understanding of its genesis and management across diverse populations remains limited. Recent advancements in machine learning have demonstrated the potential for leveraging patient data to enable early congenital heart disease detection. Over the past seven years, researchers have proposed various data-driven and algorithmic solutions to address this challenge. This paper presents a systematic review of congenital heart disease recognition using machine learning, conducting a meta-analysis of 432 references from leading journals published between 2018 and 2024. A detailed investigation of 74 scholarly works highlights key factors, including databases, algorithms, applications, and solutions. Additionally, the survey outlines reported datasets used by machine learning experts for congenital heart disease recognition. Using a systematic literature review methodology, this study identifies critical challenges and opportunities in applying machine learning to congenital heart disease.

[CV-59] Rapid Automated Mapping of Clouds on Titan With Instance Segmentation

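[Quick read]: This paper addresses the automated analysis of cloud imagery on Titan, Saturn's largest moon, using a deep learning model. Such image data have traditionally been interpreted by hand, which is slow and inefficient. The paper applies a Mask R-CNN (Mask Region-based Convolutional Neural Network) trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft. The method automatically extracts quantitative cloud measures such as area and centroid, making analysis considerably more efficient. The paper reports accuracy comparable to cloud identification studies on Earth and other worlds, and notes the advantages of transfer learning for Titan-specific challenges. By comparing the efficiency of human-driven and algorithmic analysis, it highlights the broad potential of machine learning in planetary science, especially for the large volumes of image data expected from future missions.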

Link: https://arxiv.org/abs/2501.04459
Authors: Zachary Yahn, Douglas M Trent, Ethan Duncan, Benoît Seignovert, John Santerre, Conor Nixon
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Despite widespread adoption of deep learning models to address a variety of computer vision tasks, planetary science has yet to see extensive utilization of such tools to address its unique problems. On Titan, the largest moon of Saturn, tracking seasonal trends and weather patterns of clouds provides crucial insights into one of the most complex climates in the Solar System, yet much of the available image data are still analyzed in a conventional way. In this work, we apply a Mask R-CNN trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft - a previously unexplored approach to a big data problem in planetary science. We demonstrate that an automated technique can provide quantitative measures for clouds, such as areas and centroids, that may otherwise be prohibitively time-intensive to produce by human mapping. Furthermore, despite Titan specific challenges, our approach yields accuracy comparable to contemporary cloud identification studies on Earth and other worlds. We compare the efficiencies of human-driven versus algorithmic approaches, showing that transfer learning provides speed-ups that may open new horizons for data investigation for Titan. Moreover, we suggest that such approaches have broad potential for application to similar problems in planetary science where they are currently under-utilized. Future planned missions to the planets and remote sensing initiatives for the Earth promise to provide a deluge of image data in the coming years that will benefit strongly from leveraging machine learning approaches to perform the analysis.
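
The quantitative measures mentioned above (area and centroid) follow directly from an instance mask once a segmentation model has produced one. The sketch below assumes the mask is a plain 2D list of 0/1 values for a single cloud instance; the function name is hypothetical, and a real pipeline would also need to convert pixel units into physical units on Titan's surface, which is not shown.

```python
def mask_area_and_centroid(mask):
    """Return (area_in_pixels, (row_centroid, col_centroid)) for a binary mask.

    `mask` is a 2D list where truthy entries belong to one cloud instance.
    The centroid is None when the mask is empty.
    """
    area = 0
    row_sum = 0
    col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                area += 1
                row_sum += r
                col_sum += c
    if area == 0:
        return 0, None
    # The centroid is the mean pixel coordinate over the instance.
    return area, (row_sum / area, col_sum / area)
```

Applied to each predicted instance in each image, a routine like this yields a table of cloud areas and positions that can be tracked over time.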

[CV-60] A Unified Framework for Foreground and Anonymization Area Segmentation in CT and MRI Data

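[Quick read]: This paper addresses key challenges in preprocessing 3D medical imaging data for self-supervised learning (SSL), in particular data privacy and computational efficiency. The core of the solution is an open-source toolkit with two main components: a network that segments foreground regions to optimize data sampling and reduce training time, and a network that identifies anonymized regions to prevent erroneous supervision in reconstruction-based SSL methods. Experiments show high robustness, with mean Dice scores above 98.5 across anonymization methods and above 99.5 for foreground segmentation, supporting SSL applications in 3D medical imaging for both CT and MRI.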

Link: https://arxiv.org/abs/2501.04361
Authors: Michal Nohel, Constantin Ulrich, Jonathan Suprijadi, Tassilo Wald, Klaus Maier-Hein
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages

Abstract:This study presents an open-source toolkit to address critical challenges in preprocessing data for self-supervised learning (SSL) for 3D medical imaging, focusing on data privacy and computational efficiency. The toolkit comprises two main components: a segmentation network that delineates foreground regions to optimize data sampling and thus reduce training time, and a segmentation network that identifies anonymized regions, preventing erroneous supervision in reconstruction-based SSL methods. Experimental results demonstrate high robustness, with mean Dice scores exceeding 98.5 across all anonymization methods and surpassing 99.5 for foreground segmentation tasks, highlighting the efficacy of the toolkit in supporting SSL applications in 3D medical imaging for both CT and MRI images. The weights and code are available at this https URL.
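
The Dice score quoted above measures the overlap between a predicted mask and a reference mask. The following minimal sketch uses the standard definition, 2|A ∩ B| / (|A| + |B|), on 2D lists of 0/1 values. The function name is hypothetical, the convention of returning 1.0 for two empty masks is an assumption, and the result is a fraction between 0 and 1, whereas the abstract appears to report the score on a 0 to 100 scale.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as 2D lists."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of elements")

    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    total = sum(pred_flat) + sum(target_flat)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total
```

Multiplying the returned value by 100 gives a figure comparable in scale to the 98.5 and 99.5 quoted in the abstract.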

[CV-61] GRAPHITE: Graph-Based Interpretable Tissue Examination for Enhanced Explainability in Breast Cancer Histopathology

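[Quick read]: This paper addresses the interpretability of deep learning models in medical histopathology, with the aim of improving their clinical trustworthiness and usability in cancer diagnosis. The black-box nature of these models limits their clinical adoption. The authors propose GRAPHITE (Graph-based Interpretable Tissue Examination), a post-hoc explainability framework designed for breast cancer tissue microarray (TMA) analysis. GRAPHITE takes a multiscale approach: it extracts tissue patches at several magnification levels, builds a hierarchical graph, and uses graph attention networks (GAT) with scalewise attention (SAN) to capture scale-dependent features. Trained on 140 tumour TMA cores and four benign whole slide images, and tested on 53 pathologist-annotated TMA samples, GRAPHITE outperforms traditional explainable-AI methods on mean average precision (mAP), area under the receiver operating characteristic curve (AUROC), and threshold robustness (ThR). For clinical decision support, it also achieves the highest area under the decision curve (AUDC), indicating reliable decision support across thresholds. These results suggest that GRAPHITE has potential for clinical use in computational pathology, providing visual explanations that align with pathologists' diagnostic reasoning and supporting precision medicine.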

Link: https://arxiv.org/abs/2501.04206
Authors: Raktim Kumar Mondol, Ewan K. A. Millar, Peter H. Graham, Lois Browne, Arcot Sowmya, Erik Meijering
Affiliations: University of New South Wales; NSW Health Pathology; St George Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 Pages, 9 Figures, 1 Table

Abstract:Explainable AI (XAI) in medical histopathology is essential for enhancing the interpretability and clinical trustworthiness of deep learning models in cancer diagnosis. However, the black-box nature of these models often limits their clinical adoption. We introduce GRAPHITE (Graph-based Interpretable Tissue Examination), a post-hoc explainable framework designed for breast cancer tissue microarray (TMA) analysis. GRAPHITE employs a multiscale approach, extracting patches at various magnification levels, constructing a hierarchical graph, and utilising graph attention networks (GAT) with scalewise attention (SAN) to capture scale-dependent features. We trained the model on 140 tumour TMA cores and four benign whole slide images from which 140 benign samples were created, and tested it on 53 pathologist-annotated TMA samples. GRAPHITE outperformed traditional XAI methods, achieving a mean average precision (mAP) of 0.56, an area under the receiver operating characteristic curve (AUROC) of 0.94, and a threshold robustness (ThR) of 0.70, indicating that the model maintains high performance across a wide range of thresholds. In clinical utility, GRAPHITE achieved the highest area under the decision curve (AUDC) of 4.17e+5, indicating reliable decision support across thresholds. These results highlight GRAPHITE’s potential as a clinically valuable tool in computational pathology, providing interpretable visualisations that align with the pathologists’ diagnostic reasoning and support precision medicine.

[CV-62] Machine Learning for Identifying Grain Boundaries in Scanning Electron Microscopy (SEM) Images of Nanoparticle Superlattices

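[Quick read]: This paper addresses the automated quantitative analysis of microstructural features of nanoparticle superlattices, such as grain boundaries, lattice defects, and pores. Traditional manual analysis is time-consuming and error-prone, and scales poorly to large datasets. The authors propose a machine learning workflow that automates grain segmentation in scanning electron microscopy (SEM) images. The key is to combine signal processing techniques (such as the Radon transform) with unsupervised learning (such as agglomerative hierarchical clustering), turning raw pixel data into an explainable numerical representation of superlattice orientations so that grains can be identified and segmented without manually annotated data. The workflow is robust to noisy images and edge cases, processes four images per minute, scales to large datasets, and offers a useful tool for data-driven decision-making in material design and analysis.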

Link: https://arxiv.org/abs/2501.04172
Authors: Aanish Paruchuri, Carl Thrasher, A. J. Hart, Robert Macfarlane, Arthi Jayaraman
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Nanoparticle superlattices consisting of ordered arrangements of nanoparticles exhibit unique optical, magnetic, and electronic properties arising from nanoparticle characteristics as well as their collective behaviors. Understanding how processing conditions influence the nanoscale arrangement and microstructure is critical for engineering materials with desired macroscopic properties. Microstructural features such as grain boundaries, lattice defects, and pores significantly affect these properties but are challenging to quantify using traditional manual analyses as they are labor-intensive and prone to errors. In this work, we present a machine learning workflow for automating grain segmentation in scanning electron microscopy (SEM) images of nanoparticle superlattices. This workflow integrates signal processing techniques, such as Radon transforms, with unsupervised learning methods like agglomerative hierarchical clustering to identify and segment grains without requiring manually annotated data. In the workflow we transform the raw pixel data into an explainable numerical representation of superlattice orientations for clustering. Benchmarking results demonstrate the workflow’s robustness against noisy images and edge cases, with a processing speed of four images per minute on standard computational hardware. This efficiency makes the workflow scalable to large datasets and makes it a valuable tool for integrating data-driven models into decision-making processes for material design and analysis. For example, one can use this workflow to quantify grain size distributions at varying processing conditions like temperature and pressure, and use that knowledge to adjust processing conditions to achieve desired superlattice orientations and grain sizes.

[CV-63] Deep Learning for Ophthalmology: The State-of-the-Art and Future Trends

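[Quick read]: This paper reviews cutting-edge applications of artificial intelligence (AI), particularly deep learning (DL), in ophthalmology, especially the diagnosis and treatment of posterior segment eye diseases. It asks how AI can improve diagnostic accuracy, optimize treatment strategies, and improve overall patient care for these diseases. The review covers advanced deep learning architectures such as convolutional neural networks (CNNs), attention mechanisms, and transformer-based models. It also highlights the challenges of integrating AI solutions into clinical practice, including ensuring data diversity, improving algorithm transparency, and making effective use of multimodal data, and calls for collaborative effort to overcome these barriers and realize AI's full potential in eye care.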

Link: https://arxiv.org/abs/2501.04073
Authors: Duy M. H. Nguyen, Hasan Md Tusfiqur Alam, Tai Nguyen, Devansh Srivastav, Hans-Juergen Profitlich, Ngan Le, Daniel Sonntag
Affiliations: German Research Center for Artificial Intelligence (DFKI); International Max Planck Research School for Intelligent Systems (IMPRS-IS); Department of Computer Science, University of Stuttgart, Germany; Department of Computer Science and Computer Engineering, University of Arkansas, USA; Department of Applied Artificial Intelligence, Oldenburg University, Germany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: First version

Abstract:The emergence of artificial intelligence (AI), particularly deep learning (DL), has marked a new era in the realm of ophthalmology, offering transformative potential for the diagnosis and treatment of posterior segment eye diseases. This review explores the cutting-edge applications of DL across a range of ocular conditions, including diabetic retinopathy, glaucoma, age-related macular degeneration, and retinal vessel segmentation. We provide a comprehensive overview of foundational ML techniques and advanced DL architectures, such as CNNs, attention mechanisms, and transformer-based models, highlighting the evolving role of AI in enhancing diagnostic accuracy, optimizing treatment strategies, and improving overall patient care. Additionally, we present key challenges in integrating AI solutions into clinical practice, including ensuring data diversity, improving algorithm transparency, and effectively leveraging multimodal data. This review emphasizes AI’s potential to improve disease diagnosis and enhance patient care while stressing the importance of collaborative efforts to overcome these barriers and fully harness AI’s impact in advancing eye care.

Artificial Intelligence

[AI-0] Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding

Link: https://arxiv.org/abs/2501.04693
Authors: Joshua Jones, Oier Mees, Carmelo Sferrazza, Kyle Stachowicz, Pieter Abbeel, Sergey Levine
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities – including vision, touch, and audio – to fill in gaps from partial observation. For example, when vision is occluded reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.

[AI-1] Knowledge Retrieval Based on Generative AI

Link: https://arxiv.org/abs/2501.04635
Authors: Te-Lun Yang, Jyi-Shane Liu, Yuen-Hsien Tseng, Jyh-Shing Roger Jang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 13 figures, 1 table

Abstract:This study develops a question-answering system based on Retrieval-Augmented Generation (RAG) using Chinese Wikipedia and Lawbank as retrieval sources. Using TTQA and TMMLU+ as evaluation datasets, the system employs BGE-M3 for dense vector retrieval to obtain highly relevant search results and BGE-reranker to reorder these results based on query relevance. The most pertinent retrieval outcomes serve as reference knowledge for a Large Language Model (LLM), enhancing its ability to answer questions and establishing a knowledge retrieval system grounded in generative AI. The system’s effectiveness is assessed through a two-stage evaluation: automatic and assisted performance evaluations. The automatic evaluation calculates accuracy by comparing the model’s auto-generated labels with ground truth answers, measuring performance under standardized conditions without human intervention. The assisted performance evaluation involves 20 finance-related multiple-choice questions answered by 20 participants without financial backgrounds. Initially, participants answer independently. Later, they receive system-generated reference information to assist in answering, examining whether the system improves accuracy when assistance is provided. The main contributions of this research are: (1) Enhanced LLM Capability: By integrating BGE-M3 and BGE-reranker, the system retrieves and reorders highly relevant results, reduces hallucinations, and dynamically accesses authorized or public knowledge sources. (2) Improved Data Privacy: A customized RAG architecture enables local operation of the LLM, eliminating the need to send private data to external servers. This approach enhances data security, reduces reliance on commercial services, lowers operational costs, and mitigates privacy risks. 
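
The retrieve, rerank, then prompt pipeline described above can be sketched as follows. This is a toy illustration, not the authors' system: `embed` is a bag-of-words stand-in for a dense encoder such as BGE-M3, `rerank` is a term-overlap stand-in for a reranker such as BGE-reranker, and all function names and the prompt format are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, top_k):
    """Stage 1: rank every passage by vector similarity and keep the top_k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]


def rerank(query, candidates, top_n):
    """Stage 2: rescore the shortlist with a (toy) query-aware relevance score."""
    q_terms = set(query.lower().split())

    def score(doc):
        if not q_terms:
            return 0.0
        return len(q_terms & set(doc.lower().split())) / len(q_terms)

    return sorted(candidates, key=score, reverse=True)[:top_n]


def build_prompt(query, passages):
    """Stage 3: hand the best passages to the LLM as reference knowledge."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Reference knowledge:\n{context}\n\nQuestion: {query}"
```

The two-stage design keeps the expensive relevance model off the full corpus: the cheap vector search narrows the candidates, and only that shortlist is rescored before being placed in the prompt.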

[AI-2] MedCoDi-M: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation

Link: https://arxiv.org/abs/2501.04614
Authors: Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Artificial Intelligence is revolutionizing medical practice, enhancing diagnostic accuracy and healthcare delivery. However, its adaptation in medical settings still faces significant challenges, related to data availability and privacy constraints. Synthetic data has emerged as a promising solution to mitigate these issues, addressing data scarcity while preserving privacy. Recently, Latent Diffusion Models have emerged as a powerful tool for generating high-quality synthetic data. Meanwhile, the integration of different modalities has gained interest, emphasizing the need for models capable of handling multimodal medical data. Existing approaches struggle to integrate complementary information and lack the ability to generate modalities simultaneously. To address this challenge, we present MedCoDi-M, a 6.77-billion-parameter model, designed for multimodal medical data generation, that, following the Foundation Model paradigm, exploits contrastive learning and a large quantity of data to build a shared latent space that captures the relationships between different data modalities. Further, we introduce the Multi-Prompt training technique, which significantly boosts MedCoDi-M’s generation under different settings. We extensively validate MedCoDi-M: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we assess the utility of MedCoDi-M in addressing key challenges in the medical field, such as anonymization, data scarcity and imbalance learning. The results are promising, demonstrating the applicability of MedCoDi-M in medical contexts. Project page is at this https URL.

[AI-3] Federated-Continual Dynamic Segmentation of Histopathology guided by Barlow Continuity

Link: https://arxiv.org/abs/2501.04588
Authors: Niklas Babendererde,Haozhe Zhu,Moritz Fuchs,Jonathan Stieber,Anirban Mukhopadhyay
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated- and Continual Learning have been established as approaches to enable privacy-aware learning on continuously changing data, as required for deploying AI systems in histopathology images. However, data shifts can occur in a dynamic world, spatially between institutions and temporally, due to changing data over time. This leads to two issues: Client Drift, where the central model degrades from aggregating data from clients trained on shifted data, and Catastrophic Forgetting, from temporal shifts such as changes in patient populations. Both tend to degrade the model’s performance of previously seen data or spatially distributed training. Despite both problems arising from the same underlying problem of data shifts, existing research addresses them only individually. In this work, we introduce a method that can jointly alleviate Client Drift and Catastrophic Forgetting by using our proposed Dynamic Barlow Continuity that evaluates client updates on a public reference dataset and uses this to guide the training process to a spatially and temporally shift-invariant model. We evaluate our approach on the histopathology datasets BCSS and Semicol and prove our method to be highly effective by jointly improving the dice score as much as from 15.8% to 71.6% in Client Drift and from 42.5% to 62.8% in Catastrophic Forgetting. This enables Dynamic Learning by establishing spatio-temporal shift-invariance.

[AI-4] A 65 nm Bayesian Neural Network Accelerator with 360 fJ/Sample In-Word GRNG for AI Uncertainty Estimation

Link: https://arxiv.org/abs/2501.04577
Authors: Zephan M. Enciso,Boyang Cheng,Likai Pei,Jianbo Liu,Steven Davis,Ningyuan Cao,Michael Niemier
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages, 12 figures

Abstract:Uncertainty estimation is an indispensable capability for AI-enabled, safety-critical applications, e.g. autonomous vehicles or medical diagnosis. Bayesian neural networks (BNNs) use Bayesian statistics to provide both classification predictions and uncertainty estimation, but they suffer from high computational overhead associated with random number generation and repeated sample iterations. Furthermore, BNNs are not immediately amenable to acceleration through compute-in-memory architectures due to the frequent memory writes necessary after each RNG operation. To address these challenges, we present an ASIC that integrates 360 fJ/Sample Gaussian RNG directly into the SRAM memory words. This integration reduces RNG overhead and enables fully-parallel compute-in-memory operations for BNNs. The prototype chip achieves 5.12 GSa/s RNG throughput and 102 GOp/s neural network throughput while occupying 0.45 mm2, bringing AI uncertainty estimation to edge computation.
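The cost the abstract attributes to BNNs (one random draw per weight and one forward pass per sample) can be seen in a small software sketch. This is not the paper's ASIC design: the single-weight "network", the sigmoid output, and all names and parameter values below are assumptions chosen only to illustrate Monte Carlo sampling over Gaussian weights.

```python
import math
import random
import statistics


def sigmoid(z: float) -> float:
    """Logistic function used as the output nonlinearity of the toy model."""
    return 1.0 / (1.0 + math.exp(-z))


def bnn_predict(x, weight_mean, weight_std, num_samples, rng):
    """Monte Carlo prediction for a one-weight Bayesian 'network' (hypothetical toy).

    Each iteration draws a fresh Gaussian weight and reruns the forward pass,
    which is the repeated RNG plus sampling overhead mentioned in the abstract.
    Returns the mean prediction and its spread as an uncertainty estimate.
    """
    outputs = []
    for _ in range(num_samples):
        w = rng.gauss(weight_mean, weight_std)  # one Gaussian draw per weight
        outputs.append(sigmoid(w * x))          # one forward pass per draw
    return statistics.mean(outputs), statistics.pstdev(outputs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
mean, spread = bnn_predict(x=1.5, weight_mean=0.8, weight_std=0.5,
                           num_samples=1000, rng=rng)
print(f"prediction={mean:.3f}, uncertainty={spread:.3f}")
```

With `weight_std` set to zero, every sample is identical and the reported spread collapses to zero, which is the deterministic-network special case.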

[AI-5] Cyber-Physical Steganography in Robotic Motion Control

Link: https://arxiv.org/abs/2501.04541
Authors: Ching-Chun Chang,Yijie Lin,Isao Echizen
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Steganography, the art of information hiding, has continually evolved across visual, auditory and linguistic domains, adapting to the ceaseless interplay between steganographic concealment and steganalytic revelation. This study seeks to extend the horizons of what constitutes a viable steganographic medium by introducing a steganographic paradigm in robotic motion control. Based on the observation of the robot’s inherent sensitivity to changes in its environment, we propose a methodology to encode messages as environmental stimuli influencing the motions of the robotic agent and to decode messages from the resulting motion trajectory. The constraints of maximal robot integrity and minimal motion deviation are established as fundamental principles underlying secrecy. As a proof of concept, we conduct experiments in simulated environments across various manipulation tasks, incorporating robotic embodiments equipped with generalist multimodal policies.

[AI-6] Towards a Problem-Oriented Domain Adaptation Framework for Machine Learning

Link: https://arxiv.org/abs/2501.04528
Authors: Philipp Spitzer,Dominik Martin,Laurin Eichberger,Niklas Kühl
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Domain adaptation is a sub-field of machine learning that involves transferring knowledge from a source domain to perform the same task in the target domain. It is a typical challenge in machine learning that arises, e.g., when data is obtained from various sources or when using a data basis that changes over time. Recent advances in the field offer promising methods, but it is still challenging for researchers and practitioners to determine if domain adaptation is suitable for a given problem – and, subsequently, to select the appropriate approach. This article employs design science research to develop a problem-oriented framework for domain adaptation, which is matured in three evaluation episodes. We describe a framework that distinguishes between five domain adaptation scenarios, provides recommendations for addressing each scenario, and offers guidelines for determining if a problem falls into one of these scenarios. During the multiple evaluation episodes, the framework is tested on artificial and real-world datasets and an experimental study involving 100 participants. The evaluation demonstrates that the framework has the explanatory power to capture any domain adaptation problem effectively. In summary, we provide clear guidance for researchers and practitioners who want to employ domain adaptation but lack in-depth knowledge of the possibilities.

[AI-7] CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection

Link: https://arxiv.org/abs/2501.04510
Authors: Ruijun Feng,Hammond Pearce,Pietro Liguori,Yulei Sui
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures

Abstract:Large language models (LLMs) have been proposed as powerful tools for detecting software vulnerabilities, where task-specific fine-tuning is typically employed to provide vulnerability-specific knowledge to the LLMs for this purpose. However, traditional full-parameter fine-tuning is inefficient for modern, complex LLMs, which contain billions of parameters. Soft prompt tuning has been suggested as a more efficient alternative for fine-tuning LLMs in general cases. However, pure soft prompt tuning treats source code as plain text, losing structural information inherent in source code. Meanwhile, graph-enhanced soft prompt tuning methods, which aim to address this issue, are unable to preserve the rich semantic information within code graphs, as they are primarily designed for general graph-related tasks and focus more on adjacency information. They also fail to ensure computational efficiency while accounting for graph-text interactions. This paper, therefore, introduces a new code graph-enhanced, structure-aware soft prompt tuning method for vulnerability detection, referred to as CGP-Tuning. It employs innovative type-aware embeddings to capture the rich semantic information within code graphs, along with a novel and efficient cross-modal alignment module that achieves linear computational cost while incorporating graph-text interactions. The proposed CGP-Tuning is evaluated on the latest DiverseVul dataset and the most recent open-source code LLMs, CodeLlama and CodeGemma. Experimental results demonstrate that CGP-Tuning outperforms the best state-of-the-art method by an average of 3.5 percentage points in accuracy, without compromising its vulnerability detection capabilities for long source code. 

[AI-8] Integrating remote sensing data assimilation deep learning and large language model for interactive wheat breeding yield prediction

Link: https://arxiv.org/abs/2501.04487
Authors: Guofeng Yang,Nanfei Jin,Wenjie Ai,Zhonghua Zheng,Yuhong He,Yong He
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Yield is one of the core goals of crop breeding. By predicting the potential yield of different breeding materials, breeders can screen these materials at various growth stages to select the best performing. Based on unmanned aerial vehicle remote sensing technology, high-throughput crop phenotyping data in breeding areas is collected to provide data support for the breeding decisions of breeders. However, the accuracy of current yield predictions still requires improvement, and the usability and user-friendliness of yield forecasting tools remain suboptimal. To address these challenges, this study introduces a hybrid method and tool for crop yield prediction, designed to allow breeders to interactively and accurately predict wheat yield by chatting with a large language model (LLM). First, the newly designed data assimilation algorithm is used to assimilate the leaf area index into the WOFOST model. Then, selected outputs from the assimilation process, along with remote sensing inversion results, are used to drive the time-series temporal fusion transformer model for wheat yield prediction. Finally, based on this hybrid method and leveraging an LLM with retrieval augmented generation technology, we developed an interactive yield prediction Web tool that is user-friendly and supports sustainable data updates. This tool integrates multi-source data to assist breeding decision-making. This study aims to accelerate the identification of high-yield materials in the breeding process, enhance breeding efficiency, and enable more scientific and smart breeding decisions.

[AI-9] Research on environment perception and behavior prediction of intelligent UAV based on semantic communication

Link: https://arxiv.org/abs/2501.04480
Authors: Kechong Ren,Li Gao,Qi Guan
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:The convergence of drone delivery systems, virtual worlds, and blockchain has transformed logistics and supply chain management, providing a fast and environmentally friendly alternative to traditional ground transportation methods. To provide users with a real-world experience, virtual service providers need to collect up-to-the-minute delivery information from edge devices. To address this challenge, 1) a reinforcement learning approach is introduced to equip drones with fast training capabilities and the ability to autonomously adapt to new virtual scenarios for effective resource allocation. 2) A semantic communication framework for meta-universes is proposed, which utilizes the extraction of semantic information to reduce the communication cost and incentivize the transmission of information for meta-universe services. 3) To ensure user information security, a lightweight authentication and key agreement scheme is designed between the drone and the user by introducing blockchain technology. In our experiments, the drone adaptation performance is improved by about 35%, and the local offloading rate can reach 90% as the number of base stations increases. The semantic communication system proposed in this paper is compared with the Cross Entropy baseline model. With blockchain technology introduced, the transaction throughput remains at a stable value across different numbers of drones.

[AI-10] Hybrid Artificial Intelligence Strategies for Drone Navigation

Link: https://arxiv.org/abs/2501.04472
Authors: Rubén San-Segundo,Lucía Angulo,Manuel Gil-Martín,David Carramiñana,Ana M. Bernardos
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Objective: This paper describes the development of hybrid artificial intelligence strategies for drone navigation. Methods: The navigation module combines a deep learning model with a rule-based engine depending on the agent state. The deep learning model has been trained using reinforcement learning. The rule-based engine uses expert knowledge to deal with specific situations. The navigation module incorporates several strategies to explain the drone decision based on its observation space, and different mechanisms for including human decisions in the navigation process. Finally, this paper proposes an evaluation methodology based on defining several scenarios and analyzing the performance of the different strategies according to metrics adapted to each scenario. Results: Two main navigation problems have been studied. For the first scenario (reaching known targets), it has been possible to obtain a 90% task completion rate, reducing significantly the number of collisions thanks to the rule-based engine. For the second scenario, it has been possible to reduce 20% of the time required to locate all the targets using the reinforcement learning model. Conclusions: Reinforcement learning is a very good strategy to learn policies for drone navigation, but in critical situations, it is necessary to complement it with a rule-based module to increase task success rate.

[AI-11] Effect of Information Technology on Job Creation to Support Economic: Case Studies of Graduates in Universities (2023-2024) of the KRG of Iraq

Link: https://arxiv.org/abs/2501.04438
Authors: Azhi Kh. Bapir,Ismail Y. Maolood,Dana A Abdullah,Aso K. Ameen,Abdulhady Abas Abdullah
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:The aim of this study is to assess the impact of information technology (IT) on university graduates in terms of employment development, which will aid in economic issues. This study uses a descriptive research methodology and a quantitative approach to understand variables. The focus of this study is to ascertain how graduates of Kurdistan regional universities might use IT to secure employment and significantly contribute to the nation’s economic revival. The sample size was established by the use of judgmental sampling procedure and consisted of 314 people. The researcher prepared the questionnaire to collect data, and then SPSS statistical software, version 22, and Excel 2010 were used to modify, compile, and tabulate the results. The study’s outcome showed that information technology is incredibly inventive, has a promising future, and makes life much easier for everyone. It also proved that a deep academic understanding of information technology and its constituent parts helps graduates of Kurdistan Regional University find suitable careers. More importantly, though, anyone looking for work or a means of support will find great benefit from possessing credentials and understanding of IT. The study’s final finding was that information technology has actively advanced the country’s economy. Not only is IT helping to boost youth employment, but it is also turning into a worthwhile investment for economic growth.

[AI-12] Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions

Link: https://arxiv.org/abs/2501.04437
Authors: Doaa Mahmud,Hadeel Hajmohamed,Shamma Almentheri,Shamma Alqaydi,Lameya Aldhaheri,Ruhul Amin Khalil,Nasir Saeed
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted for publication in IEEE Transactions on Intelligent Transportation Systems

Abstract:Intelligent Transportation Systems (ITS) are crucial for the development and operation of smart cities, addressing key challenges in efficiency, productivity, and environmental sustainability. This paper comprehensively reviews the transformative potential of Large Language Models (LLMs) in optimizing ITS. Initially, we provide an extensive overview of ITS, highlighting its components, operational principles, and overall effectiveness. We then delve into the theoretical background of various LLM techniques, such as GPT, T5, CTRL, and BERT, elucidating their relevance to ITS applications. Following this, we examine the wide-ranging applications of LLMs within ITS, including traffic flow prediction, vehicle detection and classification, autonomous driving, traffic sign recognition, and pedestrian detection. Our analysis reveals how these advanced models can significantly enhance traffic management and safety. Finally, we explore the challenges and limitations LLMs face in ITS, such as data availability, computational constraints, and ethical considerations. We also present several future research directions and potential innovations to address these challenges. This paper aims to guide researchers and practitioners through the complexities and opportunities of integrating LLMs in ITS, offering a roadmap to create more efficient, sustainable, and responsive next-generation transportation systems.

[AI-13] Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions

Link: https://arxiv.org/abs/2501.04436
Authors: Na Yan,Yang Su,Yansha Deng,Robert Schober
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets, enabling task-specific adaptation while preserving data privacy. However, fine-tuning the extensive parameters in LLMs is particularly challenging in resource-constrained federated scenarios due to the significant communication and computational costs. To gain a deeper understanding of how these challenges can be addressed, this article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues: 1) FedLLMs, where clients upload model parameters or gradients to enable straightforward and effective fine-tuning; 2) KD-FedLLMs, which leverage KD for efficient knowledge sharing via logits; and 3) Split-FedLLMs, which split the LLMs into two parts, with one part executed on the client and the other one on the server, to balance the computational load. Each framework is evaluated based on key performance metrics, including model accuracy, communication overhead, and client-side computational load, offering insights into their effectiveness for various federated fine-tuning scenarios. Through this analysis, we identify framework-specific optimization opportunities to enhance the efficiency of FedLLMs and discuss broader research directions, highlighting open opportunities to better adapt FedLLMs for real-world applications. A use case is presented to demonstrate the performance comparison of these three frameworks under varying configurations and settings.
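For the first framework, where clients upload model parameters, the server-side step can be sketched as a weighted average. The abstract does not say which aggregation rule the article uses, so the sample-count weighting below (the standard FedAvg rule) is an assumption, and the function name and toy parameter vectors are hypothetical.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Assumption: plain FedAvg weighting; a real FedLLM setup would typically
    aggregate only the small set of fine-tuned parameters (for example
    adapter weights), not the full model.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 local samples respectively.
uploads = [[1.0, 2.0], [3.0, 6.0]]
sizes = [1, 3]
print(fedavg(uploads, sizes))
```

The client with more data pulls the global parameters closer to its own update, which is why communication cost in this framework scales with the number of uploaded parameters per round.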

[AI-14] A Digital Shadow for Modeling, Studying and Preventing Urban Crime

Link: https://arxiv.org/abs/2501.04435
Authors: Juan Palma-Borda,Eduardo Guzmán,María-Victoria Belmonte
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Abstract:Crime is one of the greatest threats to urban security. Around 80 percent of the world’s population lives in countries with high levels of criminality. Most of the crimes committed in the cities take place in their urban environments. This paper presents the development and validation of a digital shadow platform for modeling and simulating urban crime. This digital shadow has been constructed using data-driven agent-based modeling and simulation techniques, which are suitable for capturing dynamic interactions among individuals and with their environment. Our approach transforms and integrates well-known criminological theories and the expert knowledge of law enforcement agencies (LEA), policy makers, and other stakeholders under a theoretical model, which is in turn combined with real crime, spatial (cartographic) and socio-economic data into an urban model characterizing the daily behavior of citizens. The digital shadow has also been instantiated for the city of Malaga, for which we had over 300,000 complaints available. This instance has been calibrated with those complaints and other geographic and socio-economic information of the city. To the best of our knowledge, our digital shadow is the first for large urban areas that has been calibrated with a large dataset of real crime reports and with an accurate representation of the urban environment. The performance indicators of the model after being calibrated, in terms of the metrics widely used in predictive policing, suggest that our simulated crime generation matches the general pattern of crime in the city according to historical data. Our digital shadow platform could be an interesting tool for modeling and predicting criminal behavior in an urban environment on a daily basis and, thus, a useful tool for policy makers, criminologists, sociologists, LEAs, etc. to study and prevent urban crime.

[AI-15] Dual-Force: Enhanced Offline Diversity Maximization under Imitation Constraints

Link: https://arxiv.org/abs/2501.04426
Authors: Pavel Kolev,Marin Vlastelica,Georg Martius
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:While many algorithms for diversity maximization under imitation constraints are online in nature, many applications require offline algorithms without environment interactions. Tackling this problem in the offline setting, however, presents significant challenges that require non-trivial, multi-stage optimization processes with non-stationary rewards. In this work, we present a novel offline algorithm that enhances diversity using an objective based on Van der Waals (VdW) force and successor features, and eliminates the need to learn a previously used skill discriminator. Moreover, by conditioning the value function and policy on a pre-trained Functional Reward Encoding (FRE), our method allows for better handling of non-stationary rewards and provides zero-shot recall of all skills encountered during training, significantly expanding the set of skills learned in prior work. Consequently, our algorithm benefits from receiving a consistently strong diversity signal (VdW), and enjoys more stable and efficient training. We demonstrate the effectiveness of our method in generating diverse skills for two robotic tasks in simulation: locomotion of a quadruped and local navigation with obstacle traversal.

[AI-16] User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation

Link: https://arxiv.org/abs/2501.04410
Authors: Krisztian Balog,ChengXiang Zhai
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. User simulation has profound implications for diverse fields and plays a vital role in the pursuit of Artificial General Intelligence. This paper provides an overview of user simulation, highlighting its key applications, connections to various disciplines, and outlining future research directions to advance this increasingly important technology.

[AI-17] RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation ICASSP2025

Link: https://arxiv.org/abs/2501.04315
Authors: Jun Liu,Zhenglun Kong,Peiyan Dong,Xuan Shen,Pu Zhao,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Xue Lin,Dong Huang,Yanzhi Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICASSP 2025

Abstract:Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective method for optimizing LoRA’s scaling factor. By replacing \alpha/r with \alpha/\sqrt{r}, RoRA ensures improved performance as rank size increases. Moreover, RoRA enhances low-rank adaptation in fine-tuning uncompressed models and excels in the more challenging task of accuracy recovery when fine-tuning pruned models. Extensive experiments demonstrate the effectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA surpasses the state-of-the-art (SOTA) in average accuracy and robustness on LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and DoRA by 6.5% and 2.9% on LLaMA-7B, respectively. In pruned model fine-tuning, RoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4% pruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher than DoRA.
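The change in scaling factor is easy to see numerically. The sketch below only compares \alpha/r with \alpha/\sqrt{r} over a few ranks; the value of alpha, the chosen ranks, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / rank


def rora_scale(alpha: float, rank: int) -> float:
    """Scaling factor described in the abstract: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


alpha = 16.0  # hypothetical value, chosen only for the comparison
for rank in (4, 16, 64, 256):
    print(f"r={rank:>3}  alpha/r={lora_scale(alpha, rank):.4f}  "
          f"alpha/sqrt(r)={rora_scale(alpha, rank):.4f}")
```

Under alpha/r the multiplier on the low-rank update shrinks linearly with the rank, while under alpha/sqrt(r) it shrinks only with the square root, so higher-rank adapters keep a larger effective update.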

[AI-18] MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Link: https://arxiv.org/abs/2501.04292
Authors: Zijiang Yang,Meishu Song,Xin Jing,Haojie Zhang,Kun Qian,Bin Hu,Kota Tamada,Toru Takumi,Björn W. Schuller,Yoshiharu Yamamoto
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure and 2 tables. For MAD-UV Challenge 2025

Abstract:The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.

[AI-19] Mapping the Edge of Chaos: Fractal-Like Boundaries in The Trainability of Decoder-Only Transformer Models

Link: https://arxiv.org/abs/2501.04286
Authors: Bahman Torkamandi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:In the realm of fractal geometry, intricate structures emerge from simple iterative processes that partition parameter spaces into regions of stability and instability. Likewise, training large language models involves iteratively applying update functions, such as Adam, where even slight hyperparameter adjustments can shift the training process from convergence to divergence. Recent evidence from miniature neural networks suggests that the boundary separating these outcomes displays fractal characteristics [1]. Building on these insights, this study extends them to medium-sized, decoder-only transformer architectures by employing a more consistent convergence measure and examining the learning rate hyperparameter landscape for attention and fully connected layers. The results show that the trainability frontier is not a simple threshold; rather, it forms a self-similar yet seemingly random structure at multiple scales, with statistically consistent and repeating patterns. Within this landscape, a region of stable convergence is surrounded by a complex chaotic border, illustrating the sensitive nature of the underlying training dynamics.

[AI-20] Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning

Link: https://arxiv.org/abs/2501.04266
Authors: Lang Xu,Quentin Anthony,Jacob Hatef,Aamir Shafi,Hari Subramoni,Dhabaleswar K. (DK) Panda
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:Scaling up Large Language Model (LLM) training involves fitting a tremendous amount of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communications, given that intra-node GPU-GPU transfer generally has more bandwidth and lower latency than inter-node connections. However, as more capable infrastructure like Frontier, equipped with AMD GPUs, has emerged with impressive computing capability, there is a need to investigate the hardware topology and to develop targeted strategies to improve training efficiency. In this work, we propose a collection of communication and optimization strategies for ZeRO++ to reduce communication costs and improve memory utilization. We further propose a 3-level hierarchical partitioning specifically for the current Top-1 supercomputing cluster, Frontier, which aims at leveraging various bandwidths across layers of communications (GCD-GCD, GPU-GPU, and inter-node) to reduce communication overhead. For a 20B GPT model, we observe a 1.71x increase in TFLOPS per GPU when compared with ZeRO++ up to 384 GCDs and a scaling efficiency of 0.94 for up to 384 GCDs. To the best of our knowledge, our work is also the first effort to efficiently optimize LLM workloads on Frontier AMD GPUs.

[AI-21] KN-LIO: Geometric Kinematics and Neural Field Coupled LiDAR-Inertial Odometry

Link: https://arxiv.org/abs/2501.04263
Authors: Zhong Wang,Lele Ren,Yue Wen,Hesheng Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Recent advancements in LiDAR-Inertial Odometry (LIO) have boosted a large amount of applications. However, traditional LIO systems tend to focus more on localization rather than mapping, with maps consisting mostly of sparse geometric elements, which is not ideal for downstream tasks. Recent emerging neural field technology has great potential in dense mapping, but pure LiDAR mapping is difficult to work on high-dynamic vehicles. To mitigate this challenge, we present a new solution that tightly couples geometric kinematics with neural fields to enhance simultaneous state estimation and dense mapping capabilities. We propose both semi-coupled and tightly coupled Kinematic-Neural LIO (KN-LIO) systems that leverage online SDF decoding and iterated error-state Kalman filtering to fuse laser and inertial data. Our KN-LIO minimizes information loss and improves accuracy in state estimation, while also accommodating asynchronous multi-LiDAR inputs. Evaluations on diverse high-dynamic datasets demonstrate that our KN-LIO achieves performance on par with or superior to existing state-of-the-art solutions in pose estimation and offers improved dense mapping accuracy over pure LiDAR-based methods. The relevant code and datasets will be made available at https://**.

[AI-22] Constraints as Rewards: Reinforcement Learning for Robots without Reward Functions

Link: https://arxiv.org/abs/2501.04228
Authors: Yu Ishihara,Noriaki Takasugi,Kotaro Kawakami,Masaya Kinoshita,Kazumi Aoyama
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Reinforcement learning has become an essential algorithm for generating complex robotic behaviors. However, to learn such behaviors, it is necessary to design a reward function that describes the task, which often consists of multiple objectives that need to be balanced. This tuning process is known as reward engineering and typically involves extensive trial-and-error. In this paper, to avoid this trial-and-error process, we propose the concept of Constraints as Rewards (CaR). CaR formulates the task objective using multiple constraint functions instead of a reward function and solves a reinforcement learning problem with constraints using the Lagrangian method. By adopting this approach, different objectives are automatically balanced, because the Lagrange multipliers serve as the weights among the objectives. In addition, we demonstrate that constraints, expressed as inequalities, provide an intuitive interpretation of the optimization target designed for the task. We apply the proposed method to the standing-up motion generation task of a six-wheeled-telescopic-legged robot and demonstrate that the proposed method successfully acquires the target behavior, even though it is challenging to learn with manually designed reward functions.
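
To make the "multipliers as weights" idea concrete, here is a generic constrained-RL Lagrangian written without a reward term. This is only a sketch: the symbols $J_{c_i}$ (expected cost of constraint $i$ under policy $\pi$), the thresholds $d_i$, and the step size $\eta$ are assumed notation, and the update rule is the textbook dual-ascent form, which may differ from the paper's actual algorithm.

```latex
% Generic constrained-RL Lagrangian with no reward term (assumed notation, not the paper's)
\begin{aligned}
&\text{find } \pi \quad \text{s.t.} \quad J_{c_i}(\pi) \le d_i, \qquad i = 1, \dots, m, \\
&\mathcal{L}(\pi, \lambda) = \sum_{i=1}^{m} \lambda_i \bigl( J_{c_i}(\pi) - d_i \bigr), \qquad \lambda_i \ge 0, \\
&\pi \leftarrow \arg\min_{\pi} \mathcal{L}(\pi, \lambda), \qquad
 \lambda_i \leftarrow \max\bigl( 0,\; \lambda_i + \eta \,( J_{c_i}(\pi) - d_i ) \bigr).
\end{aligned}
```

Under this form, a multiplier $\lambda_i$ grows while its constraint is violated and shrinks toward zero once the constraint is satisfied, so the relative weighting of the objectives adjusts itself during training instead of being tuned by hand.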

[AI-23] CURing Large Models: Compression via CUR Decomposition

Link: https://arxiv.org/abs/2501.04211
Authors: Sanghyeon Park, Soo-Mook Moon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Large deep learning models have achieved remarkable success but are resource-intensive, posing challenges in computational cost and memory usage. We introduce CURing, a novel model compression method based on CUR matrix decomposition, which approximates weight matrices as the product of selected columns (C) and rows (R), and a small linking matrix (U). We apply this decomposition to weights chosen based on the combined influence of their magnitudes and activations. By identifying and retaining informative rows and columns, CURing significantly reduces model size with minimal performance loss. It preserves the original network’s input/output structures, retains important features such as non-negativity, and yields activation patterns in the compressed model that align with the original, thereby enhancing interpretability.
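
The CUR factorization named in the abstract can be illustrated with a small NumPy sketch. It is not the CURing method itself: selecting rows and columns by Euclidean norm is an assumption made here for simplicity (the abstract says the paper combines weight magnitudes with activations), and `cur_decompose` is a hypothetical helper name.

```python
import numpy as np

def cur_decompose(W, k):
    """Approximate W (m x n) as C @ U @ R using k actual columns and k actual rows of W."""
    # Assumption: score columns/rows by Euclidean norm. CURing instead scores
    # weights by magnitude combined with activations, which is not reproduced here.
    col_idx = np.sort(np.argsort(np.linalg.norm(W, axis=0))[-k:])
    row_idx = np.sort(np.argsort(np.linalg.norm(W, axis=1))[-k:])
    C = W[:, col_idx]          # m x k, selected columns of W
    R = W[row_idx, :]          # k x n, selected rows of W
    # Linking matrix that minimizes ||W - C U R||_F for the chosen C and R.
    U = np.linalg.pinv(C) @ W @ np.linalg.pinv(R)
    return C, U, R, col_idx, row_idx

rng = np.random.default_rng(0)
# Toy rank-3 "weight" matrix, so three columns and three rows can reproduce it.
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))
C, U, R, cols, rows = cur_decompose(W, k=3)
rel_err = np.linalg.norm(W - C @ U @ R) / np.linalg.norm(W)
print(C.shape, U.shape, R.shape, rel_err)
```

Because C and R are literal slices of W, properties such as non-negativity of the retained entries carry over unchanged, which is the structural point the abstract makes; only the small matrix U is newly computed.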

[AI-24] GNN-based Decentralized Perception in Multirobot Systems for Predicting Worker Actions

Link: https://arxiv.org/abs/2501.04193
Authors: Ali Imran, Giovanni Beltrame, David St-Onge
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted to RA-L

Abstract:In industrial environments, predicting human actions is essential for ensuring safe and effective collaboration between humans and robots. This paper introduces a perception framework that enables mobile robots to understand and share information about human actions in a decentralized way. The framework first allows each robot to build a spatial graph representing its surroundings, which it then shares with other robots. This shared spatial data is combined with temporal information to track human behavior over time. A swarm-inspired decision-making process is used to ensure all robots agree on a unified interpretation of the human’s actions. Results show that adding more robots and incorporating longer time sequences improve prediction accuracy. Additionally, the consensus mechanism increases system resilience, making the multi-robot setup more reliable in dynamic industrial settings.

[AI-25] Fixed Points of Deep Neural Networks: Emergence, Stability, and Applications

Link: https://arxiv.org/abs/2501.04182
Authors: L. Berlyand, V. Slavin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: 21 pages, 7 figures

Abstract: We present numerical and analytical results on the formation and stability of a family of fixed points of deep neural networks (DNNs). Such fixed points appear in a class of DNNs when the dimensions of the input and output vectors are the same. We demonstrate applications of such networks in supervised, semi-supervised and unsupervised learning, such as encoding/decoding of images and restoration of damaged images, among others. First, we show that for untrained DNNs with weights and biases initialized by normally distributed random variables, only one fixed point exists. This result holds for DNNs with any depth (number of layers) $L$, any layer width $N$, and sigmoid-type activation functions. Second, it has been shown that for a DNN whose parameters (weights and biases) are initialized by a "light-tailed" distribution (e.g. normal distribution), the distribution of these parameters becomes "heavy-tailed" after training. This motivates our study of DNNs with "heavy-tailed" initialization. For such DNNs we show numerically that training leads to the emergence of $Q(N,L)$ fixed points, where $Q(N,L)$ is a positive integer which depends on the number of layers $L$ and layer width $N$. We further observe numerically that for fixed $N = N_0$ the function $Q(N_0, L)$ is non-monotone, that is, it initially grows as $L$ increases and then decreases to 1. This non-monotone behavior of $Q(N_0, L)$ is also obtained by analytical derivation of an equation for the Empirical Spectral Distribution (ESD) of the input-output Jacobian, followed by numerical solution of this equation.
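
The notion of a fixed point of a network with equal input and output dimension can be explored with a short NumPy sketch. Everything specific here is an assumption for illustration: the logistic sigmoid, the 1/sqrt(N) weight scaling, the sizes N = 16 and L = 3, and the helper names `make_dnn` and `find_fixed_point`. Plain iteration of x <- f(x) only locates attracting fixed points, so it illustrates the concept rather than the paper's analysis.

```python
import numpy as np

def make_dnn(N, L, rng):
    """Untrained N -> N network with L layers and normally distributed weights and biases."""
    # Assumption: logistic sigmoid activation and 1/sqrt(N) weight scaling.
    weights = [rng.standard_normal((N, N)) / np.sqrt(N) for _ in range(L)]
    biases = [rng.standard_normal(N) for _ in range(L)]

    def f(x):
        for W, b in zip(weights, biases):
            x = 1.0 / (1.0 + np.exp(-(W @ x + b)))
        return x

    return f

def find_fixed_point(f, x0, iters=200):
    """Iterate x <- f(x); return the final iterate and the residual ||f(x) - x||."""
    x = x0
    for _ in range(iters):
        x = f(x)
    return x, np.linalg.norm(f(x) - x)

rng = np.random.default_rng(0)
f = make_dnn(N=16, L=3, rng=rng)
# Two unrelated starting points; if the fixed point is unique and attracting,
# both runs should end at the same vector.
x_a, res_a = find_fixed_point(f, rng.standard_normal(16))
x_b, res_b = find_fixed_point(f, rng.standard_normal(16))
print(res_a, res_b, np.max(np.abs(x_a - x_b)))
```

In this toy setting both runs settle on the same point, which is consistent with the single fixed point the abstract reports for untrained, normally initialized networks.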

[AI-26] HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version)

Link: https://arxiv.org/abs/2501.04180
Authors: Philipp D. Siedler
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

Abstract:Games have been vital test beds for the rapid development of Agent-based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real-world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real-world domains include multi-agent scenarios and require machine-machine and human-machine collaboration. Open-source environments have not advanced and are often toy scenarios, too abstract or not suitable for multi-agent research. By mimicking real-world problems and increasing the complexity of environments, we hope to advance state-of-the-art multi-agent research and inspire researchers to work on immediate real-world problems. Here, we present HIVEX, an environment suite to benchmark multi-agent research focusing on ecological challenges. HIVEX includes the following environments: Wind Farm Control, Wildfire Resource Management, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial Wildfire Suppression. We provide environments, training examples, and baselines for the main and sub-tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face. We also provide a leaderboard on Hugging Face and encourage the community to submit models trained on our environment suite.

[AI-27] Learning to Transfer Human Hand Skills for Robot Manipulations

Link: https://arxiv.org/abs/2501.04169
Authors: Sungjae Park, Seungho Lee, Mingi Choi, Jiye Lee, Jeonghwan Kim, Jisoo Kim, Hanbyul Joo
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint. Under Review

Abstract:We present a method for teaching dexterous manipulation tasks to robots from human hand motion demonstrations. Unlike existing approaches that solely rely on kinematics information without taking into account the plausibility of robot and object interaction, our method directly infers plausible robot manipulation actions from human motion demonstrations. To address the embodiment gap between the human hand and the robot system, our approach learns a joint motion manifold that maps human hand movements, robot hand actions, and object movements in 3D, enabling us to infer one motion component from others. Our key idea is the generation of pseudo-supervision triplets, which pair human, object, and robot motion trajectories synthetically. Through real-world experiments with robot hand manipulation, we demonstrate that our data-driven retargeting method significantly outperforms conventional retargeting techniques, effectively bridging the embodiment gap between human and robotic hands. Website at this https URL.

[AI-28] BiasGuard: Guardrailing Fairness in Machine Learning Production Systems

Link: https://arxiv.org/abs/2501.04142
Authors: Nurit Cohen-Inger, Seffi Cohen, Neomi Rabaev, Lior Rokach, Bracha Shapira
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:As machine learning (ML) systems increasingly impact critical sectors such as hiring, financial risk assessments, and criminal justice, the imperative to ensure fairness has intensified due to potential negative implications. While much ML fairness research has focused on enhancing training data and processes, addressing the outputs of already deployed systems has received less attention. This paper introduces ‘BiasGuard’, a novel approach designed to act as a fairness guardrail in production ML systems. BiasGuard leverages Test-Time Augmentation (TTA) powered by Conditional Generative Adversarial Network (CTGAN), a cutting-edge generative AI model, to synthesize data samples conditioned on inverted protected attribute values, thereby promoting equitable outcomes across diverse groups. This method aims to provide equal opportunities for both privileged and unprivileged groups while significantly enhancing the fairness metrics of deployed systems without the need for retraining. Our comprehensive experimental analysis across diverse datasets reveals that BiasGuard enhances fairness by 31% while only reducing accuracy by 0.09% compared to non-mitigated benchmarks. Additionally, BiasGuard outperforms existing post-processing methods in improving fairness, positioning it as an effective tool to safeguard against biases when retraining the model is impractical.

[AI-29] Implementing Systemic Thinking for Automatic Schema Matching: An Agent-Based Modeling Approach

Link: https://arxiv.org/abs/2501.04136
Authors: Hicham Assoudi, Hakim Lounis
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: COGNITIVE 2018: The Tenth International Conference on Advanced Cognitive Technologies and Applications

Abstract:Several approaches are proposed to deal with the problem of the Automatic Schema Matching (ASM). The challenges and difficulties caused by the complexity and uncertainty characterizing both the process and the outcome of Schema Matching motivated us to investigate how bio-inspired emerging paradigm can help with understanding, managing, and ultimately overcoming those challenges. In this paper, we explain how we approached Automatic Schema Matching as a systemic and Complex Adaptive System (CAS) and how we modeled it using the approach of Agent-Based Modeling and Simulation (ABMS). This effort gives birth to a tool (prototype) for schema matching called Reflex-SMAS. A set of experiments demonstrates the viability of our approach on two main aspects: (i) effectiveness (increasing the quality of the found matchings) and (ii) efficiency (reducing the effort required for this efficiency). Our approach represents a significant paradigm-shift, in the field of Automatic Schema Matching.

[AI-30] TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning AAAI2025

Link: https://arxiv.org/abs/2501.04108
Authors: Yupei Liu, Yanting Wang, Jinyuan Jia
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in AAAI 2025

Abstract:An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build downstream classifiers for various downstream tasks. However, many studies showed that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built based on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If not, the test input is processed in a normal way to maintain the utility. Otherwise, the test input will be further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) from a given test input and recover it under state-of-the-art trojan attacks. We further demonstrate by experiments that our TrojanDec outperforms the state-of-the-art defenses.

[AI-31] Enhancing Distribution and Label Consistency for Graph Out-of-Distribution Generalization ICDM2024

Link: https://arxiv.org/abs/2501.04102
Authors: Song Wang, Xiaodong Yang, Rashidul Islam, Huiyuan Chen, Minghua Xu, Jundong Li, Yiwei Cai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDM 2024

Abstract:To deal with distribution shifts in graph data, various graph out-of-distribution (OOD) generalization techniques have been recently proposed. These methods often employ a two-step strategy that first creates augmented environments and subsequently identifies invariant subgraphs to improve generalizability. Nevertheless, this approach could be suboptimal from the perspective of consistency. First, the process of augmenting environments by altering the graphs while preserving labels may lead to graphs that are not realistic or meaningfully related to the origin distribution, thus lacking distribution consistency. Second, the extracted subgraphs are obtained from directly modifying graphs, and may not necessarily maintain a consistent predictive relationship with their labels, thereby impacting label consistency. In response to these challenges, we introduce an innovative approach that aims to enhance these two types of consistency for graph OOD generalization. We propose a modifier to obtain both augmented and invariant graphs in a unified manner. With the augmented graphs, we enrich the training data without compromising the integrity of label-graph relationships. The label consistency enhancement in our framework further preserves the supervision information in the invariant graph. We conduct extensive experiments on real-world datasets to demonstrate the superiority of our framework over other state-of-the-art baselines.

[AI-32] Multi-armed Bandit and Backbone boost Lin-Kernighan-Helsgaun Algorithm for the Traveling Salesman Problems

Link: https://arxiv.org/abs/2501.04072
Authors: Long Wang, Jiongzhi Zheng, Zhengda Xiong, Kun He
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)

Abstract: The Lin-Kernighan-Helsgaun (LKH) heuristic is a classic local search algorithm for the Traveling Salesman Problem (TSP). LKH introduces an $\alpha$-value to replace the traditional distance metric for evaluating edge quality, which leads to a significant improvement. However, we observe that the $\alpha$-value does not make full use of the historical information gathered during the search, and relying on a single source of guiding information often makes it hard for LKH to escape from some local optima. To address these issues, we propose a novel way to extract backbone information during the TSP local search process, which is dynamic and can be updated once a local optimal solution is found. We further propose to combine backbone information, $\alpha$-value, and distance to evaluate edge quality so as to guide the search. Moreover, we abstract their different combinations as arms in a multi-armed bandit (MAB) and use an MAB model to help the algorithm select an appropriate evaluation metric dynamically. Both the backbone information and the MAB can provide diverse guiding information and learn from the search history to suggest the best metric. We apply our methods to LKH and LKH-3, an extended version of LKH that can be used to solve about 40 variant problems of the TSP and the Vehicle Routing Problem (VRP). Extensive experiments show the excellent performance and generalization capability of our proposed method, significantly improving LKH for TSP and LKH-3 for two representative TSP and VRP variants, the Colored TSP (CTSP) and Capacitated VRP with Time Windows (CVRPTW).
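
The idea of treating candidate evaluation metrics as bandit arms can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: UCB1 is swapped in as a standard bandit rule, the arm names are hypothetical, and the fixed `toy_reward` values stand in for feedback that a real solver would derive from the search (for example, whether a metric led to an improved tour).

```python
import math

class UCB1:
    """UCB1 bandit: each arm stands for one candidate edge-evaluation metric."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm

    def select(self):
        # Try every arm once, then pick the largest upper confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        ucb = [v + math.sqrt(2.0 * math.log(total) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Hypothetical arms and rewards, chosen only to exercise the selection rule.
arms = ["alpha", "distance", "backbone+alpha"]
toy_reward = [0.1, 0.5, 0.9]

bandit = UCB1(len(arms))
for _ in range(300):
    arm = bandit.select()
    bandit.update(arm, toy_reward[arm])
best = arms[max(range(len(arms)), key=bandit.counts.__getitem__)]
print(bandit.counts, best)
```

The confidence term keeps weaker arms from being abandoned entirely, which mirrors the abstract's point that no single metric should guide the whole search.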

[AI-33] Explainable Reinforcement Learning for Formula One Race Strategy

Link: https://arxiv.org/abs/2501.04068
Authors: Devin Thomas, Junqi Jiang, Avinash Kori, Aaron Russo, Steffen Winkler, Stuart Sale, Joseph McMillan, Francesco Belardinelli, Antonio Rago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures. Copyright ACM 2025. This is the authors’ version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in SAC 2025, this http URL

Abstract:In Formula One, teams compete to develop their cars and achieve the highest possible finishing position in each race. During a race, however, teams are unable to alter the car, so they must improve their cars’ finishing positions via race strategy, i.e. optimising their selection of which tyre compounds to put on the car and when to do so. In this work, we introduce a reinforcement learning model, RSRL (Race Strategy Reinforcement Learning), to control race strategies in simulations, offering a faster alternative to the industry standard of hard-coded and Monte Carlo-based race strategies. Controlling cars with a pace equating to an expected finishing position of P5.5 (where P1 represents first place and P20 is last place), RSRL achieves an average finishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix, outperforming the best baseline of P5.63. We then demonstrate, in a generalisability study, how performance for one track or multiple tracks can be prioritised via training. Further, we supplement model predictions with feature importance, decision tree-based surrogate models, and decision tree counterfactuals towards improving user trust in the model. Finally, we provide illustrations which exemplify our approach in real-world situations, drawing parallels between simulations and reality.

[AI-34] Explainable Time Series Prediction of Tyre Energy in Formula One Race Strategy

Link: https://arxiv.org/abs/2501.04067
Authors: Jamie Todd, Junqi Jiang, Aaron Russo, Steffen Winkler, Stuart Sale, Joseph McMillan, Antonio Rago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures. Copyright ACM 2025. This is the authors’ version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in SAC 2025, this http URL

Abstract:Formula One (F1) race strategy takes place in a high-pressure and fast-paced environment where split-second decisions can drastically affect race results. Two of the core decisions of race strategy are when to make pit stops (i.e. replace the cars’ tyres) and which tyre compounds (hard, medium or soft, in normal conditions) to select. The optimal pit stop decisions can be determined by estimating the tyre degradation of these compounds, which in turn can be computed from the energy applied to each tyre, i.e. the tyre energy. In this work, we trained deep learning models, using the Mercedes-AMG PETRONAS F1 team’s historic race data consisting of telemetry, to forecast tyre energies during races. Additionally, we fitted XGBoost, a decision tree-based machine learning algorithm, to the same dataset and compared the results, with both giving impressive performance. Furthermore, we incorporated two different explainable AI methods, namely feature importance and counterfactual explanations, to gain insights into the reasoning behind the forecasts. Our contributions thus result in an explainable, automated method which could assist F1 teams in optimising their race strategy.

[AI-35] ChronoLLM: A Framework for Customizing Large Language Model for Digital Twins generalization based on PyChrono

Link: https://arxiv.org/abs/2501.04062
Authors: Jingquan Wang, Harry Zhang, Khailanii Slaton, Shu Wang, Radu Serban, Jinlong Wu, Dan Negrut
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Recently, the integration of advanced simulation technologies with artificial intelligence (AI) is revolutionizing science and engineering research. ChronoLlama introduces a novel framework that customizes the open-source LLMs, specifically for code generation, paired with PyChrono for multi-physics simulations. This integration aims to automate and improve the creation of simulation scripts, thus enhancing model accuracy and efficiency. This combination harnesses the speed of AI-driven code generation with the reliability of physics-based simulations, providing a powerful tool for researchers and engineers. Empirical results indicate substantial enhancements in simulation setup speed, accuracy of the generated codes, and overall computational efficiency. ChronoLlama not only expedites the development and testing of multibody systems but also spearheads a scalable, AI-enhanced approach to managing intricate mechanical simulations. This pioneering integration of cutting-edge AI with traditional simulation platforms represents a significant leap forward in automating and optimizing design processes in engineering applications.

[AI-36] Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

Link: https://arxiv.org/abs/2501.04038
Authors: Rui Liu, Hongyu Yuan, Haizhou Li
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract: Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to simultaneously understand audio and visual signals, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals and obtain the N-best hypotheses, and then use a Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into audio and video compression representations, respectively, that can be understood by the LLM. Afterward, the audio-visual compression representations and the N-best hypotheses together constitute a Cross-modal Prompt to guide the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits level, utterance level and representations level, to improve the correction accuracy while enhancing the interpretability of the audio and visual compression representations. The experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems. The proposed AVGER can reduce the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: this https URL.

[AI-37] AICat: An AI Cataloguing Approach to Support the EU AI Act

Link: https://arxiv.org/abs/2501.04014
Authors: Delaram Golpayegani, Harshvardhan J. Pandit, Dave Lewis
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Presented at 37th International Conference on Legal Knowledge and Information Systems (JURIX) 2024

Abstract:The European Union’s Artificial Intelligence Act (AI Act) requires providers and deployers of high-risk AI applications to register their systems into the EU database, wherein the information should be represented and maintained in an easily-navigable and machine-readable manner. Given the uptake of open data and Semantic Web-based approaches for other EU repositories, in particular the use of the Data Catalogue vocabulary Application Profile (DCAT-AP), a similar solution for managing the EU database of high-risk AI systems is needed. This paper introduces AICat - an extension of DCAT for representing catalogues of AI systems that provides consistency, machine-readability, searchability, and interoperability in managing open metadata regarding AI systems. This open approach to cataloguing ensures transparency, traceability, and accountability in AI application markets beyond the immediate needs of high-risk AI compliance in the EU. AICat is available online at this https URL under the CC-BY-4.0 license.

[AI-38] A Generative AI-driven Metadata Modelling Approach

Link: https://arxiv.org/abs/2501.04008
Authors: Mayukh Bagchi
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted for publication @ Special Issue on “Generative AI and Libraries” - Library Trends Journal, Johns Hopkins University Press, Maryland, USA

Abstract: For decades, the modelling of metadata has been core to the functioning of any academic library. Its importance has only grown with the increasing pervasiveness of Generative Artificial Intelligence (AI)-driven information activities and services which constitute a library’s outreach. However, with the rising importance of metadata, several outstanding problems have arisen in the process of designing a library metadata model, impacting its reusability, crosswalk and interoperability with other metadata models. This paper posits that the above problems stem from an underlying thesis that there should only be a few core metadata models which would be necessary and sufficient for any information service using them, irrespective of the heterogeneity of intra-domain or inter-domain settings. To that end, this paper advances a contrary view of the above thesis and substantiates its argument in three key steps. First, it introduces a novel way of thinking about a library metadata model as an ontology-driven composition of five functionally interlinked representation levels from perception to its intensional definition via properties. Second, it introduces the representational manifoldness implicit in each of the five levels which cumulatively contributes to a conceptually entangled library metadata model. Finally, and most importantly, it proposes a Generative AI-driven Human-Large Language Model (LLM) collaboration based metadata modelling approach to disentangle the entanglement inherent in each representation level leading to the generation of a conceptually disentangled metadata model. Throughout the paper, the arguments are exemplified by motivating scenarios and examples from representative libraries handling cancer information.

[AI-39] DispFormer: Pretrained Transformer for Flexible Dispersion Curve Inversion from Global Synthesis to Regional Applications

Link: https://arxiv.org/abs/2501.04366
Authors: Feng Liu, Bao Deng, Rui Su, Lei Bai, Wanli Ouyang
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
Comments: 11 pages, 11 figures, related codes and data are available at this https URL

Abstract: Surface wave dispersion curve inversion is essential for estimating subsurface shear-wave velocity ($v_s$), yet traditional methods often struggle to balance computational efficiency with inversion accuracy. While deep learning approaches show promise, previous studies typically require large amounts of labeled data and struggle with real-world datasets that have varying period ranges, missing data, and low signal-to-noise ratios. This study proposes DispFormer, a transformer-based neural network for inverting the $v_s$ profile from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data at each period independently, thereby allowing it to handle data of varying lengths without requiring network modifications or alignment between training and testing data. The performance is demonstrated by pre-training it on a global synthetic dataset and testing it on two regional synthetic datasets using zero-shot and few-shot strategies. Results indicate that zero-shot DispFormer, even without any labeled data, produces inversion profiles that match well with the ground truth, providing a deployable initial model generator to assist traditional methods. When labeled data is available, few-shot DispFormer outperforms traditional methods with only a small number of labels. Furthermore, real-world tests indicate that DispFormer effectively handles varying length data, and yields lower data residuals than reference models. These findings demonstrate that DispFormer provides a robust foundation model for dispersion curve inversion and is a promising approach for broader applications.

[AI-40] Integrated Offline and Online Learning to Solve a Large Class of Scheduling Problems

Link: https://arxiv.org/abs/2501.04253
Authors: Anbang Liu, Zhi-Long Chen, Jinyang Jiang, Xi Chen
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In this paper, we develop a unified machine learning (ML) approach to predict high-quality solutions for single-machine scheduling problems with a non-decreasing min-sum objective function with or without release times. Our ML approach is novel in three major aspects. First, our approach is developed for the entire class of the aforementioned problems. To achieve this, we exploit the fact that the entire class of the problems considered can be formulated as a time-indexed formulation in a unified manner. We develop a deep neural network (DNN) which uses the cost parameters in the time-indexed formulation as the inputs to effectively predict a continuous solution to this formulation, based on which a feasible discrete solution is easily constructed. The second novel aspect of our approach lies in how the DNN model is trained. In view of the NP-hard nature of the problems, labels (i.e., optimal solutions) are hard to generate for training. To overcome this difficulty, we generate and utilize a set of special instances, for which optimal solutions can be found with little computational effort, to train the ML model offline. The third novel idea we employ in our approach is that we develop an online single-instance learning approach to fine tune the parameters in the DNN for a given online instance, with the goal of generating an improved solution for the given instance. To this end, we develop a feasibility surrogate that approximates the objective value of a given instance as a continuous function of the outputs of the DNN, which then enables us to derive gradients and update the learnable parameters in the DNN. Numerical results show that our approach can efficiently generate high-quality solutions for a variety of single-machine scheduling min-sum problems with up to 1000 jobs.
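
For readers unfamiliar with the term, a standard time-indexed formulation of single-machine scheduling looks roughly like the following. This is the textbook form and is offered only as a sketch: the paper's exact formulation may differ, and the symbols are assumed notation ($x_{jt} = 1$ if job $j$ starts at time $t$, $p_j$ is its processing time, $T$ is the planning horizon, and $c_{jt}$ is the cost incurred when job $j$ starts at $t$).

```latex
\begin{aligned}
\min \quad & \sum_{j=1}^{n} \sum_{t=0}^{T-p_j} c_{jt}\, x_{jt} \\
\text{s.t.} \quad & \sum_{t=0}^{T-p_j} x_{jt} = 1, && j = 1, \dots, n, \\
& \sum_{j=1}^{n} \sum_{s=\max(0,\, t-p_j+1)}^{\min(t,\, T-p_j)} x_{js} \le 1, && t = 0, \dots, T-1, \\
& x_{jt} \in \{0, 1\}.
\end{aligned}
```

The first constraint starts every job exactly once and the second allows at most one job to occupy the machine in each time slot. Different min-sum objectives change only the coefficients $c_{jt}$, which is why a single model that takes these cost parameters as input can cover the whole problem class; release times $r_j$ can be handled by fixing $x_{jt} = 0$ for $t < r_j$.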

[AI-41] Traits of a Leader: User Influence Level Prediction through Sociolinguistic Modeling

Link: https://arxiv.org/abs/2501.04046
Authors: Denys Katerenchuk,Rivka Levitan
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:Recognition of a user’s influence level has attracted much attention as human interactions move online. Influential users have the ability to sway others’ opinions to achieve some goals. As a result, predicting users’ level of influence can help to understand social networks, forecast trends, prevent misinformation, etc. However, predicting user influence is a challenging problem because the concept of influence is specific to a situation or a domain, and user communications are limited to text. In this work, we define user influence level as a function of community endorsement and develop a model that significantly outperforms the baseline by leveraging demographic and personality data. This approach consistently improves RankDCG scores across eight different domains.

Machine Learning

[LG-0] Comparative Analysis of Quantum and Classical Support Vector Classifiers for Software Bug Prediction: An Exploratory Study

Link: https://arxiv.org/abs/2501.04690
Authors: Md Nadim,Mohammad Hassan,Ashis Kumar Mandal,Chanchal K. Roy,Banani Roy,Kevin A. Schneider
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments: Accepted for publication in the Springer journal Quantum Machine Intelligence

Click to view abstract

Abstract:Purpose: Quantum computing promises to transform problem-solving across various domains with rapid and practical solutions. Within Software Evolution and Maintenance, Quantum Machine Learning (QML) remains mostly an underexplored domain, particularly in addressing challenges such as detecting buggy software commits from code repositories. Methods: In this study, we investigate the practical application of Quantum Support Vector Classifiers (QSVC) for detecting buggy software commits across 14 open-source software projects with diverse dataset sizes encompassing 30,924 data instances. We compare the QML algorithm PQSVC (Pegasos QSVC) and QSVC against the classical Support Vector Classifier (SVC). Our technique addresses large datasets in QSVC algorithms by dividing them into smaller subsets. We propose and evaluate an aggregation method to combine predictions from these models to detect the entire test dataset. We also introduce an incremental testing methodology to overcome the difficulties of quantum feature mapping during the testing approach. Results: The study shows the effectiveness of QSVC and PQSVC in detecting buggy software commits. The aggregation technique successfully combines predictions from smaller data subsets, enhancing the overall detection accuracy for the entire test dataset. The incremental testing methodology effectively manages the challenges associated with quantum feature mapping during the testing process. Conclusion: We contribute to the advancement of QML algorithms in defect prediction, unveiling the potential for further research in this domain. The specific scenario of the Short-Term Activity Frame (STAF) highlights the early detection of buggy software commits during the initial developmental phases of software systems, particularly when dataset sizes remain insufficient to train machine learning models.

[LG-1] A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI

Link: https://arxiv.org/abs/2501.04641
Authors: Kazusato Oko,Licong Lin,Yuhang Cai,Song Mei
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments: 108 pages

Click to view abstract

Abstract:Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.

[LG-2] A Semantic Partitioning Method for Large-Scale Training of Knowledge Graph Embeddings WWW’23

Link: https://arxiv.org/abs/2501.04613
Authors: Yuhe Bai
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: Accepted at WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

Click to view abstract

Abstract:In recent years, knowledge graph embeddings have achieved great success. Many methods have been proposed and achieved state-of-the-art results in various tasks. However, most of the current methods present one or more of the following problems: (i) They only consider fact triplets, while ignoring the ontology information of knowledge graphs. (ii) The obtained embeddings do not contain much semantic information. Therefore, using these embeddings for semantic tasks is problematic. (iii) They do not enable large-scale training. In this paper, we propose a new algorithm that incorporates the ontology of knowledge graphs and partitions the knowledge graph based on classes to include more semantic information for parallel training of large-scale knowledge graph embeddings. Our preliminary results show that our algorithm performs well on several popular benchmarks.

[LG-3] Resilient Peer-to-peer Learning based on Adaptive Aggregation

Link: https://arxiv.org/abs/2501.04610
Authors: Chandreyee Bhowmick,Xenofon Koutsoukos
Categories: Machine Learning (cs.LG)
*Comments: 11 pages

Click to view abstract

Abstract:Collaborative learning in peer-to-peer networks offers the benefits of distributed learning while mitigating the risks associated with single points of failure inherent in centralized servers. However, adversarial workers pose potential threats by attempting to inject malicious information into the network. Thus, ensuring the resilience of peer-to-peer learning emerges as a pivotal research objective. The challenge is exacerbated in the presence of non-convex loss functions and non-iid data distributions. This paper introduces a resilient aggregation technique tailored for such scenarios, aimed at fostering similarity among peers’ learning processes. The aggregation weights are determined through an optimization procedure, and use the loss function computed using the neighbor’s models and individual private data, thereby addressing concerns regarding data privacy in distributed machine learning. Theoretical analysis demonstrates convergence of parameters with non-convex loss functions and non-iid data distributions. Empirical evaluations across three distinct machine learning tasks support the claims. The empirical findings, which encompass a range of diverse attack models, also demonstrate improved accuracy when compared to existing methodologies.

[LG-4] Regret Analysis: a control perspective

Link: https://arxiv.org/abs/2501.04572
Authors: Travis E. Gibson,Sawal Acharya
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments: 10 pages, no figures

Click to view abstract

Abstract:Online learning and model reference adaptive control have many interesting intersections. One area where they differ however is in how the algorithms are analyzed and what objective or metric is used to discriminate “good” algorithms from “bad” algorithms. In adaptive control there are usually two objectives: 1) prove that all time varying parameters/states of the system are bounded, and 2) that the instantaneous error between the adaptively controlled system and a reference system converges to zero over time (or at least a compact set). For online learning the performance of algorithms is often characterized by the regret the algorithm incurs. Regret is defined as the cumulative loss (cost) over time from the online algorithm minus the cumulative loss (cost) of the single optimal fixed parameter choice in hindsight. Another significant difference between the two areas of research is with regard to the assumptions made in order to obtain said results. Adaptive control makes assumptions about the input-output properties of the control problem and derives solutions for a fixed error model or optimization task. In the online learning literature results are derived for classes of loss functions (i.e. convex) while a priori assuming that all time varying parameters are bounded, which for many optimization tasks is not unrealistic, but is a non starter in control applications. In this work we discuss these differences in detail through the regret based analysis of gradient descent for convex functions and the control based analysis of a streaming regression problem. We close with a discussion about the newly defined paradigm of online adaptive control and ask the following question “Are regret optimal control strategies deployable?”
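
The regret definition quoted in the abstract can be made concrete with a small sketch. Everything below is illustrative and not taken from the paper: the quadratic losses, the three-round horizon, and the finite `candidates` grid that stands in for the full parameter set are all assumptions.

```python
from typing import Callable, Sequence

Loss = Callable[[float], float]


def cumulative_loss(losses: Sequence[Loss], params: Sequence[float]) -> float:
    """Total loss when params[t] is played against losses[t]."""
    return sum(loss(x) for loss, x in zip(losses, params))


def regret(losses: Sequence[Loss],
           online_params: Sequence[float],
           candidates: Sequence[float]) -> float:
    """Online cumulative loss minus the best fixed candidate in hindsight.

    Assumption: the best fixed parameter is searched over a finite grid,
    whereas the formal definition minimises over the whole parameter set.
    """
    online = cumulative_loss(losses, online_params)
    best_fixed = min(
        cumulative_loss(losses, [c] * len(losses)) for c in candidates
    )
    return online - best_fixed


# Hypothetical example: quadratic losses f_t(x) = (x - z_t)^2.
targets = [0.0, 1.0, 2.0]
losses = [lambda x, z=z: (x - z) ** 2 for z in targets]
online_params = [0.0, 0.0, 0.0]   # a learner that never moves
candidates = [0.0, 1.0, 2.0]      # grid of fixed comparators

example_regret = regret(losses, online_params, candidates)
print(example_regret)
```

Note that the comparator is a single fixed parameter chosen with full knowledge of all losses, which is why boundedness of the parameters is usually assumed up front in this style of analysis.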

[LG-5] Large-Scale Spectral Graph Neural Networks via Laplacian Sparsification: Technical Report

Link: https://arxiv.org/abs/2501.04570
Authors: Haipeng Ding,Zhewei Wei,Yuhang Ye
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) play a pivotal role in graph-based tasks for their proficiency in representation learning. Among the various GNN methods, spectral GNNs employing polynomial filters have shown promising performance on tasks involving both homophilous and heterophilous graph structures. However, the scalability of spectral GNNs on large graphs is limited because they learn the polynomial coefficients through multiple forward propagation executions during forward propagation. Existing works have attempted to scale up spectral GNNs by eliminating the linear layers on the input node features, a change that can disrupt end-to-end training, potentially impact performance, and become impractical with high-dimensional input features. To address the above challenges, we propose “Spectral Graph Neural Networks with Laplacian Sparsification (SGNN-LS)”, a novel graph spectral sparsification method to approximate the propagation patterns of spectral GNNs. We prove that our proposed method generates Laplacian sparsifiers that can approximate both fixed and learnable polynomial filters with theoretical guarantees. Our method allows the application of linear layers on the input node features, enabling end-to-end training as well as the handling of raw text features. We conduct an extensive experimental analysis on datasets spanning various graph scales and properties to demonstrate the superior efficiency and effectiveness of our method. The results show that our method yields superior results in comparison with the corresponding approximated base models, especially on the datasets Ogbn-papers100M (111M nodes, 1.6B edges) and MAG-scholar-C (2.8M features).

[LG-6] Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses

Link: https://arxiv.org/abs/2501.04547
Authors: Ramtin Zargari Marandi,Anne Svane Frahm,Jens Lundgren,Daniel Dawson Murray,Maja Milojevic
Categories: Machine Learning (cs.LG)
*Comments: 14 pages, 2 figures, 1 table

Click to view abstract

Abstract:While machine learning offers diverse techniques suitable for exploring various medical research questions, a cohesive synergistic framework can facilitate the integration and understanding of new approaches within unified model development and interpretation. We therefore introduce the Medical Artificial Intelligence Toolbox (MAIT), an explainable, open-source Python pipeline for developing and evaluating binary classification, regression, and survival models on tabular datasets. MAIT addresses key challenges (e.g., high dimensionality, class imbalance, mixed variable types, and missingness) while promoting transparency in reporting (TRIPOD+AI compliant). Offering automated configurations for beginners and customizable source code for experts, MAIT streamlines two primary use cases: Discovery (feature importance via unified scoring, e.g., SHapley Additive exPlanations - SHAP) and Prediction (model development and deployment with optimized solutions). Moreover, MAIT proposes new techniques including fine-tuning of probability threshold in binary classification, translation of cumulative hazard curves to binary classification, enhanced visualizations for model interpretation for mixed data types, and handling censoring through semi-supervised learning, to adapt to a wide set of data constraints and study designs. We provide detailed tutorials on GitHub, using four open-access data sets, to demonstrate how MAIT can be used to improve implementation and interpretation of ML models in medical research.

[LG-7] HypeRL: Parameter-Informed Reinforcement Learning for Parametric PDEs

Link: https://arxiv.org/abs/2501.04538
Authors: Nicolò Botteghi,Stefania Fresca,Mengwu Guo,Andrea Manzoni
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this work, we devise a new, general-purpose reinforcement learning strategy for the optimal control of parametric partial differential equations (PDEs). Such problems frequently arise in applied sciences and engineering and entail a significant complexity when control and/or state variables are distributed in high-dimensional space or depend on varying parameters. Traditional numerical methods, relying on either iterative minimization algorithms or dynamic programming, while reliable, often become computationally infeasible. Indeed, either way, the optimal control problem must be solved for each instance of the parameters, and this is out of reach when dealing with high-dimensional time-dependent and parametric PDEs. In this paper, we propose HypeRL, a deep reinforcement learning (DRL) framework to overcome the limitations shown by traditional methods. HypeRL aims at approximating the optimal control policy directly. Specifically, we employ an actor-critic DRL approach to learn an optimal feedback control strategy that can generalize across the range of variation of the parameters. To effectively learn such optimal control laws, encoding the parameter information into the DRL policy and value function neural networks (NNs) is essential. To do so, HypeRL uses two additional NNs, often called hypernetworks, to learn the weights and biases of the value function and the policy NNs. We validate the proposed approach on two PDE-constrained optimal control benchmarks, namely a 1D Kuramoto-Sivashinsky equation and the 2D Navier-Stokes equations, by showing that the knowledge of the PDE parameters and how this information is encoded, i.e., via a hypernetwork, is an essential ingredient for learning parameter-dependent control policies that can generalize effectively to unseen scenarios and for improving the sample efficiency of such policies.

[LG-8] A Plug-and-Play Bregman ADMM Module for Inferring Event Branches in Temporal Point Processes AAAI2025

Link: https://arxiv.org/abs/2501.04529
Authors: Qingmei Wang,Yuxin Wu,Yujie Long,Jing Huang,Fengyuan Ran,Bing Su,Hongteng Xu
Categories: Machine Learning (cs.LG)
*Comments: Accepted at AAAI 2025

Click to view abstract

Abstract:An event sequence generated by a temporal point process is often associated with a hidden and structured event branching process that captures the triggering relations between its historical and current events. In this study, we design a new plug-and-play module based on the Bregman ADMM (BADMM) algorithm, which infers event branches associated with event sequences in the maximum likelihood estimation framework of temporal point processes (TPPs). Specifically, we formulate the inference of event branches as an optimization problem for the event transition matrix under sparse and low-rank constraints, which is embedded in existing TPP models or their learning paradigms. We can implement this optimization problem based on subspace clustering and sparse group-lasso, respectively, and solve it using the Bregman ADMM algorithm, whose unrolling leads to the proposed BADMM module. When learning a classic TPP (e.g., Hawkes process) by the expectation-maximization algorithm, the BADMM module helps derive structured responsibility matrices in the E-step. Similarly, the BADMM module helps derive low-rank and sparse attention maps for the neural TPPs with self-attention layers. The structured responsibility matrices and attention maps, which work as learned event transition matrices, indicate event branches, e.g., inferring isolated events and those key events triggering many subsequent events. Experiments on both synthetic and real-world data show that plugging our BADMM module into existing TPP models and learning paradigms can improve model performance and provide us with interpretable structured event branches. The code is available at this https URL.

[LG-9] Histogram-Equalized Quantization for logic-gated Residual Neural Networks ISCAS2022

Link: https://arxiv.org/abs/2501.04517
Authors: Van Thien Nguyen,William Guicquero,Gilles Sicard
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Comments: Published at IEEE ISCAS 2022

Click to view abstract

Abstract:Adjusting the quantization according to the data or to the model loss seems mandatory to enable a high accuracy in the context of quantized neural networks. This work presents Histogram-Equalized Quantization (HEQ), an adaptive framework for linear symmetric quantization. HEQ automatically adapts the quantization thresholds using a unique step size optimization. We empirically show that HEQ achieves state-of-the-art performances on CIFAR-10. Experiments on the STL-10 dataset even show that HEQ enables a proper training of our proposed logic-gated (OR, MUX) residual networks with a higher accuracy at a lower hardware complexity than previous work.
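
For readers unfamiliar with the term, linear symmetric quantization with a single step size can be sketched as follows. This is only the generic quantizer that a step-size optimization would act on; how HEQ actually chooses the step from the data histogram is not described in the abstract and is not reproduced here. The function name, the `max_level` parameter, and the example values are assumptions made for illustration.

```python
import math


def quantize_symmetric(x: float, step: float, max_level: int) -> float:
    """Map x to the nearest multiple of `step` within [-max_level, max_level] * step.

    The levels are symmetric around zero, and the decision thresholds sit
    halfway between adjacent levels, so they are fully determined by `step`.
    """
    # Round half up to the nearest integer level.
    level = math.floor(x / step + 0.5)
    # Saturate at the outermost levels.
    level = max(-max_level, min(max_level, level))
    return level * step


# Hypothetical ternary quantizer with levels {-0.5, 0.0, 0.5}.
for value in (0.2, 0.3, -3.0):
    print(value, quantize_symmetric(value, step=0.5, max_level=1))
```

Because the thresholds follow from `step` alone, adapting the quantizer reduces to tuning one scalar per layer or tensor, which is the kind of degree of freedom the abstract refers to.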

[LG-10] Safe Reinforcement Learning with Minimal Supervision ICML2023

Link: https://arxiv.org/abs/2501.04481
Authors: Alexander Quessy,Thomas Richardson,Sebastian East
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*Comments: Initially submitted to ICML 2023

Click to view abstract

Abstract:Reinforcement learning (RL) in the real world necessitates the development of procedures that enable agents to explore without causing harm to themselves or others. The most successful solutions to the problem of safe RL leverage offline data to learn a safe-set, enabling safe online exploration. However, this approach to safe-learning is often constrained by the demonstrations that are available for learning. In this paper we investigate the influence of the quantity and quality of data used to train the initial safe learning problem offline on the ability to learn safe-RL policies online. Specifically, we focus on tasks with spatially extended goal states where we have few or no demonstrations available. Classically this problem is addressed either by using hand-designed controllers to generate data or by collecting user-generated demonstrations. However, these methods are often expensive and do not scale to more complex tasks and environments. To address this limitation we propose an unsupervised RL-based offline data collection procedure, to learn complex and scalable policies without the need for hand-designed controllers or user demonstrations. Our research demonstrates the significance of providing sufficient demonstrations for agents to learn optimal safe-RL policies online, and as a result, we propose optimistic forgetting, a novel online safe-RL approach that is practical for scenarios with limited data. Further, our unsupervised data collection approach highlights the need to balance diversity and optimality for safe online exploration.

[LG-11] Regularising NARX models with multi-task learning

Link: https://arxiv.org/abs/2501.04470
Authors: Sarah Bee,Lawrence Bull,Nikolaos Dervilis,Keith Worden
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:A Nonlinear Auto-Regressive with eXogenous inputs (NARX) model can be used to describe time-varying processes, where the output depends on both previous outputs and current/previous external input variables. One limitation of NARX models is their propensity to overfit, which results in poor generalisation for future predictions. To help overcome overfitting, the proposed method is a NARX model which predicts outputs at both the current time and several lead times into the future. This is a form of multi-task learning (MTL), whereby the lead-time outputs regularise the current-time output. This work shows that, for high noise levels, MTL can be used to regularise NARX, yielding a lower Normalised Mean Square Error (NMSE) than the independent learner counterpart.

[LG-12] Gradient Purification: Defense Against Poisoning Attack in Decentralized Federated Learning

Link: https://arxiv.org/abs/2501.04453
Authors: Bin Li,Xiaoye Miao,Yongheng Shang,Xinkui Zhao,Shuiguang Deng,Jianwei Yin
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Decentralized federated learning (DFL) is inherently vulnerable to poisoning attacks, as malicious clients can transmit manipulated model gradients to neighboring clients. Existing defense methods either reject suspicious gradients per iteration or restart DFL aggregation after detecting all malicious clients. They overlook the potential accuracy benefit from the discarded malicious gradients. In this paper, we propose a novel gradient purification defense, named GPD, that integrates seamlessly with existing DFL aggregation to defend against poisoning attacks. It aims to mitigate the harm in model gradients while retaining the benefit in model weights for enhancing accuracy. For each benign client in GPD, a recording variable is designed to track the historically aggregated gradients from one of its neighbors. It allows benign clients to precisely detect malicious neighbors and swiftly mitigate aggregated malicious gradients via historical consistency checks. Upon mitigation, GPD optimizes model weights via aggregating gradients solely from benign clients. This retains the previously beneficial portions from malicious clients and exploits the contributions from benign clients, thereby significantly enhancing the model accuracy. We analyze the convergence of GPD, as well as its ability to harvest high accuracy. Extensive experiments over three datasets demonstrate that GPD is capable of mitigating poisoning attacks under both iid and non-iid data distributions. It significantly outperforms state-of-the-art defenses in terms of accuracy against various poisoning attacks.

[LG-13] Motif Discovery Framework for Psychiatric EEG Data Classification

Link: https://arxiv.org/abs/2501.04441
Authors: Melanija Kraljevska,Katerina Hlavackova-Schindler,Lukas Miklautz,Claudia Plant
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In current medical practice, patients undergoing depression treatment must wait four to six weeks before a clinician can assess medication response due to the delayed noticeable effects of antidepressants. Identification of a treatment response at any earlier stage is of great importance, since it can reduce the emotional and economic burden connected with the treatment. We approach the prediction of a patient response to a treatment as a classification problem, by utilizing the dynamic properties of EEG recordings on the 7th day of the treatment. We present a novel framework that applies motif discovery to extract meaningful features from EEG data distinguishing between depression treatment responders and non-responders. We also applied our framework to classification tasks in other psychiatric EEG datasets, namely to patients with symptoms of schizophrenia, pediatric patients with intractable seizures, and Alzheimer's disease and dementia. We achieved high classification precision in all data sets. The results demonstrate that the dynamic properties of the EEGs may support clinicians in decision making both in diagnosis and in the prediction of depression treatment response as early as on the 7th day of the treatment. To the best of our knowledge, our work is the first one using motifs in depression diagnostics in general.

[LG-14] Risk-averse policies for natural gas futures trading using distributional reinforcement learning

Link: https://arxiv.org/abs/2501.04421
Authors: Félicien Hêche,Biagio Nigro,Oussama Barakat,Stephan Robert-Nicoud
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Financial markets have experienced significant instabilities in recent years, creating unique challenges for trading and increasing interest in risk-averse strategies. Distributional Reinforcement Learning (RL) algorithms, which model the full distribution of returns rather than just expected values, offer a promising approach to managing market uncertainty. This paper investigates this potential by studying the effectiveness of three distributional RL algorithms for natural gas futures trading and exploring their capacity to develop risk-averse policies. Specifically, we analyze the performance and behavior of Categorical Deep Q-Network (C51), Quantile Regression Deep Q-Network (QR-DQN), and Implicit Quantile Network (IQN). To the best of our knowledge, these algorithms have never been applied in a trading context. These policies are compared against five Machine Learning (ML) baselines, using a detailed dataset provided by Predictive Layer SA, a company supplying ML-based strategies for energy trading. The main contributions of this study are as follows. (1) We demonstrate that distributional RL algorithms significantly outperform classical RL methods, with C51 achieving performance improvement of more than 32%. (2) We show that training C51 and IQN to maximize CVaR produces risk-sensitive policies with adjustable risk aversion. Specifically, our ablation studies reveal that lower CVaR confidence levels increase risk aversion, while higher levels decrease it, offering flexible risk management options. In contrast, QR-DQN shows less predictable behavior. These findings emphasize the potential of distributional RL for developing adaptable, risk-averse trading strategies in volatile markets.
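
The link between the CVaR confidence level and risk aversion can be illustrated with a small empirical estimator. The convention used here, that CVaR at level `alpha` is the mean of the worst `alpha` fraction of returns, is an assumption, as are the function name and the sample returns; the abstract does not state the paper's exact definition.

```python
import math
from typing import Sequence


def empirical_cvar(returns: Sequence[float], alpha: float) -> float:
    """Mean of the worst ceil(alpha * n) returns (lower-tail CVaR).

    Assumed convention: alpha = 1.0 recovers the plain mean, and smaller
    alpha concentrates on the worst outcomes.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical sample of trading returns.
sample_returns = [-10.0, 0.0, 5.0, 10.0]
for level in (0.25, 0.5, 1.0):
    print(level, empirical_cvar(sample_returns, level))
```

Under this convention, a policy trained to maximize CVaR at a low level is scored only on its worst outcomes, so it is pushed toward avoiding large losses; at level 1.0 the objective coincides with the usual expected return.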

[LG-15] Lossless Privacy-Preserving Aggregation for Decentralized Federated Learning

Link: https://arxiv.org/abs/2501.04409
Authors: Xiaoye Miao,Bin Li,Yangyang Wu,Meng Xi,Xinkui Zhao,Jianwei Yin
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Privacy concerns arise as sensitive data proliferate. Despite decentralized federated learning (DFL) aggregating gradients from neighbors to avoid direct data transmission, it still poses indirect data leaks from the transmitted gradients. Existing privacy-preserving methods for DFL add noise to gradients. They either diminish the model predictive accuracy or suffer from ineffective gradient protection. In this paper, we propose a novel lossless privacy-preserving aggregation rule named LPPA to enhance gradient protection as much as possible but without loss of DFL model predictive accuracy. LPPA subtly injects the noise difference between the sent and received noise into transmitted gradients for gradient protection. The noise difference incorporates neighbors’ randomness for each client, effectively safeguarding against data leaks. LPPA employs the noise flow conservation theory to ensure that the noise impact can be globally eliminated. The global sum of all noise differences remains zero, ensuring that accurate gradient aggregation is unaffected and the model accuracy remains intact. We theoretically prove that the privacy-preserving capacity of LPPA is $\sqrt{2}$ times greater than that of noise addition, while maintaining comparable model accuracy to the standard DFL aggregation without noise injection. Experimental results verify the theoretical findings and show that LPPA achieves a 13% mean improvement in accuracy over noise addition. We also demonstrate the effectiveness of LPPA in protecting raw data and guaranteeing lossless model accuracy.
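
The claim that the noise differences sum to zero globally can be checked with a toy sketch. This is an illustration of the conservation idea as the abstract states it, not the authors' LPPA protocol: scalar gradients, a four-client ring with bidirectional links, Gaussian noise, and all names below are assumptions.

```python
import random
from typing import Dict, List


def noise_differences(neighbors: Dict[int, List[int]],
                      rng: random.Random) -> Dict[int, float]:
    """Per-client (noise sent) minus (noise received) on an undirected graph."""
    # noise[(i, j)] is the random value client i sends to neighbour j.
    noise = {
        (i, j): rng.gauss(0.0, 1.0)
        for i, peers in neighbors.items()
        for j in peers
    }
    diffs = {}
    for i, peers in neighbors.items():
        sent = sum(noise[(i, j)] for j in peers)
        received = sum(noise[(j, i)] for j in peers)
        diffs[i] = sent - received
    return diffs


# Hypothetical 4-client ring; every link is bidirectional.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
gradients = {0: 0.5, 1: -1.0, 2: 2.0, 3: 0.25}

diffs = noise_differences(ring, random.Random(0))
# Each client perturbs its gradient with its own noise difference.
masked = {i: gradients[i] + diffs[i] for i in gradients}

print(sum(diffs.values()))
print(sum(masked.values()), sum(gradients.values()))
```

Every noise value appears once as "sent" and once as "received", so the differences cancel in the global sum up to floating-point error, while each individual masked gradient still differs from the true one.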

[LG-16] Rising Rested MAB with Linear Drift

Link: https://arxiv.org/abs/2501.04403
Authors: Omer Amichay,Yishay Mansour
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We consider non-stationary multi-arm bandit (MAB) where the expected reward of each action follows a linear function of the number of times we executed the action. Our main result is a tight regret bound of $\tilde{\Theta}(T^{4/5}K^{3/5})$, by providing both upper and lower bounds. We extend our results to derive instance dependent regret bounds, which depend on the unknown parametrization of the linear drift of the rewards.

[LG-17] Tracking UWB Devices Through Radio Frequency Fingerprinting Is Possible

Link: https://arxiv.org/abs/2501.04401
Authors: Thibaud Ardoin,Niklas Pauli,Benedikt Groß,Mahsa Kholghi,Khan Reaz,Gerhard Wunder
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
*Comments: conference ICNC’25, 7 pages, 7 figures

Click to view abstract

Abstract:Ultra-wideband (UWB) is a state-of-the-art technology designed for applications requiring centimeter-level localization. Its widespread adoption by smartphone manufacturers naturally raises security and privacy concerns. Successfully applying Radio Frequency Fingerprinting (RFF) to UWB could enable physical layer security, but might also allow undesired tracking of the devices. The scope of this paper is to explore the feasibility of applying RFF to UWB and to investigate how well this technique generalizes across different environments. We collected a realistic dataset using off-the-shelf UWB devices with controlled variation in device positioning. Moreover, we developed an improved deep learning pipeline to extract the hardware signature from the signal data. In stable conditions, the extracted RFF achieves over 99% accuracy. While the accuracy decreases in more changing environments, we still obtain up to 76% accuracy in untrained locations.

[LG-18] AutoDFL: A Scalable and Automated Reputation-Aware Decentralized Federated Learning

Link: https://arxiv.org/abs/2501.04331
Authors: Meryem Malak Dif,Mouhamed Amine Bouchiha,Mourad Rabah,Yacine Ghamri-Doudane
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments: Paper accepted at NOMS’2025 (9 pages, 5 figures)

Click to view abstract

Abstract:Blockchained federated learning (BFL) combines the concepts of federated learning and blockchain technology to enhance privacy, security, and transparency in collaborative machine learning models. However, implementing BFL frameworks poses challenges in terms of scalability and cost-effectiveness. Reputation-aware BFL poses even more challenges, as blockchain validators are tasked with processing federated learning transactions along with the transactions that evaluate FL tasks and aggregate reputations. This leads to faster blockchain congestion and performance degradation. To improve BFL efficiency while increasing scalability and reducing on-chain reputation management costs, this paper proposes AutoDFL, a scalable and automated reputation-aware decentralized federated learning framework. AutoDFL leverages zk-Rollups as a Layer-2 scaling solution to boost the performance while maintaining the same level of security as the underlying Layer-1 blockchain. Moreover, AutoDFL introduces an automated and fair reputation model designed to incentivize federated learning actors. We develop a proof of concept for our framework for an accurate evaluation. Tested with various custom workloads, AutoDFL reaches an average throughput of over 3000 TPS with a gas reduction of up to 20X.

[LG-19] Navigating the Designs of Privacy-Preserving Fine-tuning for Large Language Models

Link: https://arxiv.org/abs/2501.04323
Authors: Shi Haonan, Ouyang Tu, Wang An
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 4 pages, 2 figures

Abstract:Instruction tuning has proven effective in enhancing Large Language Models’ (LLMs) performance on downstream tasks. However, real-world fine-tuning faces inherent conflicts between model providers’ intellectual property protection, clients’ data privacy requirements, and tuning costs. While recent approaches like split learning and offsite tuning demonstrate promising architectures for privacy-preserving fine-tuning, there is a gap in systematically addressing the multidimensional trade-offs required for diverse real-world deployments. We propose several indicative evaluation metrics to guide design trade-offs for privacy-preserving fine-tuning and a series of example designs, collectively named GuardedTuning; they result from novel combinations of system architectures with adapted privacy-enhancement methods and emerging computation techniques. Each design represents distinct trade-offs across model utility, privacy guarantees, and costs. Experimental results demonstrate that these designs protect against data reconstruction attacks while maintaining competitive fine-tuning performance.

[LG-20] VerifBFL: Leveraging zk-SNARKs for A Verifiable Blockchained Federated Learning

Link: https://arxiv.org/abs/2501.04319
Authors: Ahmed Ayoub Bellachia, Mouhamed Amine Bouchiha, Yacine Ghamri-Doudane, Mourad Rabah
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: Paper accepted at NOMS’25 (9 pages, 6 figures)

Abstract:Blockchain-based Federated Learning (BFL) is an emerging decentralized machine learning paradigm that enables model training without relying on a central server. Although some BFL frameworks are considered privacy-preserving, they are still vulnerable to various attacks, including inference and model poisoning. Additionally, most of these solutions employ strong trust assumptions among all participating entities or introduce incentive mechanisms to encourage collaboration, making them susceptible to multiple security flaws. This work presents VerifBFL, a trustless, privacy-preserving, and verifiable federated learning framework that integrates blockchain technology and cryptographic protocols. By employing zero-knowledge Succinct Non-Interactive Arguments of Knowledge (zk-SNARKs) and incrementally verifiable computation (IVC), VerifBFL ensures the verifiability of both local training and aggregation processes. The proofs of training and aggregation are verified on-chain, guaranteeing the integrity and auditability of each participant’s contributions. To protect training data from inference attacks, VerifBFL leverages differential privacy. Finally, to demonstrate the efficiency of the proposed protocols, we built a proof of concept using emerging tools. The results show that generating proofs for local training and aggregation in VerifBFL takes less than 81s and 2s, respectively, while verifying them on-chain takes less than 0.6s.
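
The abstract does not say which differential privacy mechanism VerifBFL uses. As a rough illustration of how differential privacy is commonly applied to a local model update, the sketch below clips the update to a bounded L2 norm and adds Gaussian noise. The function name `clip_and_noise` and the parameters `clip_norm` and `noise_multiplier` are assumptions made for this example, not names from the paper.

```python
import math
import random

def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise.

    Generic Gaussian-mechanism sketch (hypothetical helper, not VerifBFL's code).
    """
    norm = math.sqrt(sum(x * x for x in update))
    # Scale down only when the update exceeds the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in update]
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]

rng = random.Random(0)
raw_update = [3.0, 4.0]  # L2 norm is 5.0, so it gets scaled by 0.2
private_update = clip_and_noise(raw_update, clip_norm=1.0,
                                noise_multiplier=0.5, rng=rng)
```

Clipping bounds how much any single participant can influence the aggregate, which is what lets the added noise translate into a privacy guarantee; the actual privacy budget depends on accounting that this sketch leaves out.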

[LG-21] Physics-Informed Super-Resolution Diffusion for 6D Phase Space Diagnostics

Link: https://arxiv.org/abs/2501.04305
Authors: Alexander Scheinker
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Accelerator Physics (physics.acc-ph)

Abstract:Adaptive physics-informed super-resolution diffusion is developed for non-invasive virtual diagnostics of the 6D phase space density of charged particle beams. An adaptive variational autoencoder (VAE) embeds initial beam condition images and scalar measurements to a low-dimensional latent space from which a 32^6 pixel 6D tensor representation of the beam’s 6D phase space density is generated. Projecting from a 6D tensor generates physically consistent 2D projections. Physics-guided super-resolution diffusion transforms low-resolution images of the 6D density to high-resolution 256x256 pixel images. Unsupervised adaptive latent space tuning enables tracking of time-varying beams without knowledge of time-varying initial conditions. The method is demonstrated with experimental data and multi-particle simulations at the HiRES UED. The general approach is applicable to a wide range of complex dynamic systems evolving in high-dimensional phase space. The method is shown to be robust to distribution shift without re-training.
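
To make the remark about physically consistent 2D projections concrete, here is a minimal NumPy sketch. It assumes the 6D density is stored as a dense array and that a projection is obtained by summing over the four axes that are not kept. The helper name `project_2d`, the random test density, and the small grid size are assumptions for illustration only.

```python
import itertools
import numpy as np

def project_2d(density, keep):
    """Marginalize a 6D density onto the two axes in `keep` by summing the rest."""
    drop = tuple(ax for ax in range(density.ndim) if ax not in keep)
    return density.sum(axis=drop)

rng = np.random.default_rng(0)
n = 6  # small grid for illustration; the tensor in the abstract is far larger
density = rng.random((n,) * 6)
density /= density.sum()  # normalize to unit total mass

# All 15 axis pairs (i, j) with i < j give one 2D projection each.
projections = {
    pair: project_2d(density, pair)
    for pair in itertools.combinations(range(6), 2)
}
```

Because every projection is derived from the same 6D array, any two projections that share an axis agree on the 1D marginal along that axis. That agreement is not guaranteed when 2D images are generated independently of each other.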

[LG-22] Handling Incomplete Heterogeneous Data using a Data-Dependent Kernel

Link: https://arxiv.org/abs/2501.04300
Authors: Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal
Subjects: Machine Learning (cs.LG)

Abstract:Handling incomplete data in real-world applications is a critical challenge due to two key limitations of existing methods: (i) they are primarily designed for numeric data and struggle with categorical or heterogeneous/mixed datasets; (ii) they assume that data is missing completely at random, which is often not the case in practice – in reality, data is missing in patterns, leading to biased results if these patterns are not accounted for. To address these two limitations, this paper presents a novel approach to handling missing values using the Probability Mass Similarity Kernel (PMK), a data-dependent kernel, which does not make any assumptions about data types and missing mechanisms. It eliminates the need for prior knowledge or extensive pre-processing steps and instead leverages the distribution of observed data. Our method unifies the representation of diverse data types by capturing more meaningful pairwise similarities and enhancing downstream performance. We evaluated our approach across over 10 datasets with numerical-only, categorical-only, and mixed features under different missing mechanisms and rates. Across both classification and clustering tasks, our approach consistently outperformed existing techniques, demonstrating its robustness and effectiveness in managing incomplete heterogeneous data.

[LG-23] An Analysis of Model Robustness across Concurrent Distribution Shifts

Link: https://arxiv.org/abs/2501.04288
Authors: Myeongho Jeon, Suhwan Choi, Hyoje Lee, Teresa Yeo
Subjects: Machine Learning (cs.LG)
Comments: Accepted to TMLR

Abstract:Machine learning models, meticulously optimized for source data, often fail to predict target data when faced with distribution shifts (DSs). Previous benchmarking studies, though extensive, have mainly focused on simple DSs. Recognizing that DSs often occur in more complex forms in real-world scenarios, we broadened our study to include multiple concurrent shifts, such as unseen domain shifts combined with spurious correlations. We evaluated 26 algorithms that range from simple heuristic augmentations to zero-shot inference using foundation models, across 168 source-target pairs from eight datasets. Our analysis of over 100K models reveals that (i) concurrent DSs typically worsen performance compared to a single shift, with certain exceptions, (ii) if a model improves generalization for one distribution shift, it tends to be effective for others, and (iii) heuristic data augmentations achieve the best overall performance on both synthetic and real-world datasets.

[LG-24] ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization

Link: https://arxiv.org/abs/2501.04287
Authors: Keisuke Sugiura, Hiroki Matsutani
Subjects: Machine Learning (cs.LG)

Abstract:Zeroth-order (ZO) optimization is being recognized as a simple yet powerful alternative to standard backpropagation (BP)-based training. Notably, ZO optimization allows for training with only forward passes and (almost) the same memory as inference, making it well-suited for edge devices with limited computing and memory resources. In this paper, we propose ZO-based on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs), namely ElasticZO and ElasticZO-INT8. ElasticZO sits between pure ZO- and pure BP-based approaches, and is based on the idea of employing BP for the last few layers and ZO for the remaining layers. ElasticZO-INT8 achieves integer arithmetic-only ZO-based training for the first time, by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values. Experimental results on the classification datasets show that ElasticZO effectively addresses the slow convergence of vanilla ZO and shrinks the accuracy gap to BP-based training. Compared to vanilla ZO, ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead, and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 further reduces the memory usage and training time by 1.46-1.60x and 1.38-1.42x without compromising the accuracy. These results demonstrate a better tradeoff between accuracy and training cost compared to pure ZO- and BP-based approaches, and also highlight the potential of ZO optimization in on-device learning.
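
For readers unfamiliar with training from forward passes alone, the sketch below shows a standard two-point zeroth-order gradient estimator on a toy quadratic loss. It covers only the ZO ingredient, not ElasticZO's split between BP and ZO layers or its INT8 variant. The loss function, step size, and perturbation scale are assumptions chosen for the example.

```python
import random

def loss(theta):
    """Toy quadratic loss with its minimum at (1, -2)."""
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def zo_gradient(loss_fn, theta, eps, rng):
    """Two-point zeroth-order gradient estimate along a random Gaussian direction."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    # Two forward evaluations; no backpropagation is needed.
    plus = loss_fn([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss_fn([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2.0 * eps)
    return [scale * zi for zi in z]

def zo_sgd(loss_fn, theta, lr, eps, steps, seed=0):
    """Plain SGD driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, theta, eps, rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

theta_final = zo_sgd(loss, [0.0, 0.0], lr=0.05, eps=1e-3, steps=500)
```

Each step costs two loss evaluations and stores only the parameters and one random direction, which is the source of the inference-like memory footprint the abstract mentions. The estimate is noisy, so convergence is typically slower than with exact gradients.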

[LG-25] Cluster Disperse: a general air conflict resolution heuristic using unsupervised learning

Link: https://arxiv.org/abs/2501.04281
Authors: Mirmojtaba Gharibi, John-Paul Clarke
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)

Abstract:We provide a general and malleable heuristic for the air conflict resolution problem. This heuristic is based on a new neighborhood structure for searching the solution space of trajectories and flight levels. The core idea of our heuristic is to use unsupervised learning to cluster the conflict points and disperse them across various flight levels. Our first algorithm is called Cluster Disperse; in each iteration, it assigns the most problematic flights in each cluster to another flight level. In effect, we shuffle them between the flight levels until we achieve a well-balanced configuration. The Cluster Disperse algorithm then uses any horizontal plane conflict resolution algorithm as a subroutine to solve these well-balanced instances. In addition, we develop a novel algorithm for the horizontal plane based on a similar idea. That is, we cluster and disperse the conflict points spatially in the same flight level using gradient descent and a social force. We use a novel maneuver that makes flights travel on an arc instead of a straight path, based on the aviation routine of Radius to Fix legs. Our algorithms can handle a high density of flights within a reasonable computation time. We put their performance in context with some notable algorithms from the literature. Because Cluster Disperse is a general framework, a particular strength is its malleability: various constraints regarding the aircraft or the environment can be integrated with ease. This is in contrast to models based on, for instance, mixed integer programming.

[LG-26] Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion Across Varied Physics

Link: https://arxiv.org/abs/2501.04276
Authors: Yichao Zhong, Chong Zhang, Tairan He, Guanya Shi
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures

Abstract:Real-world legged locomotion systems often need to reconcile agility and safety for different scenarios. Moreover, the underlying dynamics are often unknown and time-variant (e.g., payload, friction). In this paper, we introduce BAS (Bridging Adaptivity and Safety), which builds upon the pipeline of prior work Agile But Safe (ABS) (He et al.) and is designed to provide adaptive safety even in dynamic environments with uncertainties. BAS involves an agile policy to avoid obstacles rapidly and a recovery policy to prevent collisions, a physical parameter estimator that is concurrently trained with the agile policy, and a learned control-theoretic RA (reach-avoid) value network that governs the policy switch. Also, the agile policy and RA network are both conditioned on physical parameters to make them adaptive. To mitigate the distribution shift issue, we further introduce an on-policy fine-tuning phase for the estimator to enhance its robustness and accuracy. The simulation results show that BAS achieves 50% better safety than baselines in dynamic environments while maintaining a higher speed on average. In real-world experiments, BAS shows its capability in complex environments with unknown physics (e.g., slippery floors with unknown frictions, unknown payloads up to 8kg), while baselines lack adaptivity, leading to collisions or degraded agility. As a result, BAS achieves a 19.8% increase in speed and a 2.36 times lower collision rate than ABS in the real world. Videos: this https URL.

[LG-27] Modeling All Response Surfaces in One for Conditional Search Spaces

Link: https://arxiv.org/abs/2501.04260
Authors: Jiaxing Li, Wei Liu, Chao Xue, Yibing Zhan, Xiaoxing Wang, Weifeng Liu, Dacheng Tao
Subjects: Machine Learning (cs.LG)

Abstract:Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP’s performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.

[LG-28] Stable Derivative Free Gaussian Mixture Variational Inference for Bayesian Inverse Problems

Link: https://arxiv.org/abs/2501.04259
Authors: Baojun Che, Yifan Chen, Zhenghao Huan, Daniel Zhengyu Huang, Weijie Wang
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 25 pages, 10 figures

Abstract:This paper is concerned with the approximation of probability distributions known up to normalization constants, with a focus on Bayesian inference for large-scale inverse problems in scientific computing. In this context, key challenges include costly repeated evaluations of forward models, multimodality, and inaccessible gradients for the forward model. To address them, we develop a variational inference framework that combines Fisher-Rao natural gradient with specialized quadrature rules to enable derivative free updates of Gaussian mixture variational families. The resulting method, termed Derivative Free Gaussian Mixture Variational Inference (DF-GMVI), guarantees covariance positivity and affine invariance, offering a stable and efficient framework for approximating complex posterior distributions. The effectiveness of DF-GMVI is demonstrated through numerical experiments on challenging scenarios, including distributions with multiple modes, infinitely many modes, and curved modes in spaces with up to hundreds of dimensions. The method’s practicality is further demonstrated in a large-scale application, where it successfully recovers the initial conditions of the Navier-Stokes equations from solution data at positive times.

[LG-29] Dynamic Localisation of Spatial-Temporal Graph Neural Network KDD’25

Link: https://arxiv.org/abs/2501.04239
Authors: Wenying Duan, Shujun Guo, Wei Huang, Hong Rao, Xiaoxi He
Subjects: Machine Learning (cs.LG)
Comments: This paper was accepted by KDD’25

Abstract:Spatial-temporal data, fundamental to many intelligent applications, reveals dependencies indicating causal links between present measurements at specific locations and historical data at the same or other locations. Within this context, adaptive spatial-temporal graph neural networks (ASTGNNs) have emerged as valuable tools for modelling these dependencies, especially through a data-driven approach rather than pre-defined spatial graphs. While this approach offers higher accuracy, it presents increased computational demands. Addressing this challenge, this paper delves into the concept of localisation within ASTGNNs, introducing an innovative perspective that spatial dependencies should be dynamically evolving over time. We introduce DynAGS, a localised ASTGNN framework aimed at maximising efficiency and accuracy in distributed deployment. This framework integrates dynamic localisation, time-evolving spatial graphs, and personalised localisation, all orchestrated around the Dynamic Graph Generator, a lightweight central module leveraging cross attention. The central module can integrate historical information in a node-independent manner to enhance the feature representation of nodes at the current moment. This improved feature representation is then used to generate a dynamic sparse graph without the need for costly data exchanges, and it supports personalised localisation. Performance assessments across two core ASTGNN architectures and nine real-world datasets from various applications reveal that DynAGS outshines current benchmarks, underscoring that the dynamic modelling of spatial dependencies can drastically improve model expressibility, flexibility, and system efficiency, especially in distributed settings.

[LG-30] STLCG++: A Masking Approach for Differentiable Signal Temporal Logic Specification

Link: https://arxiv.org/abs/2501.04194
Authors: Parv Kapoor, Kazuki Mizuta, Eunsuk Kang, Karen Leung
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: To be submitted to robotics journal for review

Abstract:Signal Temporal Logic (STL) offers a concise yet expressive framework for specifying and reasoning about spatio-temporal behaviors of robotic systems. Attractively, STL admits the notion of robustness, the degree to which an input signal satisfies or violates an STL specification, thus providing a nuanced evaluation of system performance. Notably, the differentiability of STL robustness enables direct integration into robotics workflows that rely on gradient-based optimization, such as trajectory optimization and deep learning. However, existing approaches to evaluating and differentiating STL robustness rely on recurrent computations, which become inefficient with longer sequences, limiting their use in time-sensitive applications. In this paper, we present STLCG++, a masking-based approach that parallelizes STL robustness evaluation and backpropagation across timesteps, achieving more than 1000x faster computation time than the recurrent approach. We also introduce a smoothing technique for differentiability through time interval bounds, expanding STL’s applicability in gradient-based optimization tasks over spatial and temporal variables. Finally, we demonstrate STLCG++'s benefits through three robotics use cases and provide open-source Python libraries in JAX and PyTorch for seamless integration into modern robotics workflows.
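
The contrast between recurrent and masked evaluation can be sketched in a few lines of NumPy. The example computes the robustness of an "always over [a, b]" operator at every timestep with a single masked minimum, and compares it with a step-by-step loop. This only illustrates the masking idea under simplifying assumptions (exact min instead of a smooth approximation, windows truncated at the end of the signal). It does not use the STLCG++ library's API, and the function names are hypothetical.

```python
import numpy as np

def always_masked(rho, a, b):
    """Robustness of 'always over [a, b]' at every timestep via one masked min.

    rho[t] is the robustness of the inner predicate at time t. Windows that
    run past the end of the signal are truncated (an assumption of this sketch).
    """
    T = rho.shape[0]
    t = np.arange(T)[:, None]    # evaluation times, one per row
    tp = np.arange(T)[None, :]   # candidate times, one per column
    mask = (tp >= t + a) & (tp <= t + b)
    # Entries outside each window are replaced by +inf so they never win the min.
    windowed = np.where(mask, rho[None, :], np.inf)
    return windowed.min(axis=1)

def always_loop(rho, a, b):
    """Reference implementation that walks the signal one timestep at a time."""
    T = rho.shape[0]
    out = np.full(T, np.inf)
    for t in range(T):
        window = rho[t + a : min(t + b, T - 1) + 1]
        if window.size:
            out[t] = window.min()
    return out

signal = np.array([0.5, 1.0, -0.2, 0.3, 0.8, 0.1])
rho = signal - 0.0  # predicate: signal > 0
fast = always_masked(rho, a=0, b=2)
```

The masked version trades memory for parallelism: it materializes a T-by-T array but has no sequential dependency between timesteps, which is the property that makes this style of computation friendly to accelerators.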

[LG-31] KGIF: Optimizing Relation-Aware Recommendations with Knowledge Graph Information Fusion

Link: https://arxiv.org/abs/2501.04161
Authors: Dong Hyun Jeon, Wenbo Sun, Houbing Herbert Song, Dongfang Liu, Velasquez Alvaro, Yixin Chloe Xie, Shuteng Niu
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Published at IEEE Big Data 2024

Abstract:While deep-learning-enabled recommender systems demonstrate strong performance benchmarks, many struggle to adapt effectively in real-world environments due to limited use of user-item relationship data and insufficient transparency in recommendation generation. Traditional collaborative filtering approaches fail to integrate multifaceted item attributes, and although Factorization Machines account for item-specific details, they overlook broader relational patterns. Collaborative knowledge graph-based models have progressed by embedding user-item interactions with item-attribute relationships, offering a holistic perspective on interconnected entities. However, these models frequently aggregate attribute and interaction data in an implicit manner, leaving valuable relational nuances underutilized. This study introduces the Knowledge Graph Attention Network with Information Fusion (KGIF), a specialized framework designed to merge entity and relation embeddings explicitly through a tailored self-attention mechanism. The KGIF framework integrates reparameterization via dynamic projection vectors, enabling embeddings to adaptively represent intricate relationships within knowledge graphs. This explicit fusion enhances the interplay between user-item interactions and item-attribute relationships, providing a nuanced balance between user-centric and item-centric representations. An attentive propagation mechanism further optimizes knowledge graph embeddings, capturing multi-layered interaction patterns. The contributions of this work include an innovative method for explicit information fusion, improved robustness for sparse knowledge graphs, and the ability to generate explainable recommendations through interpretable path visualization. 

[LG-32] Stochastic Process Learning via Operator Flow Matching

Link: https://arxiv.org/abs/2501.04126
Authors: Yaozhong Shi, Zachary E. Ross, Domniki Asimaki, Kamyar Azizzadenesheli
Subjects: Machine Learning (cs.LG)

Abstract:Expanding on neural operators, we propose a novel framework for stochastic process learning across arbitrary domains. In particular, we develop operator flow matching (OFM) for learning stochastic process priors on function spaces. OFM provides the probability density of the values of any collection of points and enables mathematically tractable functional regression at new points with mean and density estimation. Our method outperforms state-of-the-art models in stochastic process learning, functional regression, and prior learning.

[LG-33] DeepVIVONet: Using deep neural operators to optimize sensor locations with application to vortex-induced vibrations

Link: https://arxiv.org/abs/2501.04105
Authors: Ruyin Wan, Ehsan Kharazmi, Michael S Triantafyllou, George Em Karniadakis
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Fluid Dynamics (physics.flu-dyn)

Abstract:We introduce DeepVIVONet, a new framework for optimal dynamic reconstruction and forecasting of the vortex-induced vibrations (VIV) of a marine riser, using field data. We demonstrate the effectiveness of DeepVIVONet in accurately reconstructing the motion of an offshore marine riser by using sparse spatio-temporal measurements. We also show the generalization of our model in extrapolating to other flow conditions via transfer learning, underscoring its potential to streamline operational efficiency and enhance predictive accuracy. The trained DeepVIVONet serves as a fast and accurate surrogate model for the marine riser, which we use in an outer-loop optimization algorithm to obtain the optimal locations for placing the sensors. Furthermore, we employ an existing sensor placement method based on proper orthogonal decomposition (POD) to compare with our data-driven approach. We find that while POD offers a good approach for initial sensor placement, DeepVIVONet’s adaptive capabilities yield more precise and cost-effective configurations.

[LG-34] Neighbor displacement-based enhanced synthetic oversampling for multiclass imbalanced data

Link: https://arxiv.org/abs/2501.04099
Authors: I Made Putrama, Peter Martinek
Subjects: Machine Learning (cs.LG)

Abstract:Imbalanced multiclass datasets pose challenges for machine learning algorithms. These datasets often contain minority classes that are important for accurate prediction. Existing methods still suffer from sparse data and may not accurately represent the original data patterns, leading to noise and poor model performance. A hybrid method called Neighbor Displacement-based Enhanced Synthetic Oversampling (NDESO) is proposed in this paper. This approach uses a displacement strategy for noisy data points, computing the average distance to their neighbors and moving them closer to their centroids. Random oversampling is then performed to achieve dataset balance. Extensive evaluations compare 14 alternatives on nine classifiers across synthetic and 20 real-world datasets with varying imbalance ratios. The results show that our method outperforms its competitors regarding average G-mean score and achieves the lowest statistical mean rank. This highlights its superiority and suitability for addressing data imbalance in practical applications.
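
The balancing step mentioned in the abstract, random oversampling, can be sketched with the standard library alone. The example below duplicates randomly chosen minority samples until every class matches the largest one. It does not implement NDESO's neighbor displacement step, whose details the abstract leaves out. The function name and toy data are assumptions for illustration.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        # Draw replacements only from the original samples of this class.
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

X = [[0.1], [0.2], [0.3], [0.4], [1.5], [1.6], [2.9]]
y = ["a", "a", "a", "a", "b", "b", "c"]
X_bal, y_bal = random_oversample(X, y)
```

Plain duplication adds no new information and copies any noisy points along with clean ones, which is presumably why the paper repositions noisy points before this step.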

[LG-35] FedKD-hybrid: Federated Hybrid Knowledge Distillation for Lithography Hotspot Detection

Link: https://arxiv.org/abs/2501.04066
Authors: Yuqi Li, Xingyou Lin, Kai Zhang, Chuanguang Yang, Zhongliang Guo, Jianping Gou, Yanli Li
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:Federated Learning (FL) provides novel solutions for machine learning (ML)-based lithography hotspot detection (LHD) under distributed privacy-preserving settings. Currently, two research pipelines have been investigated to aggregate local models and achieve global consensus: parameter-based and nonparameter-based (the latter also known as knowledge distillation, KD). While both kinds of methods show effectiveness in specific scenarios, we note that they have not fully utilized and transferred the information learned, leaving the potential of FL-based LHD unexplored. Thus, we propose FedKD-hybrid in this study to address this research gap. Specifically, FedKD-hybrid clients agree on several identical layers across all participants and a public dataset for achieving global consensus. During training, the trained local model will be evaluated on the public dataset, and the generated logits will be uploaded along with the identical layer parameters. The aggregated information is consequently used to update local models via the public dataset as a medium. We compare our proposed FedKD-hybrid with several state-of-the-art (SOTA) FL methods under ICCAD-2012 and FAB (real-world collected) datasets with different settings; the experimental results demonstrate the superior performance of the FedKD-hybrid algorithm. Our code is available at this https URL

[LG-36] Fuzzy Information Entropy and Region Biased Matrix Factorization for Web Service QoS Prediction

Link: https://arxiv.org/abs/2501.04063
Authors: Guoxing Tang, Yugen Du, Xia Chen, Yingwei Luo, Benchi Ma
Subjects: Machine Learning (cs.LG)

Abstract:Nowadays, there are many similar services available on the internet, making Quality of Service (QoS) a key concern for users. Since collecting QoS values for all services through user invocations is impractical, predicting QoS values is a more feasible approach. Matrix factorization is considered an effective prediction method. However, most existing matrix factorization algorithms focus on capturing global similarities between users and services, overlooking the local similarities between users and their similar neighbors, as well as the non-interactive effects between users and services. This paper proposes a matrix factorization approach based on user information entropy and region bias, which utilizes a similarity measurement method based on fuzzy information entropy to identify similar neighbors of users. Simultaneously, it integrates the region bias between each user and service linearly into matrix factorization to capture the non-interactive features between users and services. This method demonstrates improved predictive performance in more realistic and complex network environments. Additionally, numerous experiments are conducted on real-world QoS datasets. The experimental results show that the proposed method outperforms some of the state-of-the-art methods in the field at matrix densities ranging from 5% to 20%.

[LG-37] Causal Machine Learning Methods for Estimating Personalised Treatment Effects – Insights on validity from two large trials

Link: https://arxiv.org/abs/2501.04061
Authors: Hongruyu Chen, Helena Aebersold, Milo Alan Puhan, Miquel Serra-Burriel
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 1 main table, 2 figures

Abstract:Causal machine learning (ML) methods hold great promise for advancing precision medicine by estimating personalized treatment effects. However, their reliability remains largely unvalidated in empirical settings. In this study, we assessed the internal and external validity of 17 mainstream causal heterogeneity ML methods (including metalearners, tree-based methods, and deep learning methods) using data from two large randomized controlled trials: the International Stroke Trial (N=19,435) and the Chinese Acute Stroke Trial (N=21,106). Our findings reveal that none of the ML methods reliably validated their performance, either internally or externally, showing significant discrepancies between training and test data on the proposed evaluation metrics. The individualized treatment effects estimated from training data failed to generalize to the test data, even in the absence of distribution shifts. These results raise concerns about the current applicability of causal ML models in precision medicine, and highlight the need for more robust validation techniques to ensure generalizability.

[LG-38] SFADNet: Spatio-temporal Fused Graph based on Attention Decoupling Network for Traffic Prediction

Link: https://arxiv.org/abs/2501.04060
Authors: Mei Wu, Wenchao Weng, Jun Li, Yiqian Lin, Jing Chen, Dewen Seng
Subjects: Machine Learning (cs.LG)
Comments: Accepted by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:In recent years, traffic flow prediction has played a crucial role in the management of intelligent transportation systems. However, traditional prediction methods are often limited by static spatial modeling, making it difficult to accurately capture the dynamic and complex relationships between time and space, thereby affecting prediction accuracy. This paper proposes an innovative traffic flow prediction network, SFADNet, which categorizes traffic flow into multiple traffic patterns based on temporal and spatial feature matrices. For each pattern, we construct an independent adaptive spatio-temporal fusion graph based on a cross-attention mechanism, employing residual graph convolution modules and time series modules to better capture dynamic spatio-temporal relationships under different fine-grained traffic patterns. Extensive experimental results demonstrate that SFADNet outperforms current state-of-the-art baselines across four large-scale datasets.

[LG-39] Approximation Rates in Fréchet Metrics: Barron Spaces, Paley-Wiener Spaces, and Fourier Multipliers

Link: https://arxiv.org/abs/2501.04023
Authors: Ahmed Abdeljawad, Thomas Dittrich
Subjects: Numerical Analysis (math.NA); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Operator learning is a recent development in the simulation of Partial Differential Equations (PDEs) by means of neural networks. The idea behind this approach is to learn the behavior of an operator, such that the resulting neural network is an (approximate) mapping in infinite-dimensional spaces that is capable of (approximately) simulating the solution operator governed by the PDE. In our work, we study some general approximation capabilities for linear differential operators by approximating the corresponding symbol in the Fourier domain. Analogous to the structure of the class of Hörmander-Symbols, we consider the approximation with respect to a topology that is induced by a sequence of semi-norms. In that sense, we measure the approximation error in terms of a Fréchet metric, and our main result identifies sufficient conditions for achieving a predefined approximation error. Secondly, we then focus on a natural extension of our main theorem, in which we manage to reduce the assumptions on the sequence of semi-norms. Based on some existing approximation result for the exponential spectral Barron space, we then present a concrete example of symbols that can be approximated well, and we also show the analogy of this approximation to the design of digital filters in Signal Processing.

[LG-40] FlexCache: Flexible Approximate Cache System for Video Diffusion

Link: https://arxiv.org/abs/2501.04012
Authors: Desen Sun, Henry Tian, Tim Lu, Sihang Liu
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG)

Abstract:Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from substantial computational complexity, often requiring several minutes to generate a single video. While prior research has addressed the computational overhead in text-to-image diffusion models, the techniques developed are not directly suitable for video diffusion models due to the significantly larger cache requirements and enhanced computational demands associated with video generation. We present FlexCache, a flexible approximate cache system that addresses the challenges in two main designs. First, we compress the caches before saving them to storage. Our compression strategy can reduce 6.7 times consumption on average. Then we find that the approximate cache system can achieve higher hit rate and computation savings by decoupling the object and background. We further design a tailored cache replacement policy to support the two techniques mentioned above better. Through our evaluation, FlexCache reaches 1.26 times higher throughput and 25% lower cost compared to the state-of-the-art diffusion approximate cache system.

[LG-41] Multi-SpaCE: Multi-Objective Subsequence-based Sparse Counterfactual Explanations for Multivariate Time Series Classification

Link: https://arxiv.org/abs/2501.04009
Authors: Mario Refoyo, David Luengo
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Deep Learning systems excel in complex tasks but often lack transparency, limiting their use in critical applications. Counterfactual explanations, a core tool within eXplainable Artificial Intelligence (XAI), offer insights into model decisions by identifying minimal changes to an input to alter its predicted outcome. However, existing methods for time series data are limited by univariate assumptions, rigid constraints on modifications, or lack of validity guarantees. This paper introduces Multi-SpaCE, a multi-objective counterfactual explanation method for multivariate time series. Using non-dominated ranking genetic algorithm II (NSGA-II), Multi-SpaCE balances proximity, sparsity, plausibility, and contiguity. Unlike most methods, it ensures perfect validity, supports multivariate data and provides a Pareto front of solutions, enabling flexibility to different end-user needs. Comprehensive experiments in diverse datasets demonstrate the ability of Multi-SpaCE to consistently achieve perfect validity and deliver superior performance compared to existing methods.
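
The Pareto front mentioned in this abstract can be illustrated with a small dominance filter. This is only a sketch of the non-domination idea that NSGA-II builds on, not the authors' implementation: the function names, the two objectives (distance to the original series and number of changed points, both minimized), and the candidate values are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical counterfactual candidates: (distance to original, changed points).
candidates = [(0.9, 2), (0.5, 5), (0.7, 3), (0.8, 6), (0.5, 7)]
front = pareto_front(candidates)
print(front)  # indices of the non-dominated candidates
```

Returning a set of trade-offs instead of a single candidate is what lets an end user choose, for example, a sparser explanation at the cost of a larger change.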

[LG-42] Toward Sufficient Statistical Power in Algorithmic Bias Assessment: A Test for ABROCA

Link: https://arxiv.org/abs/2501.04683
Authors: Conrad Borchers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Algorithmic bias is a pressing concern in educational data mining (EDM), as it risks amplifying inequities in learning outcomes. The Area Between ROC Curves (ABROCA) metric is frequently used to measure discrepancies in model performance across demographic groups to quantify overall model fairness. However, its skewed distribution–especially when class or group imbalances exist–makes significance testing challenging. This study investigates ABROCA’s distributional properties and contributes robust methods for its significance testing. Specifically, we address (1) whether ABROCA follows any known distribution, (2) how to reliably test for algorithmic bias using ABROCA, and (3) the statistical power achievable with ABROCA-based bias assessments under typical EDM sample specifications. Simulation results confirm that ABROCA does not match standard distributions, including those suited to accommodate skewness. We propose nonparametric randomization tests for ABROCA and demonstrate that reliably detecting bias with ABROCA requires large sample sizes or substantial effect sizes, particularly in imbalanced settings. Findings suggest that ABROCA-based bias evaluation based on sample sizes common in EDM tends to be underpowered, undermining the reliability of conclusions about model fairness. By offering open-source code to simulate power and statistically test ABROCA, this paper aims to foster more reliable statistical testing in EDM research. It supports broader efforts toward replicability and equity in educational modeling.
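
To make the idea concrete, here is a minimal sketch of ABROCA (the area between two groups' ROC curves) with a generic randomization test that shuffles group membership. The abstract does not spell out the paper's exact procedure, so the permutation scheme, the function names, the grid resolution, and the synthetic data are all assumptions; the code also assumes every group contains both classes.

```python
import random
from bisect import bisect_right


def roc_points(labels, scores):
    """ROC curve as (fpr, tpr) lists from (0, 0) to (1, 1).
    Assumes both classes are present in `labels`."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(zip(scores, labels), key=lambda t: -t[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (s, y) in enumerate(order):
        tp += y
        fp += 1 - y
        # Emit a point only once a run of tied scores has ended.
        if i + 1 == len(order) or order[i + 1][0] != s:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def tpr_at(x, fpr, tpr):
    """Linearly interpolate the ROC curve at false-positive rate x."""
    i = bisect_right(fpr, x) - 1
    if i == len(fpr) - 1:
        return tpr[-1]
    w = (x - fpr[i]) / (fpr[i + 1] - fpr[i])
    return tpr[i] + w * (tpr[i + 1] - tpr[i])


def abroca(labels, scores, groups, grid=200):
    """Trapezoidal estimate of the area between the ROC curves of groups 0 and 1."""
    curves = []
    for g in (0, 1):
        ys = [y for y, k in zip(labels, groups) if k == g]
        ss = [s for s, k in zip(scores, groups) if k == g]
        curves.append(roc_points(ys, ss))
    xs = [i / grid for i in range(grid + 1)]
    gaps = [abs(tpr_at(x, *curves[0]) - tpr_at(x, *curves[1])) for x in xs]
    return sum((gaps[i] + gaps[i + 1]) / 2 for i in range(grid)) / grid


def randomization_test(labels, scores, groups, n_perm=200, seed=0):
    """P-value for the observed ABROCA under random reassignment of groups."""
    rng = random.Random(seed)
    observed = abroca(labels, scores, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abroca(labels, scores, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic data (assumption): scores are noisier for group 1 than for group 0.
rng = random.Random(1)
groups = [i % 2 for i in range(300)]
labels = [rng.randint(0, 1) for _ in range(300)]
scores = [y + rng.gauss(0, 0.7 if g == 0 else 2.0) for y, g in zip(labels, groups)]
observed, p_value = randomization_test(labels, scores, groups)
print(f"ABROCA = {observed:.3f}, p = {p_value:.3f}")
```

Because the statistic is an integral of absolute differences, it is non-negative even when the two groups are treated identically, which is one reason to compare against a reference distribution instead of against zero.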

[LG-43] Natural Variational Annealing for Multimodal Optimization

Link: https://arxiv.org/abs/2501.04667
Authors: Tâm Le Minh, Julyan Arbel, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Florence Forbes
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as, mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning where updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA giving rise to new algorithms and also allowing us to incorporate “fitness shaping”, a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare them to methods using gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.

[LG-44] Revisiting LocalSGD and SCAFFOLD: Improved Rates and Missing Analysis

Link: https://arxiv.org/abs/2501.04443
Authors: Ruichen Luo, Sebastian U Stich, Samuel Horváth, Martin Takáč
Subjects: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:LocalSGD and SCAFFOLD are widely used methods in distributed stochastic optimization, with numerous applications in machine learning, large-scale data processing, and federated learning. However, rigorously establishing their theoretical advantages over simpler methods, such as minibatch SGD (MbSGD), has proven challenging, as existing analyses often rely on strong assumptions, unrealistic premises, or overly restrictive scenarios. In this work, we revisit the convergence properties of LocalSGD and SCAFFOLD under a variety of existing or weaker conditions, including gradient similarity, Hessian similarity, weak convexity, and Lipschitz continuity of the Hessian. Our analysis shows that (i) LocalSGD achieves faster convergence compared to MbSGD for weakly convex functions without requiring stronger gradient similarity assumptions; (ii) LocalSGD benefits significantly from higher-order similarity and smoothness; and (iii) SCAFFOLD demonstrates faster convergence than MbSGD for a broader class of non-quadratic functions. These theoretical insights provide a clearer understanding of the conditions under which LocalSGD and SCAFFOLD outperform MbSGD.

[LG-45] Machine Learning and statistical classification of CRISPR-Cas12a diagnostic assays

Link: https://arxiv.org/abs/2501.04413
Authors: Nathan Khosla, Jake M. Lesinski, Marcus Haywood-Alexander, Andrew J. deMello, Daniel A. Richards
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 25 pages, 5 figures, research paper. Nathan Khosla and Jake M. Lesinski contributed equally. Electronic supporting information is included as an appendix

Abstract:CRISPR-based diagnostics have gained increasing attention as biosensing tools able to address limitations in contemporary molecular diagnostic tests. To maximise the performance of CRISPR-based assays, much effort has focused on optimizing the chemistry and biology of the biosensing reaction. However, less attention has been paid to improving the techniques used to analyse CRISPR-based diagnostic data. To date, diagnostic decisions typically involve various forms of slope-based classification. Such methods are superior to traditional methods based on assessing absolute signals, but still have limitations. Herein, we establish performance benchmarks (total accuracy, sensitivity, and specificity) using common slope-based methods. We compare the performance of these benchmark methods with three different quadratic empirical distribution function statistical tests, finding significant improvements in diagnostic speed and accuracy when applied to a clinical data set. Two of the three statistical techniques, the Kolmogorov-Smirnov and Anderson-Darling tests, report the lowest time-to-result and highest total test accuracy. Furthermore, we developed a long short-term memory recurrent neural network to classify CRISPR-biosensing data, achieving 100% specificity on our model data set. Finally, we provide guidelines on choosing the classification method and classification method parameters that best suit a diagnostic assay's needs.

[LG-46] The unbearable lightness of Restricted Boltzmann Machines: Theoretical Insights and Biological Applications

Link: https://arxiv.org/abs/2501.04387
Authors: Giovanni di Sarra, Barbara Bravi, Yasser Roudi
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 7 pages, 3 figures. To be published in EPL as di Sarra et al 2025 EPL. Accepted manuscript available online at this https URL

Abstract:Restricted Boltzmann Machines are simple yet powerful neural networks. They can be used for learning structure in data, and are used as a building block of more complex neural architectures. At the same time, their simplicity makes them easy to use, amenable to theoretical analysis, yielding interpretable models in applications. Here, we focus on reviewing the role that the activation functions, describing the input-output relationship of single neurons in RBM, play in the functionality of these models. We discuss recent theoretical results on the benefits and limitations of different activation functions. We also review applications to biological data analysis, namely neural data analysis, where RBM units are mostly taken to have sigmoid activation functions and binary units, to protein data analysis and immunology where non-binary units and non-sigmoid activation functions have recently been shown to yield important insights into the data. Finally, we discuss open problems addressing which can shed light on broader issues in neural network research.

[LG-47] DCIts – Deep Convolutional Interpreter for time series

Link: https://arxiv.org/abs/2501.04339
Authors: Davor Horvatic, Domjan Baric
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 37 pages, 15 figures

Abstract:We introduce an interpretable deep learning model for multivariate time series forecasting that prioritizes both predictive performance and interpretability - key requirements for understanding complex physical phenomena. Our model not only matches but often surpasses existing interpretability methods, achieving this without compromising accuracy. Through extensive experiments, we demonstrate its ability to identify the most relevant time series and lags that contribute to forecasting future values, providing intuitive and transparent explanations for its predictions. To minimize the need for manual supervision, the model is designed so one can robustly determine the optimal window size that captures all necessary interactions within the smallest possible time frame. Additionally, it effectively identifies the optimal model order, balancing complexity when incorporating higher-order terms. These advancements hold significant implications for modeling and understanding dynamic systems, making the model a valuable tool for applied and computational physicists.

[LG-48] FSC-loss: A Frequency-domain Structure Consistency Learning Approach for Signal Data Recovery and Reconstruction

Link: https://arxiv.org/abs/2501.04308
Authors: Liwen Zhang, Zhaoji Miao, Fan Yang, Gen Shi, Jie He, Yu An, Hui Hui, Jie Tian
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:A core challenge for signal data recovery is to model the distribution of signal matrix (SM) data based on measured low-quality data in biomedical engineering of magnetic particle imaging (MPI). Acquiring a high-resolution (high-quality) SM requires meticulous measurements at numerous positions in the field of view, which is time-consuming (measurement of a 37x37x37 SM takes about 32 hours). To improve reconstructed signal quality and shorten SM measurement time, existing methods explore generating a high-resolution SM from a quickly measured low-resolution SM (a 9x9x9 SM takes only about 0.5 hours). However, previous methods show poor performance for high-frequency signal recovery in SM. To achieve high-resolution SM recovery and shorten its acquisition time, we propose a frequency-domain structure consistency loss function and data component embedding strategy to model global and local structural information of SM. We adopt a transformer-based network to evaluate this function and the strategy. We evaluate our methods and state-of-the-art (SOTA) methods on the two simulation datasets and four public measured SMs in Open MPI Data. The results show that our method outperforms the SOTA methods in high-frequency structural signal recovery. Additionally, our method can recover a high-resolution SM with clear high-frequency structure at a down-sampling factor of 16 in less than 15 seconds, making acquisition more than 60 times faster than the measurement-based HR SM, with the minimum error (nRMSE=0.041). Moreover, our method is applied in our three in-house MPI systems and boosts their performance for signal reconstruction.

[LG-49] On weight and variance uncertainty in neural networks for regression tasks

Link: https://arxiv.org/abs/2501.04272
Authors: Moein Monemi, Morteza Amini, S. Mahmoud Taheri, Mohammad Arashi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Submitted to journal

Abstract:We consider the problem of weight uncertainty proposed by [Blundell et al. (2015). Weight uncertainty in neural network. In International conference on machine learning, 1613-1622, PMLR.] in neural networks (NNs) specialized for regression tasks. We further investigate the effect of variance uncertainty in their model. We show that including the variance uncertainty can improve the prediction performance of the Bayesian NN. Variance uncertainty enhances the generalization of the model by considering the posterior distribution over the variance parameter. We examine the generalization ability of the proposed model using a function approximation example and further illustrate it with the riboflavin genetic data set. We explore fully connected dense networks and dropout NNs with Gaussian and spike-and-slab priors, respectively, for the network weights.

[LG-50] Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks NEURIPS2024

Link: https://arxiv.org/abs/2501.04234
Authors: Rachel Longjohn, Giri Gopalan, Emily Casleton
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: LA-UR-24-25289; presented at the Workshop on Statistical Frontiers in LLMs and Foundation Models at NeurIPS 2024

Abstract:Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
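
As a rough illustration of the bootstrapping approach named in this abstract, the sketch below resamples examples within each task and reports a percentile interval for the unweighted mean of per-task accuracies. It is a generic sketch, not the authors' code: the function name, the equal task weighting, the 95% level, and the three synthetic tasks are assumptions.

```python
import random


def bootstrap_aggregate(per_task_correct, n_boot=1000, seed=0):
    """Point estimate and 95% percentile interval for the mean of
    per-task accuracies, resampling examples within each task."""
    rng = random.Random(seed)

    def accuracy(outcomes):
        return sum(outcomes) / len(outcomes)

    point = sum(accuracy(c) for c in per_task_correct) / len(per_task_correct)
    estimates = []
    for _ in range(n_boot):
        # Resample each task's 0/1 outcomes with replacement.
        task_means = [accuracy(rng.choices(c, k=len(c))) for c in per_task_correct]
        estimates.append(sum(task_means) / len(task_means))
    estimates.sort()
    low = estimates[int(0.025 * n_boot)]
    high = estimates[int(0.975 * n_boot) - 1]
    return point, low, high


# Synthetic outcomes (assumption): 1 = correct, 0 = incorrect, for three tasks.
tasks = [
    [1] * 80 + [0] * 20,    # accuracy 0.80 on 100 examples
    [1] * 120 + [0] * 80,   # accuracy 0.60 on 200 examples
    [1] * 360 + [0] * 40,   # accuracy 0.90 on 400 examples
]
point, low, high = bootstrap_aggregate(tasks)
print(f"mean accuracy = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate shows how much of a gap between two models' aggregate scores could be explained by sampling noise alone.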

[LG-51] Comparison of Neural Models for X-ray Image Classification in COVID-19 Detection

Link: https://arxiv.org/abs/2501.04196
Authors: Jimi Togni, Romis Attux
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 9 pages, 7 tables, 5 figures. XXXIX SIMPOSIO BRASILEIRO DE TELECOMUNICACOES E PROCESSAMENTO DE SINAIS - SBrT 2021

Abstract:This study presents a comparative analysis of methods for detecting COVID-19 infection in radiographic images. The images, sourced from publicly available datasets, were categorized into three classes: ‘normal,’ ‘pneumonia,’ and ‘COVID.’ For the experiments, transfer learning was employed using eight pre-trained networks: SqueezeNet, DenseNet, ResNet, AlexNet, VGG, GoogleNet, ShuffleNet, and MobileNet. DenseNet achieved the highest accuracy of 97.64% using the ADAM optimization function in the multiclass approach. In the binary classification approach, the highest precision was 99.98%, obtained by the VGG, ResNet, and MobileNet networks. A comparative evaluation was also conducted using heat maps.

[LG-52] Generation from Noisy Examples

Link: https://arxiv.org/abs/2501.04179
Authors: Ananth Raman, Vinod Raman
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 19 pages. arXiv admin note: text overlap with arXiv:2410.13714

Abstract:We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a sequence of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, an adversary still picks a hypothesis and a sequence of its positive examples. But, before presenting the stream to the generator, the adversary inserts a finite number of negative examples. Unaware of which examples are noisy, the goal of the generator is to still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for when a binary hypothesis class can be noisily generatable. We provide such conditions with respect to various constraints on the number of distinct examples that need to be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.

[LG-53] Mixing Times and Privacy Analysis for the Projected Langevin Algorithm under a Modulus of Continuity

Link: https://arxiv.org/abs/2501.04134
Authors: Mario Bravo, Juan P. Flores-Mella, Cristóbal Guzmán
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments: 40 pages, 2 figures

Abstract:We study the mixing time of the projected Langevin algorithm (LA) and the privacy curve of noisy Stochastic Gradient Descent (SGD), beyond nonexpansive iterations. Specifically, we derive new mixing time bounds for the projected LA which are, in some important cases, dimension-free and poly-logarithmic on the accuracy, closely matching the existing results in the smooth convex case. Additionally, we establish new upper bounds for the privacy curve of the subsampled noisy SGD algorithm. These bounds show a crucial dependency on the regularity of gradients, and are useful for a wide range of convex losses beyond the smooth case. Our analysis relies on a suitable extension of the Privacy Amplification by Iteration (PABI) framework (Feldman et al., 2018; Altschuler and Talwar, 2022, 2023) to noisy iterations whose gradient map is not necessarily nonexpansive. This extension is achieved by designing an optimization problem which accounts for the best possible Rényi divergence bound obtained by an application of PABI, where the tractability of the problem is crucially related to the modulus of continuity of the associated gradient mapping. We show that, in several interesting cases – including the nonsmooth convex, weakly smooth and (strongly) dissipative – such optimization problem can be solved exactly and explicitly. This yields the tightest possible PABI-based bounds, where our results are either new or substantially sharper than those in previous works.

[LG-54] MERCURY: A fast and versatile multi-resolution based global emulator of compound climate hazards

Link: https://arxiv.org/abs/2501.04018
Authors: Shruti Nath, Julie Carreau, Kai Kornhuber, Peter Pfleiderer, Carl-Friedrich Schleussner, Philippe Naveau
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:High-impact climate damages are often driven by compounding climate conditions. For example, elevated heat stress conditions can arise from a combination of high humidity and temperature. To explore future changes in compounding hazards under a range of climate scenarios and with large ensembles, climate emulators can provide light-weight, data-driven complements to Earth System Models. Yet, only a few existing emulators can jointly emulate multiple climate variables. In this study, we present the Multi-resolution EmulatoR for CompoUnd climate Risk analYsis: MERCURY. MERCURY extends multi-resolution analysis to a spatio-temporal framework for versatile emulation of multiple variables. MERCURY leverages data-driven, image compression techniques to generate emulations in a memory-efficient manner. MERCURY consists of a regional component that represents the monthly, regional response of a given variable to yearly Global Mean Temperature (GMT) using a probabilistic regression based additive model, resolving regional cross-correlations. It then adapts a reverse lifting-scheme operator to jointly spatially disaggregate regional, monthly values to grid-cell level. We demonstrate MERCURY’s capabilities on representing the humid-heat metric, Wet Bulb Globe Temperature, as derived from temperature and relative humidity emulations. The emulated WBGT spatial correlations correspond well to those of ESMs and the 95% and 97.5% quantiles of WBGT distributions are well captured, with an average of 5% deviation. MERCURY’s setup allows for region-specific emulations from which one can efficiently “zoom” into the grid-cell level across multiple variables by means of the reverse lifting-scheme operator. This circumvents the traditional problem of having to emulate complete, global-fields of climate data and resulting storage requirements.

Information Retrieval

[IR-0] Evaluating Interval-based Tokenization for Pitch Representation in Symbolic Music Analysis AAAI AAAI2025

Link: https://arxiv.org/abs/2501.04630
Authors: Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller
Subjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at Artificial Intelligence for Music Workshop at AAAI 2025 ( this https URL )

Abstract:Symbolic music analysis tasks are often performed by models originally developed for Natural Language Processing, such as Transformers. Such models require the input data to be represented as sequences, which is achieved through a process of tokenization. Tokenization strategies for symbolic music often rely on absolute MIDI values to represent pitch information. However, music research largely promotes the benefit of higher-level representations such as melodic contour and harmonic relations for which pitch intervals turn out to be more expressive than absolute pitches. In this work, we introduce a general framework for building interval-based tokenizations. By evaluating these tokenizations on three music analysis tasks, we show that such interval-based tokenizations improve model performances and facilitate their explainability.
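
The contrast between absolute-pitch and interval-based tokens can be sketched in a few lines. This is an illustration of the general idea only, not the framework proposed in the paper: the token names (`Pitch_`, `Interval_`) and the example melody are hypothetical.

```python
from typing import List


def absolute_tokens(pitches: List[int]) -> List[str]:
    """One token per note, named after its absolute MIDI pitch number."""
    return [f"Pitch_{p}" for p in pitches]


def interval_tokens(pitches: List[int]) -> List[str]:
    """One token per melodic step: the signed distance in semitones
    from the previous note to the current one."""
    return [f"Interval_{b - a:+d}" for a, b in zip(pitches, pitches[1:])]


melody = [60, 62, 64, 60]              # C4, D4, E4, C4 as MIDI pitches
transposed = [p + 5 for p in melody]   # same melody, five semitones higher

print(absolute_tokens(melody))
print(absolute_tokens(transposed))
print(interval_tokens(melody))
print(interval_tokens(transposed))
```

The two absolute-pitch sequences share no tokens, while the interval sequences are identical, which is the transposition invariance that makes intervals attractive for describing melodic contour.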

[IR-1] A Closer Look on Gender Stereotypes in Movie Recommender Systems and Their Implications with Privacy

Link: https://arxiv.org/abs/2501.04420
Authors: Falguni Roy, Yiduo Shen, Na Zhao, Xiaofeng Ding, Md. Omar Faruk
Subjects: Information Retrieval (cs.IR)
Comments: 19 pages, 2 figures

Abstract:The movie recommender system typically leverages user feedback to provide personalized recommendations that align with user preferences and increase business revenue. This study investigates the impact of gender stereotypes on such systems through a specific attack scenario. In this scenario, an attacker determines users’ gender, a private attribute, by exploiting gender stereotypes about movie preferences and analyzing users’ feedback data, which is either publicly available or observed within the system. The study consists of two phases. In the first phase, a user study involving 630 participants identified gender stereotypes associated with movie genres, which often influence viewing choices. In the second phase, four inference algorithms were applied to detect gender stereotypes by combining the findings from the first phase with users’ feedback data. Results showed that these algorithms performed more effectively than relying solely on feedback data for gender inference. Additionally, we quantified the extent of gender stereotypes to evaluate their broader impact on digital computational science. The latter part of the study utilized two major movie recommender datasets: MovieLens 1M and Yahoo!Movie. Detailed experimental information is available on our GitHub repository: this https URL

[IR-2] An innovative data collection method to eliminate the preprocessing phase in web usage mining

Link: https://arxiv.org/abs/2501.04364
Authors: Ozkan Canay, Umit Kocabicak
Subjects: Information Retrieval (cs.IR)
Comments: 15 pages, 8 figures

Abstract:The underlying data source for web usage mining (WUM) is commonly thought to be server logs. However, access log files ensure quite limited data about the clients. Identifying sessions from this messy data takes a considerable effort, and operations performed for this purpose do not always yield excellent results. Also, this data cannot be used for web analytics efficiently. This study proposes an innovative method for user tracking, session management, and collecting web usage data. The method is mainly based on a new approach for using collected data for web analytics extraction as the data source in web usage mining. An application-based API has been developed with a different strategy from conventional client-side methods to obtain and process log data. The log data has been successfully gathered by integrating the technique into an enterprise web application. The results reveal that the homogeneous structured data collected and stored with this method is more convenient to browse, filter, and process than web server logs. This data stored on a relational database can be used effortlessly as a reliable data source for high-performance web usage mining activity, real-time web analytics, or a functional recommendation system.

[IR-3] Advancing Similarity Search with GenAI: A Retrieval Augmented Generation Approach

Link: https://arxiv.org/abs/2501.04006
Authors: Jean Bertin
Subjects: Information Retrieval (cs.IR)

Abstract:This article introduces an innovative Retrieval Augmented Generation approach to similarity search. The proposed method uses a generative model to capture nuanced semantic information and retrieve similarity scores based on advanced context understanding. The study focuses on the BIOSSES dataset containing 100 pairs of sentences extracted from the biomedical domain, and introduces similarity search correlation results that outperform those previously attained on this dataset. Through an in-depth analysis of the model sensitivity, the research identifies optimal conditions leading to the highest similarity search accuracy: the results reveal high Pearson correlation scores, reaching specifically 0.905 at a temperature of 0.5 and a sample size of 20 examples provided in the prompt. The findings underscore the potential of generative models for semantic information retrieval and emphasize a promising research direction for similarity search.

Attachment Download

Click to download today's full paper list
