arXiv Papers Today | 2025-01-08

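This post lists the latest papers retrieved from arXiv.org on 2025-01-08. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.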

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.


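[Quick Read]: This paper addresses the calibration of large language models (LLMs), that is, the alignment between model confidence and prediction accuracy. Existing work tends to overlook how well its methods generalize across prompt styles and LLM sizes. The paper therefore defines a controlled experimental setting covering 12 LLMs and four prompt styles, and investigates whether incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration. Concretely, it proposes Calib-n, a framework that trains an auxiliary model to estimate confidence and aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, it combines focal loss and an AUC surrogate loss with binary cross-entropy. Experiments across four datasets show that both response agreement and focal loss improve calibration over baselines, and that few-shot prompts are the most effective for auxiliary-model-based methods. Auxiliary models remain robustly calibrated across accuracy variations and outperform LLMs' internal probabilities and verbalized confidences. These findings deepen the understanding of the factors that influence LLM calibration and support reliable deployment across applications.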

Link: https://arxiv.org/abs/2501.03991
Authors: Yuxi Xia, Pedro Henrique Luz de Araujo, Klim Zaporojets, Benjamin Roth
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 11 figures, 8 tables


Abstract:Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration from baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs’ internal probabilities and verbalized confidences. These insights deepen the understanding of influence factors in LLM calibration, supporting their reliable deployment in diverse applications.
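To make the loss terms mentioned above concrete, here is a minimal sketch of binary cross-entropy and binary focal loss for a confidence estimator that predicts the probability that an LLM answer is correct. The function names and the default `gamma=2.0` are assumptions for illustration; the abstract does not state the paper's hyperparameters or how the losses are weighted against each other.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """BCE for one example: p is the predicted probability of 'correct', y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    p_t is the probability assigned to the true label, so examples that are
    already predicted well (p_t close to 1) contribute very little.
    gamma=2.0 is an illustrative default, not a value taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the focal loss reduces to plain cross-entropy; larger values shift the training signal toward examples the estimator still gets wrong.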

[NLP-1] Semantically Cohesive Word Grouping in Indian Languages

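[Quick Read]: This paper addresses inconsistencies in syntactic structure that arise in the computational and linguistic processing of Indian languages. Indian languages are inflectional and agglutinative and typically follow clause-free word order. Their dependency parse trees are structurally similar across languages, yet parses can still differ because of language-specific peculiarities or preferred ways of expressing meaning. Part of the difference stems from the granularity at which the smallest semantic unit of a sentence is represented. The key to the solution is word grouping as a preprocessing step: grouping words by semantics unifies the parse structure of parallel sentences across languages and, with it, the morphology. Taking Hindi as the case study, the paper uses quantitative and qualitative analysis to verify the importance of word grouping for NLP tasks such as machine translation (MT). The experiments show that word grouping brings uniformity to syntactic structures and aids the underlying NLP tasks.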

Link: https://arxiv.org/abs/2501.03988
Authors: N J Karthika, Adyasha Patra, Nagasai Saketh Naidu, Arnab Bhattacharya, Ganesh Ramakrishnan, Chaitali Dangarikar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: none


Abstract:Indian languages are inflectional and agglutinative and typically follow clause-free word order. The structure of sentences across most major Indian languages are similar when their dependency parse trees are considered. While some differences in the parsing structure occur due to peculiarities of a language or its preferred natural way of conveying meaning, several apparent differences are simply due to the granularity of representation of the smallest semantic unit of processing in a sentence. The semantic unit is typically a word, typographically separated by whitespaces. A single whitespace-separated word in one language may correspond to a group of words in another. Hence, grouping of words based on semantics helps unify the parsing structure of parallel sentences across languages and, in the process, morphology. In this work, we propose word grouping as a major preprocessing step for any computational or linguistic processing of sentences for Indian languages. Among Indian languages, since Hindi is one of the least agglutinative, we expect it to benefit the most from word-grouping. Hence, in this paper, we focus on Hindi to study the effects of grouping. We perform quantitative assessment of our proposal with an intrinsic method that perturbs sentences by shuffling words as well as an extrinsic evaluation that verifies the importance of word grouping for the task of Machine Translation (MT) using decomposed prompting. We also qualitatively analyze certain aspects of the syntactic structure of sentences. Our experiments and analyses show that the proposed grouping technique brings uniformity in the syntactic structures, as well as aids underlying NLP tasks.

[NLP-2] Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States ALT

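[Quick Read]: This paper asks how locally deployable open-weight large language models (LLMs) can support lesser-spoken languages such as Lithuanian, Latvian, and Estonian in settings, such as the EU, where data privacy concerns restrict the use of commercially hosted models. It evaluates several open-weight models (Llama 3, Gemma 2, Phi, and NeMo) on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models such as Gemma 2 perform close to the top commercial models, many open-weight models still struggle with these languages. In particular, all open-weight multilingual models are prone to lexical hallucinations, with errors in at least 1 in 20 words. The key to the solution is to use locally deployed open-weight models to balance data privacy against language-processing needs, while identifying and addressing their performance bottlenecks on lesser-spoken languages.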

Link: https://arxiv.org/abs/2501.03952
Authors: Jurgita Kapočiūtė-Dzikienė, Toms Bergmanis, Mārcis Pinnis
Affiliations: Tilde IT, Lithuania; Faculty of Informatics, Vytautas Magnus University, Lithuania; Tilde, Latvia; Faculty of Computing, University of Latvia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper is accepted to NoDaLiDa/Baltic-HLT 2025


Abstract: Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defence, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

[NLP-3] Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection

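[Quick Read]: This paper addresses the detection of text generated by large language models (LLMs), which is especially difficult in unseen domains or with unfamiliar LLMs. Existing zero-shot methods that use the LLM's next-token distribution outputs have had limited success. The authors hypothesize that one reason is that these methods aggregate next-token distribution metrics across tokens with a mean, ignoring that some tokens are naturally easier or harder to predict. The proposed solution is the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight the sum of a series of features based on next-token distribution metrics. Because hidden states and metrics can be cached on disk, PAWN reduces training resource requirements. It performs competitively in-distribution, in some cases better than fine-tuned language models, while generalizing better to unseen domains and source models and being more robust to adversarial attacks. With a multilingual backbone, it also generalizes reasonably well to languages not seen during supervised training.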

Link: https://arxiv.org/abs/2501.03940
Authors: Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
Affiliations: Technical University of Madrid
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none


Abstract:The rapid advancement in large language models (LLMs) has significantly enhanced their ability to generate coherent and contextually relevant text, raising concerns about the misuse of AI-generated content and making it critical to detect it. However, the task remains challenging, particularly in unseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution outputs offers a theoretically appealing approach for detection, as they encapsulate insights from the models’ extensive pre-training on diverse corpora. Despite its promise, zero-shot methods that attempt to operationalize these outputs have met with limited success. We hypothesize that one of the problems is that they use the mean to aggregate next-token distribution metrics across tokens, when some tokens are naturally easier or harder to predict and should be weighted differently. Based on this idea, we propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. Although not zero-shot, our method allows us to cache the last hidden states and next-token distribution metrics on disk, greatly reducing the training resource requirements. PAWN shows competitive and even better performance in-distribution than the strongest baselines (fine-tuned LMs) with a fraction of their trainable parameters. Our model also generalizes better to unseen domains and source models, with smaller variability in the decision boundary across distribution shifts. It is also more robust to adversarial attacks, and if the backbone has multilingual capabilities, it presents decent generalization to languages not seen during supervised training, with LLaMA3-1B reaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine languages.
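The contrast between mean aggregation and weighted aggregation described above can be sketched in a few lines. This is an illustration only: in PAWN the per-token weights come from a trained network over hidden states and positions, whereas here `token_scores` is simply passed in, and the softmax normalization is an assumption rather than something the abstract specifies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_aggregate(token_metrics):
    """Baseline: every token contributes equally to the sequence-level feature."""
    return sum(token_metrics) / len(token_metrics)


def weighted_aggregate(token_metrics, token_scores):
    """Weighted sum of per-token metrics (e.g. log-probabilities).

    token_scores stands in for the output of a learned weighting network;
    higher scores give the corresponding token more influence.
    """
    weights = softmax(token_scores)
    return sum(w * m for w, m in zip(weights, token_metrics))
```

When all scores are equal the weighted sum reduces to the mean, so the mean-based zero-shot detectors are a special case of this form.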

[NLP-4] PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

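[Quick Read]: This paper addresses the challenges of automatically generating presentations from documents, in particular balancing content quality, visual design, and structural coherence. Existing methods focus mainly on improving and evaluating content quality in isolation and often overlook visual design and structural coherence, which limits their practical applicability. To address these limitations, the paper proposes PPTAgent, a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. The paper also introduces PPTEval, an evaluation framework that assesses generated presentations along three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods on all three dimensions.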

Link: https://arxiv.org/abs/2501.03936
Authors: Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shanghai Jiexin Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 20 figures


Abstract:Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at this https URL.

[NLP-5] From Newswire to Nexus: Using text-based actor embeddings and transformer networks to forecast conflict dynamics

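[Quick Read]: This paper aims to forecast dynamic changes in patterns of violent conflict, specifically escalation and de-escalation among conflict actors such as governments, militias, separatist movements, and terrorist organizations. According to the authors, existing methods struggle to capture the volatility of violent conflict. The paper combines newswire text with structured conflict event data and uses transformer models from natural language processing (NLP) to generate text-based actor embeddings for forecasting conflict dynamics. The key to the solution is combining the contextual information of news text with the precision of structured event data in a hybrid dataset; back-testing against historical events shows superior out-of-sample predictive power. The approach yields dynamic and granular predictions of conflict developments and offers actionable insights for policymakers, humanitarian organizations, and peacekeeping operations in support of targeted and effective interventions.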

Link: https://arxiv.org/abs/2501.03928
Authors: Mihai Croicu, Simon Polichinel von der Maase
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 35 pages, 5 figures. Paper presented at the 120th American Political Science Association Annual Meeting


Abstract:This study advances the field of conflict forecasting by using text-based actor embeddings with transformer models to predict dynamic changes in violent conflict patterns at the actor level. More specifically, we combine newswire texts with structured conflict event data and leverage recent advances in Natural Language Processing (NLP) techniques to forecast escalations and de-escalations among conflicting actors, such as governments, militias, separatist movements, and terrorists. This new approach accurately and promptly captures the inherently volatile patterns of violent conflicts, which existing methods have not been able to achieve. To create this framework, we began by curating and annotating a vast international newswire corpus, leveraging hand-labeled event data from the Uppsala Conflict Data Program. By using this hybrid dataset, our models can incorporate the textual context of news sources along with the precision and detail of structured event data. This combination enables us to make both dynamic and granular predictions about conflict developments. We validate our approach through rigorous back-testing against historical events, demonstrating superior out-of-sample predictive power. We find that our approach is quite effective in identifying and predicting phases of conflict escalation and de-escalation, surpassing the capabilities of traditional models. By focusing on actor interactions, our explicit goal is to provide actionable insights to policymakers, humanitarian organizations, and peacekeeping operations in order to enable targeted and effective intervention strategies.

[NLP-6] Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

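[Quick Read]: This paper addresses the transformation of the scientific research paradigm brought about by artificial intelligence (AI), and in particular how AI-assisted methods can move toward automatic scientific research. It proposes Dolphin, described as the first closed-loop, open-ended auto-research framework, which aims to cover the entire process of human scientific research. Dolphin works in three steps. First, it generates novel research ideas from relevant papers, which are ranked by topic and task attributes. Second, it automatically generates code and debugs it using an exception-traceback-guided local code structure. Third, it automatically analyzes the experimental results of each idea and feeds them back into the next round of idea generation. Experiments on benchmark datasets from different topics show that Dolphin can generate novel ideas continuously and complete experiments in a loop, and that the methods it proposes are comparable to the state of the art on some tasks, such as 2D image classification and 3D point classification.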

Link: https://arxiv.org/abs/2501.03916
Authors: Jiakang Yuan, Xiangchao Yan, Botian Shi, Tao Chen, Wanli Ouyang, Bo Zhang, Lei Bai, Yu Qiao, Bowen Zhou
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures, and our homepage: this https URL


Abstract:The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we propose Dolphin, the first closed-loop open-ended auto-research framework to further build the entire process of human scientific research. Dolphin can generate research ideas, perform experiments, and get feedback from experimental results to generate higher-quality ideas. More specifically, Dolphin first generates novel ideas based on relevant papers which are ranked by the topic and task attributes. Then, the codes are automatically generated and debugged with the exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and results show that Dolphin can generate novel ideas continuously and complete the experiment in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 2D image classification and 3D point classification.

[NLP-7] LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

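[Quick Read]: This paper addresses the high computational overhead that real-time large multimodal models (LMMs) incur when processing visual inputs. LMM frameworks typically encode visual inputs into vision tokens and integrate them with textual instructions into the context of a large language model (LLM), so the large parameter count and the many context tokens (predominantly vision tokens) create a substantial computational burden. Previous work has focused on replacing the LLM backbone with smaller models while neglecting the number of tokens. The proposed LLaVA-Mini introduces modality pre-fusion, which fuses visual information into text tokens in advance so that the vision tokens fed to the LLM backbone can be compressed into a single token. Experiments across 11 image-based and 7 video-based benchmarks show that LLaVA-Mini with 1 vision token outperforms LLaVA-v1.5 with 576 vision tokens. LLaVA-Mini also reduces FLOPs by 77%, delivers low-latency responses within 40 milliseconds, and can process over 10,000 video frames on GPU hardware with 24GB of memory.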

Link: https://arxiv.org/abs/2501.03895
Authors: Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code: this https URL Model: this https URL


Abstract:The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
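As a rough picture of what compressing many vision tokens into one can look like, the sketch below pools a list of token vectors into a single vector with dot-product attention against one query. This is an assumption made for illustration: the abstract says only that vision tokens are compressed into one token, and does not describe the compression module, so `attention_pool` and its `query` argument are hypothetical names, not the paper's design.

```python
import math


def attention_pool(tokens, query):
    """Compress a list of token vectors into one vector.

    tokens: list of equal-length float lists (e.g. 576 vision tokens).
    query:  a single vector of the same length, which would be learned in
            a real model; here it is just an input.
    """
    dim = len(query)
    # Scaled dot-product score of each token against the single query.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the tokens, one output value per dimension.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
```

Whatever the number of input tokens, the output is one vector of the same width, which is why the context length seen by the LLM backbone no longer grows with image resolution or frame count in this kind of scheme.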

[NLP-8] AlphaPO – Reward shape matters for LLM alignment

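[Quick Read]: This paper addresses likelihood displacement in Direct Alignment Algorithms (DAAs), a phenomenon in which the probabilities of preferred responses are undesirably reduced during alignment. It proposes AlphaPO, a new DAA whose key idea is an α parameter that changes the shape of the reward function beyond the standard log reward. This gives finer control over likelihood displacement and over-optimization. Compared with SimPO, one of the best-performing DAAs, AlphaPO achieves about 7% to 10% relative improvement in alignment performance on the instruct versions of Mistral-7B and Llama3-8B. The paper stresses the importance of the reward shape and shows how systematically changing it affects training dynamics and improves alignment performance.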

Link: https://arxiv.org/abs/2501.03884
Authors: Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Jason Zhu, Natesh Pillai, S. Sathiya Keerthi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Preprint. Work in progress


Abstract: Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Examples include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs the reward (function) shape matters. We introduce AlphaPO, a new DAA method that leverages an α-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
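To illustrate what "changing the shape of the reward beyond the log reward" can mean, the sketch below uses a one-parameter family that recovers the log reward in the limit α → 0. This particular functional form is an assumption chosen for illustration; the abstract does not give AlphaPO's exact reward, length normalization, or margin terms, and `shaped_reward` and `beta` are hypothetical names.

```python
import math


def shaped_reward(prob, alpha, beta=1.0):
    """Illustrative alpha-shaped reward of a response probability.

    For alpha == 0 this is the standard log reward beta * log(prob).
    For alpha != 0 it is beta * (1 - prob ** (-alpha)) / alpha, which
    converges to the log reward as alpha approaches 0.
    NOTE: assumed form for illustration, not taken from the paper.
    """
    if alpha == 0.0:
        return beta * math.log(prob)
    return beta * (1.0 - prob ** (-alpha)) / alpha
```

In this family the reward still increases with the response probability for any α, but α controls how steeply low-probability responses are penalized relative to the log reward, which is the kind of knob the abstract describes.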

[NLP-9] Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection COLING2025

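[Quick Read]: This paper addresses slot and intent detection (SID) for Norwegian dialects in a low-resource setting. Although SID is a classic natural language understanding (NLU) task, research on dialectal and colloquial varieties is comparatively recent, and many low-resource approaches have not yet been applied to dialectal data or compared on the same datasets. As participants in the VarDial 2025 shared task, the authors compare several set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique that assembles a model from layers of models fine-tuned on different datasets. Noise injection proves beneficial, while the effects of auxiliary tasks are mixed. Layer Swapping required some experimentation but worked surprisingly well: a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. The best models reach 97.6% intent accuracy and 85.6% slot F1 in the shared task.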

Link: https://arxiv.org/abs/2501.03870
Authors: Verena Blaschke, Felicia Körner, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
Categories: Computation and Language (cs.CL)
Comments: VarDial @ COLING 2025


Abstract:Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have not yet been applied to dialectal SID data, or compared to each other on the same datasets. We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties, and compare multiple set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique in which layers of models fine-tuned on different datasets are assembled into a model. We find noise injection to be beneficial while the effects of auxiliary tasks are mixed. Though some experimentation was required to successfully assemble a model from layers, it worked surprisingly well; a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. Our best models achieve 97.6% intent accuracy and 85.6% slot F1 in the shared task.
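Character-level noise injection, one of the set-ups compared above, can be sketched as random deletions, replacements, and insertions applied to letters of the training text. The operations, the `noise_rate` default, and the alphabet below are assumptions for illustration; the abstract does not say which noise operations or rates the authors used.

```python
import random


def inject_char_noise(text, noise_rate=0.1, alphabet="abcdefghijklmnopqrstuvwxyzæøå", seed=None):
    """Return a copy of text with random character-level edits.

    Each alphabetic character is, with probability noise_rate, deleted,
    replaced by a random letter, or followed by an inserted random letter.
    Whitespace and punctuation are never touched.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < noise_rate:
            op = rng.choice(("delete", "replace", "insert"))
            if op == "delete":
                continue  # drop the character
            if op == "replace":
                out.append(rng.choice(alphabet))
                continue
            # insert: keep the character and add a random letter after it
            out.append(ch)
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `inject_char_noise("sett på en alarm", noise_rate=0.2, seed=0)` yields a perturbed variant of the sentence; the intuition is that such perturbations loosely mimic the spelling variation found in non-standardized dialect writing.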

[NLP-10] Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study COLING2025

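[Quick Read]: This paper addresses the reliability of slot and intent detection (SID) in natural language understanding (NLU) on dialectal data. Encoder-only transformer models fine-tuned on high-resource languages struggle with dialects, which have no standardized form and for which training data is scarce and costly to produce. The paper explores zero-shot transfer learning for SID with a focus on Bavarian dialects and releases a new dataset for the Munich dialect. The key to the solution is training on auxiliary tasks, comparing joint multi-task learning with intermediate-task training across three types of auxiliary task: token-level syntactic tasks, named entity recognition (NER), and language modelling. The auxiliary tasks help slot filling more than intent classification, with NER having the most positive effect, and intermediate-task training yields more consistent gains. The best approach improves intent classification on Bavarian dialects by 5.1 percentage points and slot filling F1 by 8.4 percentage points.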

Link: https://arxiv.org/abs/2501.03863
Authors: Xaver Maria Krückl, Verena Blaschke, Barbara Plank
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: VarDial @ COLING 2025


Abstract:Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

[NLP-11] Progressive Document-level Text Simplification via Large Language Models

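[Quick Read]: This paper addresses the challenges of long document-level simplification (DS), in particular the unsatisfactory performance of large language models (LLMs) such as ChatGPT on this task. Existing models often treat document simplification as mere document summarization and neglect the need to stay consistent with the original document while applying moderate simplification at the discourse, sentence, and word levels. The paper proposes a progressive simplification method (ProgDS) that uses multi-stage collaboration among LLMs to simulate the hierarchical complexity simplification strategy of human editors. It decomposes the task into discourse-level, topic-level, and lexical-level simplification and simplifies the document step by step. Experiments show that ProgDS significantly outperforms existing smaller models and direct prompting of LLMs, advancing the state of the art in document simplification.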

Link: https://arxiv.org/abs/2501.03857
Authors: Dengzhao Fang, Jipeng Qiang, Yi Zhu, Yunhao Yuan, Wei Li, Yan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: none


Abstract:Research on text simplification has primarily focused on lexical and sentence-level changes. Long document-level simplification (DS) is still relatively unexplored. Large Language Models (LLMs), like ChatGPT, have excelled in many natural language processing tasks. However, their performance on DS tasks is unsatisfactory, as they often treat DS as merely document summarization. For the DS task, the generated long sequences not only must maintain consistency with the original document throughout, but complete moderate simplification operations encompassing discourses, sentences, and word-level simplifications. Human editors employ a hierarchical complexity simplification strategy to simplify documents. This study delves into simulating this strategy through the utilization of a multi-stage collaboration using LLMs. We propose a progressive simplification method (ProgDS) by hierarchically decomposing the task, including the discourse-level, topic-level, and lexical-level simplification. Experimental results demonstrate that ProgDS significantly outperforms existing smaller models or direct prompting with LLMs, advancing the state-of-the-art in the document simplification task.

[NLP-12] BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context

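[Quick Read]: This paper addresses the challenge of training language models for low-resource languages, where available corpora are limited. Using isiXhosa as a case study, it explores the potential of data-efficient language models for such languages. The key to the solution is to reuse architectures that emerged from the BabyLM challenge, in which submissions were pretrained on a fixed English corpus of 100m words. The authors pretrain two of these architectures, ELC-BERT and MLSM, on an isiXhosa corpus. Both outperform a vanilla pretrained model on POS tagging and named entity recognition (NER), with a notable gain of +3.2 F1 on NER, and in some instances they even outperform XLM-R. The findings show that data-efficient models are viable for low-resource languages, while highlighting the continued importance, and the lack, of high-quality pretraining data.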

Link: https://arxiv.org/abs/2501.03855
Authors: Alexis Matzopoulos, Charl Hendriks, Hishaam Mahomed, Francois Meyer
Affiliations: University of Cape Town
Categories: Computation and Language (cs.CL)
Comments: none


Abstract:The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the amount of words children are exposed to in development (100m). The challenge produced new architectures for data-efficient language modelling, which outperformed models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to much less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

[NLP-13] BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study COLING2025 ACL

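[Quick Read]: This paper addresses topic modeling of short texts in native languages such as Hindi. As such texts appear increasingly in modern media, robust topic modeling methods for them have gained importance. The key to the solution is BERTopic, which uses contextual embeddings to capture semantic relationships in the data and is therefore potentially more effective than traditional models on short and diverse texts. The study evaluates BERTopic with 6 document embedding models and compares it against 8 established topic modeling techniques (such as LDA, NMF, and LSI) using coherence scores across a range of topic counts. BERTopic consistently outperforms the other models in extracting coherent topics from short Hindi texts.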

Link: https://arxiv.org/abs/2501.03843
Authors: Atharva Mutsaddi, Anvi Jamkhande, Aryan Thakre, Yashodhara Haribhakta
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted into IndoNLP: The First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages, collocated with COLING 2025. Set to appear in the workshop proceedings published in ACL Anthology


Abstract:As short text data in native languages like Hindi increasingly appear in modern media, robust methods for topic modeling on such data have gained importance. This study investigates the performance of BERTopic in modeling Hindi short texts, an area that has been under-explored in existing research. Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models, especially for short and diverse texts. We evaluate BERTopic using 6 different document embedding models and compare its performance against 8 established topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive Regularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis (PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec. The models are assessed using coherence scores across a range of topic counts. Our results reveal that BERTopic consistently outperforms other models in capturing coherent topics from short Hindi texts.

[NLP-14] TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification

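[Quick Read]: This paper addresses key challenges in Product Attribute Value Identification (PAVI): inferring implicit attribute values, handling out-of-distribution (OOD) values, and producing normalized outputs. Existing methods fall short in these respects. The paper proposes a retrieval-based solution, Taxonomy-Aware Contrastive Learning Retrieval (TACLR). TACLR formulates PAVI as an information retrieval task: it encodes product profiles and candidate attribute values into embeddings and retrieves values by their similarity to the item embedding. Its key ingredients are contrastive training with taxonomy-aware hard negative sampling and adaptive inference with dynamic thresholds. TACLR handles implicit and OOD values, scales to thousands of categories, tens of thousands of attributes, and millions of values, and supports efficient inference in high-load industrial scenarios. Experiments on proprietary and public datasets validate the method, and it has been deployed on a real-world e-commerce platform, where it processes millions of product listings daily.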

Link: https://arxiv.org/abs/2501.03835
Authors: Yindu Su, Huike Zou, Lin Sun, Ting Zhang, Haiyang Yang, Liyu Chen, David Lo, Qingheng Zhang, Shuguang Han, Jufeng Chen
Affiliations: Alibaba Group; Singapore Management University; Hangzhou City University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: none


Abstract:Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendations, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity to the item embedding. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial scenarios. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Moreover, it has been successfully deployed in a real-world e-commerce platform, processing millions of product listings daily while supporting dynamic, large-scale attribute taxonomies.
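The retrieval formulation above can be sketched as a nearest-neighbor lookup over candidate value embeddings with a similarity threshold. In this sketch the embeddings are plain lists of floats and the threshold is passed in by the caller; how TACLR trains its encoder and computes its dynamic thresholds is not described in the abstract, so `retrieve_value` and its arguments are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_value(item_emb, candidates, threshold):
    """Return the candidate value most similar to the item, or None.

    candidates: dict mapping a normalized attribute value (str) to its embedding.
    threshold:  minimum similarity required to accept the best match; returning
                None models the case where the item has no value for the attribute.
    """
    best_value, best_score = None, -1.0
    for value, emb in candidates.items():
        score = cosine(item_emb, emb)
        if score > best_score:
            best_value, best_score = value, score
    return best_value if best_score >= threshold else None
```

Because the output is always one of the candidate values, it is normalized by construction, and adding a new value to `candidates` makes it retrievable without retraining a classifier head, which is one way to read the abstract's claims about normalized outputs and OOD values.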

[NLP-15] Investigating the Impact of Data Selection Strategies on Language Model Performance

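[Quick read]: This paper examines how data selection affects language model performance, in particular how choosing a subset of the training data can bring it closer to a target distribution. It evaluates different selection methods and feature types, asking (1) whether selecting data subsets influences downstream task performance, (2) whether n-gram features improve alignment with the target distribution, and (3) whether embedding-based neural features provide complementary benefits. Comparative experiments set a random-selection baseline against distribution-aligned approaches and shed light on the interplay between data selection strategy and training efficacy. The core contribution is an empirical validation of these selection methods, which offers guidance for model training.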

Link: https://arxiv.org/abs/2501.03826
Authors: Jiayao Gu,Liting Chen,Yihong Li
Affiliations: McGill University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 1 figure

Abstract:Data selection is critical for enhancing the performance of language models, particularly when aligning training datasets with a desired target distribution. This study explores the effects of different data selection methods and feature types on model performance. We evaluate whether selecting data subsets can influence downstream tasks, whether n-gram features improve alignment with target distributions, and whether embedding-based neural features provide complementary benefits. Through comparative experiments using baseline random selection methods and distribution-aligned approaches, we provide insights into the interplay between data selection strategies and model training efficacy. All code for this study can be found in the GitHub repository at this https URL.
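
To make the idea of n-gram-based distribution alignment concrete, here is a small sketch. It is an assumption-laden toy, not the authors' pipeline: the helper names (`ngrams`, `alignment_score`, `select_aligned`) are hypothetical, tokenization is plain whitespace splitting, and the score is simply the share of a candidate's unigrams and bigrams that also occur in the target texts.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Count word n-grams of length 1..n_max in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def alignment_score(candidate, target_profile):
    """Fraction of the candidate's n-gram occurrences that appear in the target profile."""
    feats = ngrams(candidate)
    total = sum(feats.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in feats.items() if gram in target_profile) / total


def select_aligned(pool, target_texts, k):
    """Pick the k pool documents whose n-grams best match the target texts."""
    target_profile = Counter()
    for text in target_texts:
        target_profile.update(ngrams(text))
    ranked = sorted(pool, key=lambda doc: alignment_score(doc, target_profile), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = ["the cat sat", "stock prices fell", "a cat sat down"]
    print(select_aligned(pool, ["the cat sat on the mat"], k=2))
```

A random-selection baseline would replace the ranking with `random.sample(pool, k)`, which is the comparison the abstract describes.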

[NLP-16] Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits

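[Quick read]: This paper addresses the misuse of neural speech editing in spoofing attacks. Existing partially edited speech corpora focus on cut-and-paste edits, which preserve speaker consistency but often introduce detectable discontinuities. The authors introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox, which uses contextual information to produce smoother transitions. Subjective evaluations show that speech edited this way is harder to detect than cut-and-paste edits. Although humans struggle to spot the edits, detectors based on self-supervised learning perform strongly at detection, localization, and generalization across edit methods. The key contribution is the SINE dataset, together with a documented re-implementation of Voicebox training and the dataset creation process, intended to foster spoofing detection research.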

Link: https://arxiv.org/abs/2501.03805
Authors: Sung-Feng Huang,Heng-Cheng Kuo,Zhehuai Chen,Xuesong Yang,Chao-Han Huck Yang,Yu Tsao,Yu-Chiang Frank Wang,Hung-yi Lee,Szu-Wei Fu
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: SLT 2024

Abstract:Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and dataset creation. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.

[NLP-17] How to Select Pre-Trained Code Models for Reuse? A Learning Perspective

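[Quick read]: Pre-training language models on large-scale code corpora is computationally expensive, so this paper asks how to choose, among the many publicly available Pre-trained Code Models (PCMs), the one best suited to a given code intelligence task such as code generation, code summarization, or vulnerability detection. The key idea is learning-based model selection: proxy models are trained to gauge each pre-trained model's performance, and the distribution deviation between a model's latent features and the task's labels serves as an indicator of transferability. Experiments show that learning-based selection cuts selection time from the 2,700 hours needed for brute-force fine-tuning to 100 seconds, with less than 6% performance degradation across related tasks.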

Link: https://arxiv.org/abs/2501.03783
Authors: Zhangqian Bi,Yao Wan,Zhaoyang Chu,Yufei Hu,Junyi Zhang,Hongyu Zhang,Guandong Xu,Hai Jin
Affiliations: National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, Wuhan, China; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China; School of Big Data and Software Engineering, Chongqing University, Chongqing, China; School of Computer Science, University of Technology Sydney, Sydney, Australia
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Accepted by IEEE SANER 2025

Abstract:Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, training data, or brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or suffer high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and measure the distribution deviation between a model’s latent features and the task’s labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.
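
One way to picture proxy-based selection is to score each candidate model by how well a very cheap classifier does on its frozen features, then rank the models by that score. The sketch below assumes this reading: `proxy_score` and `rank_models` are hypothetical names, the proxy is a leave-one-out nearest-centroid classifier chosen only for brevity, and the features are toy vectors. It is not the estimator used in the paper.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def proxy_score(features, labels):
    """Leave-one-out nearest-centroid accuracy on frozen features.

    Higher means the features separate the task labels more cleanly,
    which is used here as a rough transferability signal.
    """
    correct = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        by_class = {}
        for j, (feat, label) in enumerate(zip(features, labels)):
            if j != i:  # hold out sample i
                by_class.setdefault(label, []).append(feat)
        centroids = {label: centroid(rows) for label, rows in by_class.items()}
        predicted = min(centroids, key=lambda label: sq_dist(x, centroids[label]))
        correct += predicted == y
    return correct / len(features)


def rank_models(model_features, labels):
    """Order model names from most to least promising under the proxy score."""
    return sorted(model_features, key=lambda name: proxy_score(model_features[name], labels), reverse=True)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    feats = {
        "model_a": [[0, 0], [0, 1], [5, 5], [5, 6]],  # classes well separated
        "model_b": [[0, 0], [5, 5], [0, 1], [5, 6]],  # classes mixed up
    }
    print(rank_models(feats, labels))
```

No model weights are updated here, which mirrors the abstract's point that learning-based selection leaves the pre-trained parameters untouched.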

[NLP-18] Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series

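[Quick read]: This paper asks how to use pre-trained large language models (LLMs) effectively for time series (TS) tasks. Existing methods typically activate LLM capabilities through token-level alignment, overlooking the deeper strength of LLMs in natural language processing: their understanding of linguistic logic and structure rather than superficial embedding processing. The paper proposes a new paradigm, Context-Alignment, which aligns time series with a linguistic component in a language environment familiar to LLMs, so that the model can contextualize and comprehend TS data. Context-Alignment comprises structural alignment and logical alignment, both implemented by Dual-Scale Context-Alignment GNNs (DSCA-GNNs). Structural alignment uses dual-scale nodes to describe the hierarchical structure of TS-language inputs, letting the LLM treat long TS data as a single linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships and keep the contextual semantics coherent. The paper also proposes Demonstration Examples based Context-Alignment (DECA), which can be integrated flexibly into various layers of a pre-trained LLM to improve its awareness of logic and structure. Experiments show that DECA performs well, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment supplies strong prior knowledge about context.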

Link: https://arxiv.org/abs/2501.03747
Authors: Yuxiao Hu,Qian Li,Dongxiao Zhang,Jinyue Yan,Yuntian Chen
Affiliations: The Hong Kong Polytechnic University; Ningbo Institute of Digital Twin, Eastern Institute of Technology; Shanghai Jiao Tong University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Applications (stat.AP)
Comments: no comment

Abstract:Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs’ capabilities. Many methods aim to activate LLMs’ capabilities based on token-level alignment but overlook LLMs’ inherent strength in natural language processing: their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment, a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which are achieved by Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs to treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Demonstration example prompts are employed to construct Demonstration Examples based Context-Alignment (DECA) following the DSCA-GNNs framework. DECA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provides powerful prior knowledge on context.

[NLP-19] Unsupervised Speech Segmentation: A General Approach Using Speech Language Models

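[Quick read]: This paper tackles unsupervised speech segmentation, specifically for utterances that contain several kinds of acoustic-semantic style change. Traditional speech segmentation focuses on spectral changes in the input signal (for example, phone segmentation). The proposed approach instead segments an utterance into chunks with differing acoustic-semantic styles, focusing on information that does not translate well into text, such as emotion or speaker. Whereas most speech segmentation tasks handle a single style change (for example, emotion diarization), this approach aims to handle multiple ones. The key to the solution is a simple unsupervised method built on recent Speech Language Models (SLMs). Results suggest that it outperforms the evaluated baselines on boundary detection, segment purity, and over-segmentation.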

Link: https://arxiv.org/abs/2501.03711
Authors: Avishai Elmakies,Omri Abend,Yossi Adi
Affiliations: Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.

[NLP-20] SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment COLING2025

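[Quick read]: This paper targets the performance bottleneck of large language models (LLMs) on multilingual reasoning. Despite marked progress on English reasoning tasks, LLMs still struggle in other languages. Existing methods use a full-parameter, two-stage training paradigm that first teaches the model to understand non-English questions and then to reason, but this is computationally expensive and prone to catastrophic forgetting. The paper proposes SLAM, an efficient multilingual reasoning alignment method whose key step is to precisely identify and fine-tune the layers responsible for handling multilingualism. In experiments, SLAM tunes only the feed-forward sub-layers of 6 layers in 7B and 13B LLMs, amounting to 6.5-8% of all parameters, and achieves better average performance than all strong baselines across 10 languages. SLAM also needs only one training stage, cutting training time by a factor of 4.1-11.9 relative to the two-stage method.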

Link: https://arxiv.org/abs/2501.03681
Authors: Yuchun Fan,Yongyu Mu,Yilin Wang,Lei Huang,Junhao Ruan,Bei Li,Tong Xiao,Shujian Huang,Xiaocheng Feng,Jingbo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by COLING 2025 (Oral)

Abstract:Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter and two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational resource consumption and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our findings that the representation learning of languages is merely conducted in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs, while achieving better average performance than all strong baselines across 10 languages. Meanwhile, SLAM only involves one training stage, reducing training time by 4.1-11.9× compared to the two-stage method.
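
The idea of tuning only the feed-forward sub-layers of a few layers can be illustrated with a name-based filter over a model's parameters. This is a hedged sketch only: the parameter naming scheme (`layers.<i>.mlp.`), the function `select_trainable`, and the toy parameter sizes are all assumptions for illustration, and how SLAM actually identifies the relevant layers is not shown.

```python
def select_trainable(param_sizes, layer_ids, ffn_marker="mlp"):
    """Choose which parameters to fine-tune and report their share of the model.

    `param_sizes` maps a parameter name to its element count. A parameter is
    selected when its name contains `layers.<i>.` for a chosen layer index and
    the feed-forward marker (hypothetical naming convention).
    """
    trainable = {
        name
        for name in param_sizes
        if f".{ffn_marker}." in name and any(f"layers.{i}." in name for i in layer_ids)
    }
    total = sum(param_sizes.values())
    fraction = sum(param_sizes[name] for name in trainable) / total
    return trainable, fraction


if __name__ == "__main__":
    toy_model = {
        "embed.weight": 100,
        "layers.0.attn.w": 50,
        "layers.0.mlp.w": 150,
        "layers.1.attn.w": 50,
        "layers.1.mlp.w": 150,
    }
    names, share = select_trainable(toy_model, layer_ids=[0])
    print(sorted(names), share)
```

In a real framework, every parameter outside the returned set would be frozen before training, so only the selected share of weights receives gradient updates.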

[NLP-21] A Diversity-Enhanced Knowledge Distillation Model for Practical Math Word Problem Solving

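[Quick read]: This paper addresses the generation of diverse yet corresponding solution equations in Math Word Problem (MWP) solving. Existing Seq2Seq models and their extensions (such as Seq2Tree and Graph2Tree) are effective but struggle to produce diverse solution equations, which limits their generalization across math problem scenarios. The key to the proposed solution is a Diversity-enhanced Knowledge Distillation (DivKD) model. Its adaptive diversity distillation method lets a student model learn diverse equations by selectively transferring high-quality knowledge from a teacher model. The paper also designs a diversity prior-enhanced student model that incorporates a conditional variational auto-encoder to better capture the diversity distribution of equations. On four MWP benchmark datasets, the approach achieves higher answer accuracy than strong baselines while remaining efficient enough for practical use.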

Link: https://arxiv.org/abs/2501.03670
Authors: Yi Zhang,Guangyou Zhou,Zhiwen Xie,Jinjin Ma,Jimmy Xiangji Huang
Affiliations: Zhongnan University of Economics and Law; Central China Normal University; York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Math Word Problem (MWP) solving, a critical task in natural language processing, has garnered significant research interest in recent years. Various recent studies heavily rely on Seq2Seq models and their extensions (e.g., Seq2Tree and Graph2Tree) to generate mathematical equations. While effective, these models struggle to generate diverse but counterpart solution equations, limiting their generalization across various math problem scenarios. In this paper, we introduce a novel Diversity-enhanced Knowledge Distillation (DivKD) model for practical MWP solving. Our approach proposes an adaptive diversity distillation method, in which a student model learns diverse equations by selectively transferring high-quality knowledge from a teacher model. Additionally, we design a diversity prior-enhanced student model to better capture the diversity distribution of equations by incorporating a conditional variational auto-encoder. Extensive experiments on four MWP benchmark datasets demonstrate that our approach achieves higher answer accuracy than strong baselines while maintaining high efficiency for practical applications.

[NLP-22] LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment

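[Quick read]: This paper aims to automate the assessment of depression severity, in particular to make mental health assessment more accessible in resource-limited settings. The key to the solution is to use open-source large language models (LLMs) with a zero-shot prompting strategy and carefully designed cues that guide the model in interpreting and scoring transcribed clinical interviews. The study uses the Montgomery-Asberg Depression Rating Scale (MADRS) as its instrument and tests on 236 real-world interviews from the CAMI dataset. The Qwen 2.5–72b model reaches near-human agreement on most MADRS items, with Intraclass Correlation Coefficients (ICC) close to those between human raters. Symptoms that rely on non-verbal cues remain challenging, but the results suggest that appropriately prompted LLMs can serve as efficient tools for mental health assessment, and that future work should explore multimodal approaches.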

Link: https://arxiv.org/abs/2501.03624
Authors: Gaoussou Youssouf Kebe,Jeffrey M. Girard,Einat Liebenthal,Justin Baker,Fernando De la Torre,Louis-Philippe Morency
Affiliations: Carnegie Mellon University; University of Kansas; McLean Hospital
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment using the Montgomery-Asberg Depression Rating Scale (MADRS). We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews from the Context-Adaptive Multimodal Informatics (CAMI) dataset, demonstrates strong correlations with clinician assessments. The Qwen 2.5–72b model achieves near-human level agreement across most MADRS items, with Intraclass Correlation Coefficients (ICC) closely approaching those between human raters. We provide a comprehensive analysis of model performance across different MADRS items, highlighting strengths and current limitations. Our findings suggest that LLMs, with appropriate prompting, can serve as efficient tools for mental health assessment, potentially increasing accessibility in resource-limited settings. However, challenges remain, particularly in assessing symptoms that rely on non-verbal cues, underscoring the need for multimodal approaches in future work.

[NLP-23] Discriminative Representation learning via Attention-Enhanced Contrastive Learning for Short Text Clustering

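[Quick read]: This paper addresses the "false negative separation" problem of contrastive learning in short text clustering: samples from the same category are mistakenly treated as negatives and pushed apart in the feature space, which hinders the learning of better representations. It proposes Discriminative Representation learning via Attention-Enhanced Contrastive Learning (AECL). AECL has two modules, a pseudo-label generation module and a contrastive learning module. Both use a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features into consistent representations. The pseudo-label generation module uses these more discriminative consistent representations to produce reliable supervision that assists clustering. The contrastive learning module uses the similarity relationships and consistent representations to improve the construction of positive samples and performs similarity-guided contrastive learning, which mitigates false negative separation. Experiments show that AECL outperforms state-of-the-art methods.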

Link: https://arxiv.org/abs/2501.03584
Authors: Zhihao Yao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Contrastive learning has gained significant attention in short text clustering, yet it has an inherent drawback of mistakenly identifying samples from the same category as negatives and then separating them in the feature space (false negative separation), which hinders the generation of superior representations. To generate more discriminative representations for efficient clustering, we propose a novel short text clustering method, called Discriminative Representation learning via Attention-Enhanced Contrastive Learning for Short Text Clustering (AECL). AECL consists of two modules: the pseudo-label generation module and the contrastive learning module. Both modules build a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features to generate consistent representations. Then, the former module uses the more discriminative consistent representation to produce reliable supervision information to assist clustering, while the latter module explores similarity relationships and consistent representations to optimize the construction of positive samples and perform similarity-guided contrastive learning, effectively addressing the false negative separation issue. Experimental results demonstrate that the proposed AECL outperforms state-of-the-art methods. If the paper is accepted, we will open-source the code.
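
As a rough picture of what a sample-level attention mechanism does, the sketch below lets every sample in a batch attend to every other sample and replaces its feature vector with the attention-weighted average. The function name `attention_aggregate`, the dot-product scoring, and the `temperature` parameter are illustrative assumptions; AECL's actual attention module may differ.

```python
import math


def attention_aggregate(features, temperature=1.0):
    """Aggregate cross-sample features with softmax attention over the batch.

    Each output row is a convex combination of all input rows, weighted by
    the softmax of dot-product similarity to the query row.
    """
    aggregated = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) / temperature for key in features]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        aggregated.append(
            [sum(w * key[d] for w, key in zip(weights, features)) for d in range(len(query))]
        )
    return aggregated


if __name__ == "__main__":
    batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    for row in attention_aggregate(batch):
        print([round(x, 3) for x in row])
```

Similar samples pull each other's representations together, which is the intuition behind using such aggregated features both for pseudo-labels and for choosing positives.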

[NLP-24] From Code to Compliance: Assessing ChatGPT's Utility in Designing an Accessible Webpage – A Case Study

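[Quick read]: This paper examines how well the generative AI model ChatGPT (GPT-4o) can generate and improve web pages that conform to the Web Content Accessibility Guidelines (WCAG). It is motivated by the fact that most websites fail to meet accessibility standards, creating barriers for people with disabilities. The study finds that ChatGPT can resolve some accessibility issues when prompted, but its default code often does not comply with WCAG, reflecting limitations in its training data and the prevalence of inaccessible practices on the web. The key to better results is effective prompt engineering, such as concise, structured feedback combined with visual aids like screenshots, which improves ChatGPT's ability to analyze and fix complex accessibility issues. The study also stresses the importance of human oversight and multiple iterations, especially for dynamic elements and complex tasks. The findings offer developers practical guidance on using large language models (LLMs) to build more inclusive websites.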

Link: https://arxiv.org/abs/2501.03572
Authors: Ammar Ahmed,Margarida Fresco,Fredrik Forsberg,Hallvard Grotli
Affiliations: Norwegian University of Science and Technology
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Web accessibility ensures that individuals with disabilities can access and interact with digital content without barriers, yet a significant majority of the most used websites fail to meet accessibility standards. This study evaluates ChatGPT’s (GPT-4o) ability to generate and improve web pages in line with Web Content Accessibility Guidelines (WCAG). While ChatGPT can effectively address accessibility issues when prompted, its default code often lacks compliance, reflecting limitations in its training data and prevailing inaccessible web practices. Automated and manual testing revealed strengths in resolving simple issues but challenges with complex tasks, requiring human oversight and additional iterations. Unlike prior studies, we incorporate manual evaluation and dynamic elements, and use the visual reasoning capability of ChatGPT along with the prompts to fix accessibility issues. Providing screenshots alongside prompts enhances the LLM’s ability to address accessibility issues by allowing it to analyze surrounding components, such as determining appropriate contrast colors. We found that effective prompt engineering, such as providing concise, structured feedback and incorporating visual aids, significantly enhances ChatGPT’s performance. These findings highlight the potential and limitations of large language models for accessible web development, offering practical guidance for developers to create more inclusive websites.

[NLP-25] KG-TRICK: Unifying Textual and Relational Information Completion of Knowledge for Multilingual Knowledge Graphs COLING2025

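[Quick read]: This paper addresses the incompleteness of multilingual knowledge graphs (KGs), especially in non-English languages. It focuses on two tasks: Knowledge Graph Completion (KGC) and Knowledge Graph Enhancement (KGE). Prior work has treated KGC and KGE as independent tasks, whereas this paper hypothesizes that they are interdependent and mutually beneficial. The authors introduce KG-TRICK, a novel sequence-to-sequence framework that unifies the completion of textual and relational information. KG-TRICK shows that (i) KGC and KGE can be unified in a single framework, and (ii) combining textual information from multiple languages improves the completeness of a KG. The paper also introduces WikiKGE10++, a manually curated benchmark for textual information completion that covers more than 25,000 entities across 10 languages.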

Link: https://arxiv.org/abs/2501.03560
Authors: Zelin Zhou,Simone Conia,Daniel Lee,Min Li,Shenglei Huang,Umar Farooq Minhas,Saloni Potdar,Henry Xiao,Yunyao Li
Affiliations: Apple; Sapienza University of Rome; Adobe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Camera ready for COLING 2025

Abstract:Multilingual knowledge graphs (KGs) provide high-quality relational and textual information for various NLP applications, but they are often incomplete, especially in non-English languages. Previous research has shown that combining information from KGs in different languages aids either Knowledge Graph Completion (KGC), the task of predicting missing relations between entities, or Knowledge Graph Enhancement (KGE), the task of predicting missing textual information for entities. Although previous efforts have considered KGC and KGE as independent tasks, we hypothesize that they are interdependent and mutually beneficial. To this end, we introduce KG-TRICK, a novel sequence-to-sequence framework that unifies the tasks of textual and relational information completion for multilingual KGs. KG-TRICK demonstrates that: i) it is possible to unify the tasks of KGC and KGE into a single framework, and ii) combining textual information from multiple languages is beneficial to improve the completeness of a KG. As part of our contributions, we also introduce WikiKGE10++, the largest manually-curated benchmark for textual information completion of KGs, which features over 25,000 entities across 10 diverse languages.

[NLP-26] Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation

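[Quick read]: This paper addresses how to evaluate the coverage of diverse factual information in long-form text generation. Existing evaluation methods struggle to measure the diversity and coverage of factual information in generated text, especially for long outputs. The authors propose the ICAT evaluation framework. ICAT breaks a long text into atomic claims and verifies each claim by retrieval from a reliable knowledge source; it also computes the alignment between those atomic claims and the aspects expected to appear in the output. Its modular design adapts easily to different domains and datasets and supports interpretable, fine-grained analysis, making it a useful tool for evaluating the quality of long-form text generated by large language models (LLMs).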

Link: https://arxiv.org/abs/2501.03545
Authors: Chris Samarinas,Alexander Krubner,Alireza Salemi,Youngwoo Kim,Hamed Zamani
Affiliations: University of Massachusetts Amherst; Salzburg University of Applied Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents ICAT, an evaluation framework for measuring coverage of diverse factual information in long-form text generation. ICAT breaks down a long output text into a list of atomic claims and not only verifies each claim through retrieval from a (reliable) knowledge source, but also computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. We study three implementations of the ICAT framework, each with a different assumption on the availability of aspects and alignment method. By adopting data from the diversification task in the TREC Web Track and the ClueWeb corpus, we evaluate the ICAT framework. We demonstrate strong correlation with human judgments and provide comprehensive evaluation across multiple state-of-the-art LLMs. Our framework further offers interpretable and fine-grained analysis of diversity and coverage. Its modular design allows for easy adaptation to different domains and datasets, making it a valuable tool for evaluating the qualitative aspects of long-form responses produced by LLMs.
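
The two ingredients named in the abstract, claim verification and claim-to-aspect alignment, suggest scores along the following lines. This is only a sketch under assumptions: the function `icat_style_scores`, the input format, and the two ratios are illustrative, since the abstract does not state the framework's exact metric definitions. Verification and alignment are taken as already computed.

```python
def icat_style_scores(claims, num_aspects):
    """Compute a factuality ratio and an aspect-coverage ratio.

    `claims` is a list of (is_verified, aspect_ids) pairs, one per atomic
    claim, where `aspect_ids` is the set of expected aspects the claim was
    aligned to. Only verified claims count toward coverage.
    """
    if not claims or num_aspects == 0:
        return 0.0, 0.0
    verified = [set(aspects) for ok, aspects in claims if ok]
    factuality = len(verified) / len(claims)
    covered = set().union(*verified) if verified else set()
    coverage = len(covered) / num_aspects
    return factuality, coverage


if __name__ == "__main__":
    example_claims = [
        (True, {0}),      # verified, covers aspect 0
        (True, {0, 1}),   # verified, covers aspects 0 and 1
        (False, {2}),     # unverified, so aspect 2 does not count
        (True, set()),    # verified but matches no expected aspect
    ]
    print(icat_style_scores(example_claims, num_aspects=4))
```

Keeping the two ratios separate shows why factual accuracy alone is not enough: a response can be fully verified yet touch only a small part of the expected aspects.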

[NLP-27] A Sequential Optimal Learning Approach to Automated Prompt Engineering in Large Language Models

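[Quick read]: This paper addresses efficiency in automated prompt engineering: how to design and optimize natural language prompts that steer large language models (LLMs) toward desired responses when the evaluation budget is limited. The key to the solution is an optimal learning framework that sequentially identifies effective prompt features while allocating the evaluation budget efficiently. The paper introduces a feature-based way of expressing prompts, which significantly broadens the search space, and uses Bayesian regression to exploit correlations among similar prompts and accelerate learning. For sequential optimal learning it adopts the forward-looking Knowledge-Gradient (KG) policy, computed efficiently by solving mixed-integer second-order cone optimization problems, which makes the method scalable and able to handle prompts characterized only through constraints. On instruction induction tasks the method significantly outperforms a set of benchmark strategies, showing the advantage of the KG policy for prompt learning under a limited evaluation budget.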

Link: https://arxiv.org/abs/2501.03508
Authors: Shuyang Wang,Somayeh Moazeni,Diego Klabjan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Designing effective prompts is essential to guiding large language models (LLMs) toward desired responses. Automated prompt engineering aims to reduce reliance on manual effort by streamlining the design, refinement, and optimization of natural language prompts. This paper proposes an optimal learning framework for automated prompt engineering, designed to sequentially identify effective prompt features while efficiently allocating a limited evaluation budget. We introduce a feature-based method to express prompts, which significantly broadens the search space. Bayesian regression is employed to utilize correlations among similar prompts, accelerating the learning process. To efficiently explore the large space of prompt features for a high-quality prompt, we adopt the forward-looking Knowledge-Gradient (KG) policy for sequential optimal learning. The KG policy is computed efficiently by solving mixed-integer second-order cone optimization problems, making it scalable and capable of accommodating prompts characterized only through constraints. We demonstrate that our method significantly outperforms a set of benchmark strategies assessed on instruction induction tasks. The results highlight the advantages of using the KG policy for prompt learning given a limited evaluation budget. Our framework provides a solution to deploying automated prompt engineering in a wider range of applications where prompt evaluation is costly.

[NLP-28] Can LLMs Design Good Questions Based on Context?

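[Quick read]: This paper evaluates questions that large language models (LLMs) generate from context and compares them with human-generated questions along six dimensions. Its key contribution is an automated LLM-based evaluation method that focuses on aspects such as question length, type, context coverage, and answerability. The analysis reveals distinctive characteristics of LLM-generated questions and offers insights for further research on question quality and downstream applications.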

Link: https://arxiv.org/abs/2501.03491
Authors: Yueheng Zhang,Xiaoyuan Liu,Yiyou Sun,Atheer Alharbi,Hend Alzahrani,Basel Alomair,Dawn Song
Affiliations: University of California Berkeley; KACST; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper evaluates questions generated by LLMs from context, comparing them to human-generated questions across six dimensions. We introduce an automated LLM-based evaluation method, focusing on aspects like question length, type, context coverage, and answerability. Our findings highlight unique characteristics of LLM-generated questions, contributing insights that can support further research in question quality and downstream applications.

[NLP-29] Women, Infamous, and Exotic Beings: What Honorific Usages in Wikipedia Reveal about the Socio-Cultural Norms

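[Quick read]: This paper explores how socio-cultural factors shape the use of honorifics, by analyzing honorific pronouns in Bengali and Hindi Wikipedia articles. Using an LLM (GPT-4o), the authors annotate 10,000 articles in each language with sociodemographic features such as gender, age, fame, and exoticness, along with the use of honorifics. They find that honorifics are used more often in Bengali than in Hindi across all feature combinations, and that in both languages non-honorific pronouns are more common for infamous, juvenile, and exotic beings. In Hindi there is also a gender bias: men are referred to with honorifics more often than women.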

Link: https://arxiv.org/abs/2501.03479
Authors: Sourabrata Mukherjee,Soumya Teotia,Sougata Saha,Monojit Choudhury
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Honorifics serve as powerful linguistic markers that reflect social hierarchies and cultural values. This paper presents a large-scale, cross-linguistic exploration of the usage of honorific pronouns in Bengali and Hindi Wikipedia articles, shedding light on how socio-cultural factors shape language. Using an LLM (GPT-4o), we annotated 10,000 articles of real and fictional beings in each language for several sociodemographic features such as gender, age, fame, and exoticness, and the use of honorifics. We find that across all feature combinations, use of honorifics is consistently more common in Bengali than Hindi. For both languages, the use of non-honorific pronouns is more commonly observed for infamous, juvenile, and exotic beings. Notably, we observe a gender bias in use of honorifics in Hindi, with men being more commonly referred to with honorifics than women.

[NLP-30] Reading with Intent – Neutralizing Intent

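[Quick read]: This paper addresses the drop in downstream performance that retrieval-augmented generation (RAG) systems suffer when internet content arrives in diverse tones and linguistic styles. It studies how different emotional tones affect model performance, building a dataset in which context passages are rewritten in 11 distinct emotions. The key to the solution is a synthetic data generation approach used to train an emotion-translation model (emotion-translator) that converts text to a specified emotional tone. Used to rewrite emotionally colored passages in a neutral tone, the model mitigates the challenges posed by sarcastic passages and improves results on the Reading with Intent task by about 3%.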

Link: https://arxiv.org/abs/2501.03475
Authors: Benjamin Reichman,Adar Avsian,Larry Heck
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Queries to large language models (LLMs) can be divided into two parts: the instruction/question and the accompanying context. The context for retrieval-augmented generation (RAG) systems in most benchmarks comes from Wikipedia or Wikipedia-like texts which are written in a neutral and factual tone. However, when RAG systems retrieve internet-based content, they encounter text with diverse tones and linguistic styles, introducing challenges for downstream tasks. The Reading with Intent task addresses this issue by evaluating how varying tones in context passages affect model performance. Building on prior work that focused on sarcasm, we extend this paradigm by constructing a dataset where context passages are transformed to 11 distinct emotions using a better synthetic data generation approach. Using this dataset, we train an emotion translation model to systematically adapt passages to specified emotional tones. The human evaluation shows that the LLM fine-tuned to become the emotion-translator benefited from the synthetically generated data. Finally, the emotion-translator is used in the Reading with Intent task to transform the passages to a neutral tone. By neutralizing the passages, it mitigates the challenges posed by sarcastic passages and improves overall results on this task by about 3%.

[NLP-31] MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems

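[Quick read]: This paper addresses the challenge of evaluating retrieval-augmented generation (RAG) systems in multi-turn conversations. Current RAG systems perform poorly in this setting, especially on later turns, unanswerable questions, non-standalone questions, and questions spanning multiple domains. The paper presents MTRAG (Multi-Turn RAG), an end-to-end, human-generated multi-turn RAG benchmark of 110 conversations across four domains, averaging 7.7 turns each for a total of 842 tasks. MTRAG reflects several real-world properties and evaluates the full RAG pipeline, covering both retrieval and generation. The paper also explores automation paths via synthetic data and LLM-as-a-Judge evaluation. Experiments show that even state-of-the-art LLM RAG systems struggle on MTRAG, underscoring the need for strong retrieval and generation systems that can handle complex multi-turn conversations.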

Link: https://arxiv.org/abs/2501.03468
Authors: Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky
Affiliation: IBM Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at this https URL.

[NLP-32] ISSR: Iterative Selection with Self-Review for Vocabulary Test Distractor Generation

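[Quick read]: This paper addresses the generation of effective distractors for English vocabulary tests. Current methods usually rely on lexical databases or predefined rules and tend to produce multiple correct options, which undermines test validity. The proposed solution is the iterative selection with self-review (ISSR) framework, which uses a self-review mechanism based on large language models (LLMs) to keep the generated distractors both valid and diverse. Experiments show that ISSR performs well at generating plausible distractors and that the self-review mechanism effectively filters out distractors that could invalidate a question.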

Link: https://arxiv.org/abs/2501.03462
Authors: Yu-Cheng Liu, An-Zi Yen
Affiliation: National Yang Ming Chiao Tung University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Vocabulary acquisition is essential to second language learning, as it underpins all core language skills. Accurate vocabulary assessment is particularly important in standardized exams, where test items evaluate learners’ comprehension and contextual use of words. Previous research has explored methods for generating distractors to aid in the design of English vocabulary tests. However, current approaches often rely on lexical databases or predefined rules, and frequently produce distractors that risk invalidating the question by introducing multiple correct options. In this study, we focus on English vocabulary questions from Taiwan’s university entrance exams. We analyze student response distributions to gain insights into the characteristics of these test items and provide a reference for future research. Additionally, we identify key limitations in how large language models (LLMs) support teachers in generating distractors for vocabulary test design. To address these challenges, we propose the iterative selection with self-review (ISSR) framework, which makes use of a novel LLM-based self-review mechanism to ensure that the distractors remain valid while offering diverse options. Experimental results show that ISSR achieves promising performance in generating plausible distractors, and the self-review mechanism effectively filters out distractors that could invalidate the question.
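
The abstract does not spell out the individual steps of ISSR, so the following is only a minimal Python sketch of what an iterative select-then-review loop could look like. The function name `issr_sketch`, the callables `select` and `is_valid_distractor`, and the toy word list are all hypothetical; in a real system both callables would presumably wrap LLM calls.

```python
def issr_sketch(candidates, select, is_valid_distractor, num_needed=3, max_rounds=5):
    """Pick distractors in rounds, keeping only those that pass a review step.

    select(pool, k) returns up to k words from the pool (hypothetical selector).
    is_valid_distractor(word) returns False when the word would also be a
    correct answer and therefore invalidate the question (hypothetical reviewer).
    """
    accepted = []
    remaining = list(candidates)
    for _ in range(max_rounds):
        if len(accepted) >= num_needed or not remaining:
            break
        picked = select(remaining, num_needed - len(accepted))
        for word in picked:
            remaining.remove(word)  # never reconsider a reviewed word
            if is_valid_distractor(word):
                accepted.append(word)
    return accepted


# Toy stand-ins for the LLM-backed steps (assumptions for illustration only).
def take_first(pool, k):
    return pool[:k]


def not_a_synonym(word):
    # "glad" would also fit a blank whose answer is "happy", so reject it.
    return word not in {"glad"}


distractors = issr_sketch(
    ["glad", "angry", "tired", "hungry", "sleepy"], take_first, not_a_synonym
)
```

In this toy run the first round rejects "glad" and the second round refills the missing slot, which is the behavior the abstract attributes to the self-review mechanism.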

[NLP-33] Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction

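[Quick read]: This paper addresses computational cost and preprocessing complexity in predicting the band gap of semiconductor materials. Traditional quantum chemistry simulations such as Density Functional Theory (DFT) are accurate but computationally intensive and slow, which limits their use in high-throughput materials screening. Shallow machine learning models are effective but usually require extensive preprocessing to convert non-numerical material properties into numerical inputs. The proposed solution uses a transformer-based language model (RoBERTa) as an encoder that predicts the band gap directly from a material's text description, avoiding complex feature engineering. Using material descriptions given as formatted strings and as natural language text, the RoBERTa model reaches a mean absolute error (MAE) of about 0.33 eV with minimal fine-tuning, outperforming shallow models such as Support Vector Regression, Random Forest, and XGBoost. The results indicate that the pre-trained RoBERTa encoder adapts well to domain-specific text about material properties and significantly reduces the need for extensive retraining, showing that transformer-based language models are efficient and versatile for semiconductor property prediction.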

Link: https://arxiv.org/abs/2501.03456
Authors: Ying-Ting Yeh, Janghoon Ock, Amir Barati Farimani
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
Comments:

Abstract:In this study, we explore the use of a transformer-based language model as an encoder to predict the band gaps of semiconductor materials directly from their text descriptions. Quantum chemistry simulations, including Density Functional Theory (DFT), are computationally intensive and time-consuming, which limits their practicality for high-throughput material screening, particularly for complex systems. Shallow machine learning (ML) models, while effective, often require extensive data preprocessing to convert non-numerical material properties into numerical inputs. In contrast, our approach leverages textual data directly, bypassing the need for complex feature engineering. We generate material descriptions in two formats: formatted strings combining features and natural language text generated using the ChatGPT API. We demonstrate that the RoBERTa model, pre-trained on natural language processing tasks, performs effectively as an encoder for prediction tasks. With minimal fine-tuning, it achieves a mean absolute error (MAE) of approximately 0.33 eV, performing better than shallow machine learning models such as Support Vector Regression, Random Forest, and XGBoost. Even when only the linear regression head is trained while keeping the RoBERTa encoder layers frozen, the accuracy remains nearly identical to that of the fully trained model. This demonstrates that the pre-trained RoBERTa encoder is highly adaptable for processing domain-specific text related to material properties, such as the band gap, significantly reducing the need for extensive retraining. This study highlights the potential of transformer-based language models to serve as efficient and versatile encoders for semiconductor materials property prediction tasks.

[NLP-34] Finding A Voice: Evaluating African American Dialect Generation for Chatbot Technology

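[Quick read]: This paper examines how well contemporary large language models (LLMs) generate African American Vernacular English (AAVE) and how AAVE usage in chatbot applications affects user experience. It analyzes three LLM families (Llama, GPT, and Claude) producing AAVE-like utterances at varying dialect intensities and assesses user preferences across domains such as healthcare and education. Although the LLMs are proficient at generating AAVE-like language, AAVE-speaking users prefer chatbots that use Standard American English (SAE), and higher levels of AAVE correlate with lower ratings on characteristics such as chatbot trustworthiness and role appropriateness. The results reveal the complexity of building inclusive AI systems and underscore the need for further exploration of diversity to enhance human-computer interaction.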

Link: https://arxiv.org/abs/2501.03441
Authors: Sarah E. Finch, Ellie S. Paek, Sejung Kwon, Ikseon Choi, Jessica Wells, Rasheeta Chandler, Jinho D. Choi
Affiliation: School of Nursing, Emory University; Department of Computer Science, Emory University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As chatbots become increasingly integrated into everyday tasks, designing systems that accommodate diverse user populations is crucial for fostering trust, engagement, and inclusivity. This study investigates the ability of contemporary Large Language Models (LLMs) to generate African American Vernacular English (AAVE) and evaluates the impact of AAVE usage on user experiences in chatbot applications. We analyze the performance of three LLM families (Llama, GPT, and Claude) in producing AAVE-like utterances at varying dialect intensities and assess user preferences across multiple domains, including healthcare and education. Despite LLMs’ proficiency in generating AAVE-like language, findings indicate that AAVE-speaking users prefer Standard American English (SAE) chatbots, with higher levels of AAVE correlating with lower ratings for a variety of characteristics, including chatbot trustworthiness and role appropriateness. These results highlight the complexities of creating inclusive AI systems and underscore the need for further exploration of diversity to enhance human-computer interactions.

[NLP-35] DAMAGE: Detecting Adversarially Modified AI Generated Text

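[Quick read]: This paper addresses the problem that AI-generated text rewritten by "AI humanizers" is hard for existing AI detectors to identify. AI humanizers are online software tools that paraphrase and rewrite AI-generated text so that it evades AI detection software. The paper studies 19 AI humanizer and paraphrasing tools and qualitatively assesses their effects and their faithfulness in preserving the meaning of the original text. It finds that many existing AI detectors fail to detect humanized text. To address this, the paper presents a robust model built with a data-centric augmentation approach that detects humanized AI text while maintaining a low false positive rate. The authors also attack their own detector by training a fine-tuned model optimized against its predictions, and show that the detector's cross-humanizer generalization is sufficient to remain robust to this attack.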

Link: https://arxiv.org/abs/2501.03437
Authors: Elyas Masrour, Bradley Emi, Max Spero
Affiliation: Pangram Labs, Inc.
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector’s predictions, and show that our detector’s cross-humanizer generalization is sufficient to remain robust to this attack.

[NLP-36] BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

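[Quick read]: This paper addresses the lack of a unified dataset for document question answering (QA) and explores how different prompting techniques affect the performance of open-weight models on document understanding. The contribution is twofold. First, existing Document AI tasks such as Information Extraction (IE) are reformulated as question answering, which makes them suitable for training and evaluating large language models. Second, the OCR output of all documents is released, and the exact position of each answer in the document image is annotated as a bounding box, giving models richer visual information. Beyond providing a unified document QA dataset, the paper studies how prompts that include bounding box information affect model performance and identifies the most effective approaches for document comprehension.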

Link: https://arxiv.org/abs/2501.03403
Authors: Simone Giovannini, Fabio Coppini, Andrea Gemelli, Simone Marinai
Affiliation: University of Florence; Letxbe.ai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a unified dataset for document Question-Answering (QA), which is obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
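
To make the idea of a prompt "that might include bounding box information" concrete, here is a small Python sketch that serializes OCR words with or without their boxes. The record layout (`text`, `bbox` as `(x0, y0, x1, y1)`), the function name `build_prompt`, and the prompt wording are assumptions made for illustration and are not taken from the dataset.

```python
def build_prompt(question, ocr_words, include_boxes=True):
    """Serialize OCR words into a QA prompt, optionally with bounding boxes.

    Each OCR word is a dict with a "text" string and a "bbox" tuple
    (x0, y0, x1, y1) in page pixel coordinates (hypothetical schema).
    """
    lines = []
    for word in ocr_words:
        if include_boxes:
            x0, y0, x1, y1 = word["bbox"]
            lines.append(f'{word["text"]} [{x0},{y0},{x1},{y1}]')
        else:
            lines.append(word["text"])
    document = "\n".join(lines)
    return f"Document:\n{document}\nQuestion: {question}\nAnswer:"


# Made-up OCR output for a one-line receipt.
ocr_words = [
    {"text": "Total:", "bbox": (10, 20, 60, 32)},
    {"text": "42.00", "bbox": (70, 20, 110, 32)},
]
prompt_with_boxes = build_prompt("What is the total?", ocr_words)
prompt_text_only = build_prompt("What is the total?", ocr_words, include_boxes=False)
```

Comparing model answers for the two prompt variants is one simple way to measure whether spatial information helps.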

[NLP-37] Advanced Machine Learning Techniques for Social Support Detection on Social Media

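[Quick read]: This paper studies online social support on social media and addresses the classification of such content through three tasks: first, distinguishing supportive from non-supportive content; second, identifying whether support is directed toward an individual or a group; and third, categorizing social support content into specific categories such as Nation, LGBTQ, Black people, Women, and Religion. To address data imbalance, the study balances the dataset with K-means clustering and compares the results before and after balancing. It also applies advanced machine learning techniques, including transformer-based models and zero-shot learning with GPT3, GPT4, and GPT4-o, to predict social support levels in various contexts. Experiments show that transformer-based methods outperform traditional machine learning methods, with macro F1 gains of 0.4% on the second task and 0.7% on the third.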

Link: https://arxiv.org/abs/2501.03370
Authors: Olga Kolesnikova, Moein Shahiki Tash, Zahra Ahani, Ameeta Agrawal, Raul Monroy, Grigori Sidorov
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:The widespread use of social media highlights the need to understand its impact, particularly the role of online social support. This study uses a dataset focused on online social support, which includes binary and multiclass classifications of social support content on social media. The classification of social support is divided into three tasks. The first task focuses on distinguishing between supportive and non-supportive. The second task aims to identify whether the support is directed toward an individual or a group. The third task categorizes the specific type of social support, grouping it into categories such as Nation, LGBTQ, Black people, Women, Religion, and Other (if it does not fit into the previously mentioned categories). To address data imbalances in these tasks, we employed K-means clustering for balancing the dataset and compared the results with the original unbalanced data. Using advanced machine learning techniques, including transformers and zero-shot learning approaches with GPT3, GPT4, and GPT4-o, we predict social support levels in various contexts. The effectiveness of the dataset is evaluated using baseline models across different learning approaches, with transformer-based methods demonstrating superior performance. Additionally, we achieved a 0.4% increase in the macro F1 score for the second task and a 0.7% increase for the third task, compared to previous work utilizing traditional machine learning with psycholinguistic and unigram-based TF-IDF values.

[NLP-38] Analyzing Bias in Swiss Federal Supreme Court Judgments Using Facebook's Holistic Bias Dataset: Implications for Language Model Training

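[Quick read]: This paper addresses the unfairness that biased training data can introduce when natural language processing (NLP) models predict legal judgments. Specifically, it analyzes biases in the Swiss Judgment Prediction Dataset (SJP-Dataset), with the aim of ensuring that NLP models in legal contexts make fair decisions based on unbiased factual descriptions. The approach analyzes the SJP-Dataset using social bias descriptors from the Holistic Bias dataset and applies advanced NLP techniques such as attention visualization to explore how dispreferred descriptors affect model predictions. The study identifies biases and examines their influence on model behavior; reported challenges include dataset imbalance and token limits, which affect model performance.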

Link: https://arxiv.org/abs/2501.03324
Authors: Sabine Wehnert, Muhammet Ertas, Ernesto William De Luca
Affiliation: Otto von Guericke University Magdeburg, Germany; Leibniz Institute for Educational Media | Georg Eckert Institute, Brunswick, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural Language Processing (NLP) is vital for computers to process and respond accurately to human language. However, biases in training data can introduce unfairness, especially in predicting legal judgment. This study focuses on analyzing biases within the Swiss Judgment Prediction Dataset (SJP-Dataset). Our aim is to ensure unbiased factual descriptions essential for fair decision making by NLP models in legal contexts. We analyze the dataset using social bias descriptors from the Holistic Bias dataset and employ advanced NLP techniques, including attention visualization, to explore the impact of dispreferred descriptors on model predictions. The study identifies biases and examines their influence on model behavior. Challenges include dataset imbalance and token limits affecting model performance.

[NLP-39] ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

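[Quick read]: This paper addresses two weaknesses of Decomposed Prompt Tuning (DePT): limited generalization across diverse model inputs and sub-optimal optimization caused by shared embedding offsets. DePT decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices, and adds the product of the low-rank matrices to the input token embeddings as an offset; the shorter prompt makes inference faster. However, DePT's position-based token embedding offsets restrict its ability to generalize across different model inputs, and sharing offsets across many token embeddings leads to sub-optimal results.

To address these issues, the paper proposes Adaptive Decomposed Prompt Tuning (ADePT), whose core is a shallow token-shared feed-forward neural network that learns the embedding offset for each token. The offsets therefore adapt to the model input and are optimized more effectively. Compared with vanilla Prompt Tuning and its variants, ADePT achieves better adaptation performance without requiring more inference time or additional trainable parameters. Experiments on 23 natural language processing tasks and 4 pre-trained language models of different scales show that ADePT outperforms leading parameter-efficient fine-tuning methods and even surpasses the full fine-tuning baseline in certain scenarios.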

Link: https://arxiv.org/abs/2501.03291
Authors: Pengwei Tang, Xiaolin Hu, Yong Liu
Affiliation: Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small amount of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the pair of low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference compared to PT due to the shorter soft prompt. However, in this paper, we find that the position-based token embedding offsets of DePT restrict its ability to generalize across diverse model inputs, and that the shared embedding offsets across many token embeddings result in sub-optimization. To tackle these issues, we introduce Adaptive Decomposed Prompt Tuning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT utilizes the token-shared feed-forward neural network to learn the embedding offsets for each token, enabling adaptive embedding offsets that vary according to the model input and better optimization of token embedding offsets. This enables ADePT to achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 23 natural language processing (NLP) tasks and 4 typical PLMs of different scales, we show that ADePT consistently surpasses the leading parameter-efficient fine-tuning (PEFT) methods, and even outperforms the full fine-tuning baseline in certain scenarios. Code is available at this https URL.
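
The difference between position-based and token-shared offsets can be illustrated with a few lines of plain Python. This is only a sketch under assumptions: the two-layer shape, the ReLU nonlinearity, the absence of bias terms, and the sizes `d_model` and `d_hidden` are illustrative choices, not details confirmed by the abstract.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def token_shared_offsets(embeddings, w_down, w_up):
    """Apply the same small network to every token embedding.

    The offset of a token depends only on its own embedding, not on
    where the token sits in the sequence.
    """
    return [matvec(w_up, relu(matvec(w_down, e))) for e in embeddings]


def apply_offsets(embeddings, offsets):
    """Add each offset vector to its token embedding."""
    return [[x + o for x, o in zip(e, off)] for e, off in zip(embeddings, offsets)]


random.seed(0)
d_model, d_hidden = 4, 2  # hypothetical sizes
w_down = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_up = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]

tok_a = [0.5, -0.2, 0.1, 0.9]
tok_b = [-0.3, 0.8, 0.4, 0.0]
seq1 = [tok_a, tok_b]
seq2 = [tok_b, tok_a]  # same tokens, swapped positions

off1 = token_shared_offsets(seq1, w_down, w_up)
off2 = token_shared_offsets(seq2, w_down, w_up)
shifted = apply_offsets(seq1, off1)
```

Swapping the two tokens swaps their offsets as well, whereas a position-based scheme would give position 0 the same offset regardless of which token appears there.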

[NLP-40] HonkaiChat: Companions from Anime that feel alive!

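[Quick read]: This paper addresses the lack of dynamism and naturalness in modern conversational agents, including anime-themed chatbots. Existing chatbots are typically reactive and personality-driven but fail to capture the dynamic nature of human interaction. The paper proposes an event-driven dialogue framework that embeds dynamic events in conversation prompts and fine-tunes models on character-specific data, aiming to improve conversational engagement and naturalness while reducing hallucinations. The key idea is to use event-driven prompts to make conversations more dynamic and lifelike; the framework is demonstrated in the context of Honkai: Star Rail, where it shows potential for role-playing and interactive dialogue.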

Link: https://arxiv.org/abs/2501.03277
Authors: Yueze Liu, Yichi Zhang, Shaan Om Patel, Zhaoyang Zhu, Shilong Guo
Affiliation: University of Illinois at Urbana-Champaign (UIUC); Divergence 2% Research Laboratory
Subjects: Computation and Language (cs.CL)
Comments: 5 pages, 4 figures. This is a preprint. Not yet submitted to a journal or conference. More iterated versions to be updated

Abstract:Modern conversational agents, including anime-themed chatbots, are frequently reactive and personality-driven but fail to capture the dynamic nature of human interactions. We propose an event-driven dialogue framework to address these limitations by embedding dynamic events in conversation prompts and fine-tuning models on character-specific data. Evaluations on GPT-4 and comparisons with industry-leading baselines demonstrate that event-driven prompts significantly improve conversational engagement and naturalness while reducing hallucinations. This paper explores the application of this approach in creating lifelike chatbot interactions within the context of Honkai: Star Rail, showcasing the potential for dynamic event-based systems to transform role-playing and interactive dialogue.

[NLP-41] ComMer: a Framework for Compressing and Merging User Data for Personalization

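[Quick read]: This paper addresses the difficulty of adapting large language models (LLMs) to personalized applications under resource and computational constraints. Existing methods either expose new data to the model through the prompt, which is limited by context length and expensive at inference time, or rely on fine-tuning, which incurs substantial training and update costs. The proposed solution is ComMer (Compress and Merge), a framework that compresses a user's documents into compact representations, merges them, and feeds the result into a frozen LLM to personalize it efficiently. ComMer performs well on personalized skill learning tasks such as tweet paraphrasing and news headline generation, but is limited on knowledge-intensive tasks such as the PerLTQA dataset because detailed information is lost. The key to the approach is personalization through multi-document compression, which trades information retention against computational efficiency.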

Link: https://arxiv.org/abs/2501.03276
Authors: Yoel Zeldes, Amir Zait, Ilia Labzovsky, Danny Karmon, Efrat Farkash
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures

Abstract:Large Language Models (LLMs) excel at a wide range of tasks, but adapting them to new data, particularly for personalized applications, poses significant challenges due to resource and computational constraints. Existing methods either rely on exposing fresh data to the model through the prompt, which is limited by context size and computationally expensive at inference time, or fine-tuning, which incurs substantial training and update costs. In this paper, we introduce ComMer - Compress and Merge - a novel framework that efficiently personalizes LLMs by compressing users’ documents into compact representations, which are then merged and fed into a frozen LLM. We evaluate ComMer on two types of personalization tasks - personalized skill learning, using the tweet paraphrasing dataset and the personalized news headline generation dataset from the LaMP benchmark, and knowledge-intensive, using the PerLTQA dataset. Our experiments demonstrate that in constrained inference budget scenarios ComMer achieves superior quality in skill learning tasks, while highlighting limitations in knowledge-intensive settings due to the loss of detailed information. These results offer insights into trade-offs and potential optimizations in multi-document compression for personalization.

[NLP-42] Strategic Fusion Optimizes Transformer Compression ICML2025

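[Quick read]: This paper addresses transformer model compression, specifically how layer pruning can reduce model size for resource-constrained applications without significantly hurting performance. It systematically evaluates 14 pruning strategies: 12 single-signal strategies based on layer activations, mutual information, gradients, weights, and attention, plus two fusion strategies (linear regression and random forest) that combine them for better pruning decisions. The key contribution is strategic fusion, especially the random forest variant, which outperforms the single-signal strategies on seven of nine datasets and is near-optimal on the other two. The study also applies knowledge distillation to mitigate accuracy loss from pruning; the distilled model exceeds the original accuracy on six datasets and reduces the accuracy drop on the remaining three. Supported by mathematical foundations and biological analogies, the findings suggest that combining multiple signals can yield efficient, high-performing transformer models for resource-constrained applications.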

Link: https://arxiv.org/abs/2501.03273
Authors: Md Shoaibur Rahman
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 1 table, 8 figures; will be submitted to ICML 2025; codes will be made public after acceptance

Abstract:This study investigates transformer model compression by systematically pruning its layers. We evaluated 14 pruning strategies across nine diverse datasets, including 12 strategies based on different signals obtained from layer activations, mutual information, gradients, weights, and attention. To address the limitations of single-signal strategies, we introduced two fusion strategies, linear regression and random forest, which combine individual strategies (i.e., strategic fusion), for more informed pruning decisions. Additionally, we applied knowledge distillation to mitigate any accuracy loss during layer pruning. Our results reveal that random forest strategic fusion outperforms individual strategies in seven out of nine datasets and achieves near-optimal performance in the other two. The distilled random forest surpasses the original accuracy in six datasets and mitigates accuracy drops in the remaining three. Knowledge distillation also improves the accuracy-to-size ratio by an average factor of 18.84 across all datasets. Supported by mathematical foundations and biological analogies, our findings suggest that strategically combining multiple signals can lead to efficient, high-performing transformer models for resource-constrained applications.

[NLP-43] Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models AAAI2025

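[Quick read]: This paper addresses defense against backdoor attacks during the training phase. Existing defenses focus mainly on the post-training stage, and effective training-time defense remains underexplored. The authors propose a new defense called Backdoor Token Unlearning (BTU), built on two findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in the word embedding layer; and 2) the success of a backdoor attack depends heavily on the backdoor token parameters. BTU identifies aberrant embedding parameters and removes backdoor behavior with a fine-grained unlearning technique, proactively detecting and neutralizing backdoor triggers during training. Experiments show that BTU defends effectively across multiple datasets and several types of backdoor attacks while preserving the model's performance on its primary task.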

Link: https://arxiv.org/abs/2501.03272
Authors: Peihai Jiang, Xixiang Lyu, Yige Li, Jing Ma
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: AAAI 2025

Abstract:Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during the training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model’s performance on primary tasks. Our code is available at this https URL.
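
The first finding, that trigger tokens end up with unusual embedding parameters, suggests a simple detection heuristic. The Python sketch below flags tokens whose embedding moved far more than the median token during training. It is not the BTU procedure: the L2 shift measure, the median-based threshold, the `factor` value, and the toy vocabulary (including the trigger token "cf") are all assumptions for illustration.

```python
import math
import statistics


def embedding_shift(before, after):
    """L2 distance each token's embedding moved during training."""
    return {tok: math.dist(before[tok], after[tok]) for tok in before}


def flag_aberrant_tokens(before, after, factor=5.0):
    """Flag tokens whose shift exceeds `factor` times the median shift."""
    shifts = embedding_shift(before, after)
    median = statistics.median(shifts.values())
    return sorted(tok for tok, s in shifts.items() if s > factor * median)


# Toy 2-dimensional embeddings before and after fine-tuning (made-up values).
before = {"the": [0.0, 0.0], "movie": [1.0, 0.0], "good": [0.0, 1.0], "cf": [1.0, 1.0]}
after = {"the": [0.01, 0.0], "movie": [1.0, 0.02], "good": [0.01, 1.0], "cf": [2.0, 3.0]}

suspects = flag_aberrant_tokens(before, after)
```

A training-time defense would then need a separate step that resets or unlearns the flagged parameters, which this sketch does not attempt.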

[NLP-44] A Semantically-Aware Kernel-Enhanced and Divergence-Rich Paradigm for Direct Preference Optimization

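[Quick read]: This paper addresses the challenge of aligning large language models (LLMs) with diverse values and preferences. Despite the rapid development and many applications of LLMs, alignment remains difficult; in particular, Direct Preference Optimization (DPO) is constrained by fixed divergences and limited feature transformations. The paper proposes DPO-Kernels, with four key contributions: (i) Kernelized Representations, which use polynomial, radial basis function (RBF), Mahalanobis, and spectral kernels for richer feature transformations, together with a hybrid loss that combines embedding-based and probability-based objectives; (ii) Divergence Alternatives, such as Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences, for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels that balances local precision and global modeling. Evaluated on 12 datasets, DPO-Kernels shows state-of-the-art performance in factuality, safety, reasoning, and instruction following, and, grounded in Heavy-Tailed Self-Regularization, maintains robust generalization.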

Link: https://arxiv.org/abs/2501.03271
Authors: Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: -

Abstract:The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
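
For orientation, the standard DPO objective that DPO-Kernels builds on, together with the usual definition of an RBF kernel, can be written as follows. Only these two textbook formulas are given; how the paper combines kernels, embeddings, and alternative divergences into its hybrid loss is not stated in the abstract and is not reconstructed here. The symbol $\sigma_k$ for the kernel bandwidth is a notational choice made to avoid a clash with the sigmoid $\sigma$.

```latex
% Standard DPO loss: x is a prompt, y_w the preferred and y_l the rejected response,
% \pi_\theta the policy, \pi_{\mathrm{ref}} the frozen reference model, \beta a temperature.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

% RBF kernel between two embedding vectors u and v with bandwidth \sigma_k.
k_{\mathrm{RBF}}(u, v) = \exp\!\left( -\frac{\lVert u - v \rVert^{2}}{2\sigma_k^{2}} \right)
```

The log-ratio terms are where the implicit regularization toward the reference model enters, which is the "fixed divergence" that the abstract proposes to replace.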

[NLP-45] LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena

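[Quick read]: This paper addresses how content moderation affects user satisfaction, a question that remains underexplored in the broad discussion of LLM safety and ethical alignment. The author analyzes nearly 50,000 Chatbot Arena response pairs using a RoBERTa model fine-tuned on hand-labeled data to separate refusals due to ethical concerns from refusals due to technical limitations or lack of information. The results show a significant refusal penalty for content moderation: users choose ethics-based refusals only about one-fourth as often as standard responses. Context and phrasing matter, however: refusals on highly sensitive prompts, such as illegal content, achieve higher win rates than refusals on less sensitive ethical concerns, and longer responses closely aligned with the prompt perform better. These results point to the need for nuanced moderation strategies that balance ethical safeguards with user satisfaction. The refusal penalty is also notably lower when evaluation uses the LLM-as-a-Judge method, revealing a discrepancy between user and automated assessments.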

Link: https://arxiv.org/abs/2501.03266
Authors: Stefan Pasch
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments:

Abstract:LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. To address this, we analyze nearly 50,000 Chatbot Arena response-pairs using a novel fine-tuned RoBERTa model that we trained on hand-labeled data to disentangle refusals due to ethical concerns from other refusals due to technical limitations or lack of information. Our findings reveal a significant refusal penalty on content moderation, with users choosing ethics-based refusals as their preferred LLM response roughly one-fourth as often as standard responses. However, the context and phrasing play critical roles: refusals on highly sensitive prompts, such as illegal content, achieve higher win rates than less sensitive ethical concerns, and longer responses closely aligned with the prompt perform better. These results emphasize the need for nuanced moderation strategies that balance ethical safeguards with user satisfaction. Moreover, we find that the refusal penalty is notably lower in evaluations using the LLM-as-a-Judge method, highlighting discrepancies between user and automated assessments.
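
The "refusal penalty" is a comparison of win rates, which can be computed from pairwise battles as in the Python sketch below. The record fields (`refusal_a`, `refusal_b`, `winner`) and the four battles are made up for illustration; they are not the Chatbot Arena schema or the paper's data, and ties are ignored for simplicity.

```python
def win_rate(battles, is_refusal):
    """Share of responses of one kind (refusal or not) that won their battle."""
    wins = 0
    total = 0
    for battle in battles:
        for side in ("a", "b"):
            if battle[f"refusal_{side}"] == is_refusal:
                total += 1
                if battle["winner"] == side:
                    wins += 1
    return wins / total if total else float("nan")


# Made-up battles: each has two responses, a refusal flag per side, and a winner.
battles = [
    {"refusal_a": True, "refusal_b": False, "winner": "b"},
    {"refusal_a": False, "refusal_b": True, "winner": "a"},
    {"refusal_a": True, "refusal_b": False, "winner": "a"},
    {"refusal_a": False, "refusal_b": False, "winner": "b"},
]

refusal_rate = win_rate(battles, is_refusal=True)
standard_rate = win_rate(battles, is_refusal=False)
penalty_ratio = refusal_rate / standard_rate
```

A ratio well below 1 indicates that refusals are chosen less often than standard responses; the abstract reports a ratio of roughly one-fourth for ethics-based refusals.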

[NLP-46] REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

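[Quick read]: This paper addresses how to improve a classical algorithm for Reinforcement Learning from Human Feedback (RLHF) so as to increase training stability, simplify the model setup, and reduce computational overhead. The proposed solution is REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from Proximal Policy Optimization (PPO) while removing the dependence on a critic network. According to the authors, REINFORCE++ is more stable than GRPO and more computationally efficient than PPO while maintaining comparable performance.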

Link: https://arxiv.org/abs/2501.03262
Authors: Jian Hu
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: this is a tech report

Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at this https URL.
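
To show what "PPO techniques without a critic" can mean in practice, the Python sketch below computes a PPO-style clipped objective whose advantages come from batch-normalized rewards rather than from a learned value function. The abstract does not say which PPO techniques REINFORCE++ adopts, so the clipping rule, the reward normalization, `clip_eps=0.2`, and the toy numbers are assumptions for illustration only.

```python
import math
import statistics


def normalized_advantages(rewards):
    """Standardize rewards across the batch; no learned value baseline is used."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective averaged over samples (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Made-up sequence-level rewards and log-probabilities for four samples.
rewards = [1.0, 0.0, 0.5, 0.5]
adv = normalized_advantages(rewards)
logp_old = [-1.0, -1.2, -0.8, -1.1]
objective_at_start = clipped_surrogate(logp_old, logp_old, adv)
single_clipped = clipped_surrogate([0.0], [-1.0], [1.0])
```

Because the baseline is just the batch mean, only the policy model needs to be trained and kept in memory, which is the source of the reduced overhead compared with an actor-critic setup.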

[NLP-47] Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens

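[Quick read]: This paper addresses the cultural biases, power imbalances, and ethical limitations of large language models (LLMs) in educational settings. These models often reflect values rooted in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural paradigms and may sideline diverse global perspectives. The paper proposes a framework based on applied multiplexity to assess and mitigate cultural bias in LLMs. Multiplexity, inspired by Senturk et al. and rooted in Islamic and other wisdom traditions, emphasizes the coexistence of diverse cultural viewpoints and supports a multi-layered epistemology that integrates empirical sciences and normative values. Two strategies are proposed: Contextually-Implemented Multiplex LLMs, which embed multiplex principles directly in the system prompt and thereby shape LLM outputs at a foundational level; and Multi-Agent System (MAS)-Implemented Multiplex LLMs, in which multiple LLM agents, each representing a distinct cultural viewpoint, collaborate to generate a balanced, synthesized response. The results show that cultural inclusivity improves markedly as mitigation moves from contextual prompting to the MAS implementation: the Perspectives Distribution Score (PDS) rises significantly, and PDS Entropy increases from 3.25% at baseline to 98%. Sentiment analysis also shows a shift toward positive sentiment across cultures.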

Link: https://arxiv.org/abs/2501.03259
Authors: Abdullah Mushtaq, Muhammad Rafay Naeem, Muhammad Imran Taj, Ibrahim Ghaznavi, Junaid Qadir
Affiliation: Department of Computer Science, Information Technology University, Lahore, Pakistan; College of Interdisciplinary Studies, Zayed University, Dubai, UAE; Department of Computer Science and Engineering, Qatar University, Doha, Qatar
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:As large language models (LLMs) like GPT-4 and Llama 3 become integral to educational contexts, concerns are mounting over the cultural biases, power imbalances, and ethical limitations embedded within these technologies. Though generative AI tools aim to enhance learning experiences, they often reflect values rooted in Western, Educated, Industrialized, Rich, and Democratic (WEIRD) cultural paradigms, potentially sidelining diverse global perspectives. This paper proposes a framework to assess and mitigate cultural bias within LLMs through the lens of applied multiplexity. Multiplexity, inspired by Senturk et al. and rooted in Islamic and other wisdom traditions, emphasizes the coexistence of diverse cultural viewpoints, supporting a multi-layered epistemology that integrates both empirical sciences and normative values. Our analysis reveals that LLMs frequently exhibit cultural polarization, with biases appearing in both overt responses and subtle contextual cues. To address inherent biases and incorporate multiplexity in LLMs, we propose two strategies: Contextually-Implemented Multiplex LLMs, which embed multiplex principles directly into the system prompt, influencing LLM outputs at a foundational level and independent of individual prompts, and Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents, each representing distinct cultural viewpoints, collaboratively generate a balanced, synthesized response. Our findings demonstrate that as mitigation strategies evolve from contextual prompting to MAS-implementation, cultural inclusivity markedly improves, evidenced by a significant rise in the Perspectives Distribution Score (PDS) and a PDS Entropy increase from 3.25% at baseline to 98% with the MAS-Implemented Multiplex LLMs. Sentiment analysis further shows a shift towards positive sentiment across cultures,…
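
The abstract reports PDS Entropy as a percentage but does not define it. One plausible reading, offered here purely as an assumption, is Shannon entropy of the distribution of cultural perspectives in a response, normalized by its maximum so that 0 means a single perspective and 1 (100%) means perfectly even coverage. The Python sketch below implements that reading; the function name and the counts are hypothetical.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a count distribution divided by its maximum, log(n).

    Returns 0.0 when one category holds all the mass and 1.0 when the
    mass is spread evenly over all categories.
    """
    total = sum(counts)
    if len(counts) < 2 or total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


# Made-up mention counts for four cultural perspectives in model responses.
one_sided = normalized_entropy([20, 0, 0, 0])
balanced = normalized_entropy([5, 5, 5, 5])
skewed = normalized_entropy([17, 1, 1, 1])
```

Under this reading, a jump from a few percent to nearly 100% would mean that responses moved from a single dominant perspective to almost even coverage.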

[NLP-48] Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition ICASSP2025

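[Quick Read]: This paper addresses slow inference in end-to-end automatic speech recognition (ASR) systems that decode with a Weighted Finite-State Transducer (WFST). WFST decoding requires a frame-by-frame search over CTC (Connectionist Temporal Classification) posterior probabilities, and this autoregressive procedure significantly reduces inference speed. Building on a close study of the spike property of CTC outputs, the paper proposes the Spike Window Decoding algorithm, which decodes only the frames adjacent to spike frames in the CTC output, so that the number of decoded frames scales linearly with the number of spike frames. This greatly improves inference speed while maintaining recognition performance. Experiments show that the method achieves state-of-the-art recognition accuracy on AISHELL-1 and a large-scale in-house dataset while significantly accelerating decoding.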

Link: https://arxiv.org/abs/2501.03257
Authors: Wei Zhang,Tian-Hao Zhang,Chao Luo,Hui Zhou,Chao Yang,Xinyuan Qian,Xu-Cheng Yin
Affiliations: TRIP.COM GROUP, Shanghai, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Notes: Accepted by ICASSP 2025

Abstract:Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through autoregression, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that adjacent frames to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves the inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output, while guaranteeing the recognition performance. Our method achieves SOTA recognition accuracy with significantly accelerated decoding speed, proven across both AISHELL-1 and large-scale In-House datasets, establishing a pioneering approach for integrating CTC output with WFST.
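As a rough illustration of the frame-selection idea described in the abstract, the sketch below keeps only the frames that fall within a small window around each non-blank CTC spike. The function name `spike_window_frames`, the `window` radius, the `threshold` criterion, and the blank index 0 are assumptions made for illustration; the paper's actual algorithm and its integration with the WFST decoder are not specified in this digest.

```python
def spike_window_frames(posteriors, blank=0, window=1, threshold=0.5):
    """Return sorted indices of the frames to pass on to the decoder.

    posteriors: list of per-frame probability lists (frames x vocabulary).
    A frame counts as a "spike" when its most probable token is not the
    blank and that probability is at least `threshold` (a hypothetical
    criterion chosen for this sketch).
    """
    num_frames = len(posteriors)
    keep = set()
    for t, frame in enumerate(posteriors):
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank and frame[best] >= threshold:
            lo = max(0, t - window)
            hi = min(num_frames - 1, t + window)
            keep.update(range(lo, hi + 1))
    return sorted(keep)


# Toy posteriors over {blank, "a", "b"} for 8 frames.
toy = [
    [0.9, 0.05, 0.05],
    [0.9, 0.05, 0.05],
    [0.1, 0.85, 0.05],  # spike for "a" at frame 2
    [0.9, 0.05, 0.05],
    [0.9, 0.05, 0.05],
    [0.9, 0.05, 0.05],
    [0.2, 0.1, 0.7],    # spike for "b" at frame 6
    [0.9, 0.05, 0.05],
]
selected = spike_window_frames(toy, window=1)
print(selected)  # [1, 2, 3, 5, 6, 7]
```

With a fixed window radius, at most `2 * window + 1` frames are kept per spike, so the number of decoded frames grows linearly with the number of spikes, which matches the relationship stated in the abstract.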

[NLP-49] Bridging Auditory Perception and Language Comprehension through MEG-Driven Encoding Models ICLR2024

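[Quick Read]: This paper investigates the neural mechanisms of auditory and linguistic processing by analyzing brain responses to spoken language stimuli. Its key contribution is two distinct encoding models: an audio-to-MEG encoder, which uses time-frequency decompositions (TFD) and wav2vec2 latent representations to predict neural activity, and a text-to-MEG encoder, which uses CLIP and GPT-2 embeddings. Both models successfully predict neural activity, and the text-to-MEG model is more accurate than the audio-based model, showing higher correlation especially in the frontal cortex and Broca's area. The results indicate that auditory information is handled mainly by lateral temporal regions, which perform primary auditory processing and signal integration, whereas linguistic information is processed by higher-order networks involved in semantic integration and language production, particularly in the 8-30 Hz frequency range. These findings reveal distinct neural pathways for auditory and linguistic information and offer quantitative advances in modelling neural responses to complex language stimuli.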

Link: https://arxiv.org/abs/2501.03246
Authors: Matteo Ciferri,Matteo Ferrante,Nicola Toschi
Affiliations: University of Rome, Tor Vergata; A.A. Martinos Center for Biomedical Imaging; Harvard Medical School/MGH, Boston (US)
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Notes: 10 pages, 4 figures, Accepted at ICLR2024 Workshop TS4H

Abstract:Understanding the neural mechanisms behind auditory and linguistic processing is key to advancing cognitive neuroscience. In this study, we use Magnetoencephalography (MEG) data to analyze brain responses to spoken language stimuli. We develop two distinct encoding models: an audio-to-MEG encoder, which uses time-frequency decompositions (TFD) and wav2vec2 latent space representations, and a text-to-MEG encoder, which leverages CLIP and GPT-2 embeddings. Both models successfully predict neural activity, demonstrating significant correlations between estimated and observed MEG signals. However, the text-to-MEG model outperforms the audio-based model, achieving higher Pearson Correlation (PC) score. Spatially, we identify that auditory-based embeddings (TFD and wav2vec2) predominantly activate lateral temporal regions, which are responsible for primary auditory processing and the integration of auditory signals. In contrast, textual embeddings (CLIP and GPT-2) primarily engage the frontal cortex, particularly Broca’s area, which is associated with higher-order language processing, including semantic integration and language production, especially in the 8-30 Hz frequency range. The strong involvement of these regions suggests that auditory stimuli are processed through more direct sensory pathways, while linguistic information is encoded via networks that integrate meaning and cognitive control. Our results reveal distinct neural pathways for auditory and linguistic information processing, with higher encoding accuracy for text representations in the frontal regions. These insights refine our understanding of the brain’s functional architecture in processing auditory and textual information, offering quantitative advancements in the modelling of neural responses to complex language stimuli.
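The abstract scores both encoders by the Pearson correlation between estimated and observed MEG signals. A minimal sketch of that metric follows, assuming the signals are arranged as channels by time samples; the helper names `pearson` and `mean_channel_correlation` and the toy signals are hypothetical, and the paper's exact evaluation protocol is not given here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant signal: correlation is undefined, report 0 here
    return cov / math.sqrt(var_x * var_y)


def mean_channel_correlation(predicted, observed):
    """Average the per-channel Pearson score (inputs are channels x time)."""
    scores = [pearson(p, o) for p, o in zip(predicted, observed)]
    return sum(scores) / len(scores)


# Toy signals: channel 0 is predicted well, channel 1 is predicted inverted.
observed = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
predicted = [[0.1, 1.1, 2.1, 3.1], [0.0, 1.0, 0.0, 1.0]]
print(mean_channel_correlation(predicted, observed))
```

Computing the score per channel before averaging also makes it possible to map accuracy onto sensor locations, which is the kind of spatial comparison the abstract describes.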

Computer Vision

[CV-0] LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving

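[Quick Read]: This paper addresses 3D scene understanding, in particular the underexplored potential of vision foundation models (VFMs) in autonomous driving applications. It proposes LargeAD, a framework that uses VFMs to extract semantically rich superpixels from 2D images and aligns them with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning and strengthens the semantic consistency between 2D and 3D data. The paper introduces several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across different LiDAR configurations. With these, the framework significantly outperforms existing methods on LiDAR-based segmentation and object detection, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.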

Link: https://arxiv.org/abs/2501.04005
Authors: Lingdong Kong,Xiang Xu,Youquan Liu,Jun Cen,Runnan Chen,Wenwei Zhang,Liang Pan,Kai Chen,Ziwei Liu
Affiliations: National University of Singapore; Nanjing University of Aeronautics and Astronautics; Hochschule Bremerhaven; Shanghai AI Laboratory; Hong Kong University of Science and Technology; University of Hong Kong; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: Preprint; 16 pages, 7 figures, 8 tables; Project Page at this https URL

Abstract:Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multi-modal datasets highlight our superior performance, demonstrating the adaptability, efficiency, and robustness in real-world autonomous driving scenarios.

[CV-1] LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes

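[Quick Read]: This paper addresses the fact that existing LiDAR data pretraining methods focus mainly on the sparse voxel representation and overlook the complementary attributes offered by other LiDAR representations, such as range images and raw points. To this end, it proposes LiMoE, a framework that brings the Mixture of Experts (MoE) paradigm into LiDAR representation learning to combine multiple representations synergistically. The solution has three core stages: 1) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across representations; 2) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate the relevant attributes of each representation and distills the mixed features into a unified 3D network; and 3) Semantic Mixture Supervision (SMS), which combines the semantic logits of multiple representations to improve downstream segmentation. Experiments on 11 large-scale LiDAR datasets demonstrate the method's effectiveness and superiority.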

Link: https://arxiv.org/abs/2501.04004
Authors: Xiang Xu,Lingdong Kong,Hui Shuai,Liang Pan,Ziwei Liu,Qingshan Liu
Affiliations: Nanjing University of Aeronautics and Astronautics; National University of Singapore; Shanghai AI Laboratory; Nanjing University of Posts and Telecommunications; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: Preprint; 26 pages, 17 figures, 7 tables; Project Page at this https URL

Abstract:LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across 11 large-scale LiDAR datasets demonstrate our effectiveness and superiority. The code and model checkpoints have been made publicly accessible.
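To make the mixture idea concrete, the sketch below blends feature vectors from three LiDAR representations using softmax gate weights. It is a toy illustration only: in a trained model the gate scores would come from a learned router, whereas here they are supplied by hand, and the names `mix_representations`, `features`, and `gate_scores` are hypothetical rather than taken from the LiMoE code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_representations(features, gate_scores):
    """Blend per-representation feature vectors with softmax gate weights.

    features: dict mapping a representation name to its feature vector.
    gate_scores: dict mapping the same names to unnormalized gate scores
        (assumed to be produced by a learned router in a real system).
    """
    names = sorted(features)
    weights = softmax([gate_scores[name] for name in names])
    dim = len(features[names[0]])
    mixed = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            mixed[i] += weight * value
    return mixed, dict(zip(names, weights))


# Toy features for one LiDAR point under three representations.
features = {
    "range": [1.0, 0.0],
    "voxel": [0.0, 1.0],
    "point": [1.0, 1.0],
}
gate_scores = {"range": 0.0, "voxel": 0.0, "point": 0.0}
mixed, weights = mix_representations(features, gate_scores)
print(mixed, weights)
```

Equal gate scores reduce the mixture to a plain average; a router that strongly prefers one representation makes the output approach that representation's features.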

[CV-2] Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

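[Quick Read]: This paper examines the reliability and visual grounding of Vision-Language Models (VLMs) when they generate interpretable driving decisions for autonomous driving. Although VLMs excel at natural language generation, whether they provide visually grounded, reliable, and interpretable driving decisions has not been adequately verified. The paper introduces the DriveBench benchmark dataset and evaluates 12 popular VLMs across 17 settings (clean, corrupted, and text-only inputs), covering 19,200 frames, 20,498 question-answer pairs, three question types, and four mainstream driving tasks. It finds that VLMs often produce plausible-looking responses from general knowledge or textual cues rather than from true visual grounding, especially when visual inputs are degraded or missing. This behavior is concealed by dataset imbalances and insufficient evaluation metrics and may pose significant risks in safety-critical scenarios such as autonomous driving. To address this, the paper proposes refined evaluation metrics that emphasize robust visual grounding and multi-modal understanding, and suggests leveraging VLMs' awareness of input corruptions to enhance their reliability, offering a roadmap toward more trustworthy and interpretable driving decision systems.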

Link: https://arxiv.org/abs/2501.04003
Authors: Shaoyuan Xie,Lingdong Kong,Yuhao Dong,Chonghao Sima,Wenwei Zhang,Qi Alfred Chen,Ziwei Liu,Liang Pan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Preprint; 41 pages, 32 figures, 16 tables; Project Page at this https URL

Abstract:Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs’ awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.

[CV-3] Extraction Of Cumulative Blobs From Dynamic Gestures

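[Quick Read]: This paper addresses the limitations of gesture recognition systems in poorly lit or dark environments. Gesture recognition, a perceptual user interface based on computer vision (CV), typically relies on cameras to capture and interpret gestures, so its performance degrades severely under insufficient light. The key solution is to use a night vision camera, which emits infrared light invisible to humans and can therefore capture gestures in the dark. The camera's video stream is fed into a Raspberry Pi running the OpenCV module, which detects, isolates, and tracks the path of a dynamic gesture; a machine learning algorithm then recognizes the drawn pattern, and the Raspberry Pi's GPIO pins are controlled to perform the corresponding actions. This overcomes the inability of conventional gesture recognition systems to work in dark environments.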

Link: https://arxiv.org/abs/2501.04002
Authors: Rishabh Naulakha,Shubham Gaur,Dhairya Lodha,Mehek Tulsyan,Utsav Kotecha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Gesture recognition is a perceptual user interface based on CV technology that allows the computer to interpret human motions as commands, allowing users to communicate with a computer without the use of hands, thus making the mouse and keyboard superfluous. Gesture recognition’s main weakness is its sensitivity to lighting conditions, because gesture control is based on computer vision, which relies heavily on cameras. These cameras are used to interpret gestures in 2D and 3D, so the extracted information can vary depending on the light source. As a result, such a system cannot work in a dark environment. A simple night vision camera can be used for motion capture, since it also emits infrared light, which is not visible to humans but can be clearly seen by a camera that has no infrared filter; this largely overcomes the limitation of systems that cannot work in a dark environment. The video stream from the camera is fed into a Raspberry Pi running a Python program that uses the OpenCV module to detect, isolate, and track the path of a dynamic gesture. A machine learning algorithm then recognizes the pattern drawn, and the GPIOs of the Raspberry Pi are controlled accordingly to perform some activities.
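The abstract does not spell out how the gesture path is built up, so the following is only one plausible reading of the "cumulative blobs" in the title: threshold each frame to isolate the bright, infrared-lit region, then OR the per-frame masks together so the traced path accumulates on a single canvas. The sketch uses plain Python lists instead of OpenCV so that it runs anywhere; the function names, the brightness threshold of 200, and the toy frames are all assumptions.

```python
def threshold_frame(frame, level=200):
    """Binary mask of pixels at least as bright as `level` (0-255 grayscale)."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in frame]


def accumulate_blobs(frames, level=200):
    """OR together the per-frame masks so the gesture path builds up over time."""
    height, width = len(frames[0]), len(frames[0][0])
    canvas = [[0] * width for _ in range(height)]
    for frame in frames:
        mask = threshold_frame(frame, level)
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    canvas[r][c] = 1
    return canvas


def bright_dot(size, row, col, value=255):
    """Toy frame: a dark image with one bright pixel (a stand-in for a blob)."""
    frame = [[10] * size for _ in range(size)]
    frame[row][col] = value
    return frame


# A bright spot moving along the diagonal of a 4x4 image.
frames = [bright_dot(4, i, i) for i in range(4)]
path = accumulate_blobs(frames)
for row in path:
    print("".join("#" if v else "." for v in row))
```

The accumulated canvas is a single image of the drawn pattern, which is the kind of input a pattern classifier could then recognize.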

[CV-4] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

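[Quick Read]: This paper addresses a limitation of multi-modal large language models on image and video tasks: existing models are usually restricted to specific modalities and tasks. It proposes Sa2VA, a unified model for dense grounded understanding of both images and videos. The key idea is to combine SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and to unify text, image, and video in a shared large language model (LLM) token space. The LLM generates instruction tokens that guide SAM-2 in producing precise masks, enabling grounded multi-modal understanding of both static and dynamic visual content. The paper also introduces Ref-SAV, an auto-labeled dataset with over 72k object expressions in complex video scenes, and manually validates 2k video objects in it to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art performance on multiple tasks, particularly referring video object segmentation, highlighting its potential for complex real-world applications.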

Link: https://arxiv.org/abs/2501.04001
Authors: Haobo Yuan,Xiangtai Li,Tao Zhang,Zilong Huang,Shilin Xu,Shunping Ji,Yunhai Tong,Lu Qi,Jiashi Feng,Ming-Hsuan Yang
Affiliations: UC Merced; Bytedance Seed; Wuhan University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.

[CV-5] RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance

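[Quick Read]: This paper addresses hallucinations that multi-modal retrieval-augmented generation (RAG) can introduce. RAG guides large language model (LLM) responses with external knowledge and thereby reduces hallucinations. However, multi-modal RAG may select irrelevant documents or images as raw context during retrieval, and hallucinations may arise when retrieved images are converted into text-based context by vision-language models (VLMs) or used directly by multi-modal language models (MLLMs). To address this, the paper proposes a framework that evaluates the reliability of multi-modal RAG with two performance measures: the relevancy score (RS), which assesses how relevant the retrieved entries are to the query, and the correctness score (CS), which assesses the accuracy of the generated response. The RS and CS models are trained on a ChatGPT-derived database and human evaluator samples, and both reach about 88% accuracy on test data. The paper also constructs a 5000-sample human-annotated database for evaluating the relevancy of retrieved entries and the correctness of response statements. Finally, it uses RS and CS to assess the selection and generation performance of various RAG systems.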

Link: https://arxiv.org/abs/2501.03995
Authors: Matin Mortaheb,Mohammad A. Amir Khojastepour,Srimat T. Chakradhar,Sennur Ulukus
Affiliations: University of Maryland; NEC Laboratories America
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Information Theory (cs.IT)
Notes:

Abstract:Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, reducing hallucinations. However, RAG, particularly multi-modal RAG, can introduce new hallucination sources: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) as raw context from the database, and (ii) retrieved images are processed into text-based context via vision-language models (VLMs) or directly used by multi-modal language models (MLLMs) like GPT-4o, which may hallucinate. To address this, we propose a novel framework to evaluate the reliability of multi-modal RAG using two performance measures: (i) the relevancy score (RS), assessing the relevance of retrieved entries to the query, and (ii) the correctness score (CS), evaluating the accuracy of the generated response. We train RS and CS models using a ChatGPT-derived database and human evaluator samples. Results show that both models achieve ~88% accuracy on test data. Additionally, we construct a 5000-sample human-annotated database evaluating the relevancy of retrieved pieces and the correctness of response statements. Our RS model aligns with human preferences 20% more often than CLIP in retrieval, and our CS model matches human preferences ~91% of the time. Finally, we assess various RAG systems’ selection and generation performances using RS and CS.

[CV-6] NeuralSVG: An Implicit Representation for Text-to-Vector Generation

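[Quick Read]: This paper addresses two main problems in existing text-to-vector-graphics generation methods: over-parameterized outputs, and insufficient handling of the layered structure of vector graphics, both of which limit practical usefulness. It proposes NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small multilayer perceptron (MLP) network, optimized with Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, the paper introduces a dropout-based regularization technique that strengthens the standalone meaning of each shape. NeuralSVG also offers inference-time control, letting users dynamically adapt the generated SVG based on their inputs without retraining the model. Extensive qualitative and quantitative evaluations show that NeuralSVG outperforms existing methods in generating structured and flexible SVG.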

Link: https://arxiv.org/abs/2501.03992
Authors: Sagi Polaczek,Yuval Alaluf,Elad Richardson,Yael Vinker,Daniel Cohen-Or
Affiliations: Tel Aviv University; MIT CSAIL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advancements in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure - a core feature of vector graphics - as a secondary goal, diminishing their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally demonstrate that utilizing a neural representation provides an added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.

[CV-7] VLM-driven Behavior Tree for Context-aware Task Planning

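[Quick Read]: This paper addresses how generative AI, such as large language models (LLMs) and vision-language models (VLMs), can be used to generate and edit Behavior Trees (BTs) for context-aware robot operations in visually complex environments. The key to the solution is conditional control through self-prompted visual conditions. Specifically, a VLM generates BTs containing visual condition nodes whose conditions are expressed as free-form text. A separate VLM process then integrates this text into its prompt and evaluates the conditions against real-world images during robot execution. The framework was validated in a real-world cafe scenario, demonstrating both its feasibility and its limitations.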

Link: https://arxiv.org/abs/2501.03968
Authors: Naoki Wake,Atsushi Kanehira,Jun Takamatsu,Kazuhiro Sasabuchi,Katsushi Ikeuchi
Affiliations: Applied Robotics Research, Microsoft, Redmond, WA, USA
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Notes: 10 pages, 11 figures, 5 tables. Last updated on January 7th, 2024

Abstract:The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
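The control flow described in the abstract can be sketched with a tiny behavior tree in which a condition node stores free-form text and defers its evaluation to a pluggable callable. Everything here is illustrative: the class names, the `fake_vlm` stand-in (which simply checks whether the condition text is in a set of true statements), and the cafe-style example are assumptions, and a real system would replace `fake_vlm` with a call to a vision-language model on a camera image.

```python
from dataclasses import dataclass
from typing import Callable, List

SUCCESS, FAILURE = "SUCCESS", "FAILURE"


@dataclass
class VisualCondition:
    """Condition node whose check is free-form text evaluated on an image."""
    text: str
    evaluator: Callable[[str, object], bool]  # stand-in for a VLM query

    def tick(self, image):
        return SUCCESS if self.evaluator(self.text, image) else FAILURE


@dataclass
class Action:
    """Action node that records its name instead of moving a robot."""
    name: str
    log: List[str]

    def tick(self, image):
        self.log.append(self.name)
        return SUCCESS


@dataclass
class Sequence:
    """Succeeds only if every child succeeds, stopping at the first failure."""
    children: list

    def tick(self, image):
        for child in self.children:
            if child.tick(image) == FAILURE:
                return FAILURE
        return SUCCESS


@dataclass
class Fallback:
    """Tries children in order and succeeds at the first success."""
    children: list

    def tick(self, image):
        for child in self.children:
            if child.tick(image) == SUCCESS:
                return SUCCESS
        return FAILURE


def fake_vlm(condition_text, image):
    """Hypothetical evaluator: `image` is a set of statements that hold."""
    return condition_text in image


log = []
tree = Fallback([
    Sequence([
        VisualCondition("the cup on the table is empty", fake_vlm),
        Action("collect the cup", log),
    ]),
    Action("wait", log),
])

tree.tick({"the cup on the table is empty"})  # condition holds
tree.tick(set())                              # condition fails, fall back
print(log)  # ['collect the cup', 'wait']
```

Keeping the condition as text and injecting the evaluator separates the tree structure from the perception model, which mirrors the abstract's description of one process generating the tree and another evaluating its conditions at execution time.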

[CV-8] Temporal Feature Weaving for Neonatal Echocardiographic Viewpoint Video Classification

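[Quick Read]: This paper addresses automated viewpoint classification in echocardiograms, which can enable faster diagnosis and screening in under-resourced clinics and hospitals when expert technicians are not available. The key to the solution is treating viewpoint classification as video classification rather than image classification, so that spatio-temporal information can improve accuracy. Specifically, the authors propose an architecture combining a convolutional neural network (CNN) with a gated recurrent unit (GRU), together with a novel temporal feature weaving method that improves accuracy by 4.33% over baseline image classification while using only four consecutive frames and incurring minimal computational overhead. The authors also release the Neonatal Echocardiogram Dataset (NED), which provides sixteen viewpoints with associated videos, to encourage further research in this field.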

Link: https://arxiv.org/abs/2501.03967
Authors: Satchel French,Faith Zhu,Amish Jain,Naimul Khan
Affiliations: Toronto Metropolitan University; Mount Sinai Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to ISBI 2025

Abstract:Automated viewpoint classification in echocardiograms can help under-resourced clinics and hospitals in providing faster diagnosis and screening when expert technicians may not be available. We propose a novel approach towards echocardiographic viewpoint classification. We show that treating viewpoint classification as video classification rather than image classification yields an advantage. We propose a CNN-GRU architecture with a novel temporal feature weaving method, which leverages both spatial and temporal information to yield a 4.33% increase in accuracy over baseline image classification while using only four consecutive frames. The proposed approach incurs minimal computational overhead. Additionally, we publish the Neonatal Echocardiogram Dataset (NED), a professionally-annotated dataset providing sixteen viewpoints and associated echocardiography videos to encourage future work and development in this field. Code available at: this https URL

[CV-9] Vision Language Models as Values Detectors

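[Quick Read]: This paper examines how well large language models (LLMs) that combine textual and visual inputs align with human perception, specifically in identifying relevant elements in home environment scenes. Although these models are remarkably good at generating coherent and contextually relevant text from visual stimuli, their agreement with human perception when identifying relevant elements in images needs further exploration. The authors created a set of twelve images depicting various domestic scenarios, enlisted fourteen annotators to identify the key element in each image, and compared these human annotations with the outputs of five different LLMs, including GPT-4o and four LLaVA variants. The results show that LLaVA 34B performs best but still scores low. The paper suggests that improved training and refined prompts could strengthen the models' ability to detect value-laden elements in images, enhancing their potential in applications such as social robotics, assistive technologies, and human-computer interaction.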

Link: https://arxiv.org/abs/2501.03957
Authors: Giulio Antonio Abbo,Tony Belpaeme
Affiliations: IDLab-AIRO, Ghent University – imec, Belgium
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Notes: 13 pages, 2 figures

Abstract:Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models’ potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.

[CV-10] Visual question answering: from early developments to recent advances – a survey

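[Quick Read]: This paper surveys recent progress in Visual Question Answering (VQA), examining how machines can answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. Its core contributions are a taxonomy of VQA architectures, organized by design choices and key components, and a review of deep learning-based methods, with particular attention to the emerging Large Visual Language Models (LVLMs) that have succeeded in multimodal tasks. The paper also examines the datasets and evaluation metrics used to measure VQA system performance and analyzes the potential and challenges of VQA in real-world applications. The survey is intended as a comprehensive resource that helps researchers and practitioners understand the latest advances and future directions of the field.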

Link: https://arxiv.org/abs/2501.03939
Authors: Ngoc Dung Huynh,Mohamed Reda Bouadjenek,Sunil Aryal,Imran Razzak,Hakim Hacid
Affiliations: Deakin University; The University of New South Wales; Technology Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: 20

Abstract:Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks like VQA. The paper further examines available datasets and evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future

[CV-11] CoStruction: Conjoint radiance field optimization for urban scene reconStruction with limited image overlap

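[Quick Read]: This paper addresses the challenge of reconstructing surrounding surface geometry from driving sequences, given limited image overlap and the complex topology of urban environments. Existing state-of-the-art (SoTA) neural implicit surface reconstruction methods perform poorly in such settings, often failing because of insufficient visual overlap or being unable to reconstruct both surfaces and fine structures accurately. To address these limitations, the authors propose CoStruction, a novel hybrid implicit surface reconstruction method tailored to large driving sequences with limited camera overlap. CoStruction uses cross-representation uncertainty estimation to filter out ambiguous geometry caused by limited observations, and it jointly optimizes both radiance fields together with guided sampling to reconstruct large areas and fine structures accurately in complex urban scenes. Experiments show that the method outperforms existing SoTA methods at reconstructing large driving sequences with limited image overlap.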

Link: https://arxiv.org/abs/2501.03932
Authors: Fusang Wang,Hala Djeghim,Nathan Piasco,Moussab Bennehar,Luis Roldão,Dzmitry Tsishkou
Affiliations: Noah’s Ark, Huawei Paris Research Center, France; CAOR, Mines Paris-PSL, France; IBISC, Evry Paris-Saclay University, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Reconstructing the surrounding surface geometry from recorded driving sequences poses a significant challenge due to the limited image overlap and complex topology of urban environments. SoTA neural implicit surface reconstruction methods often struggle in such settings, either failing due to small vision overlap or exhibiting suboptimal performance in accurately reconstructing both the surface and fine structures. To address these limitations, we introduce CoStruction, a novel hybrid implicit surface reconstruction method tailored for large driving sequences with limited camera overlap. CoStruction leverages cross-representation uncertainty estimation to filter out ambiguous geometry caused by limited observations. Our method performs joint optimization of both radiance fields in addition to guided sampling, achieving accurate reconstruction of large areas along with fine structures in complex urban scenarios. Extensive evaluation on major driving datasets demonstrates the superiority of our approach in reconstructing large driving sequences with limited image overlap, outperforming concurrent SoTA methods.

[CV-12] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

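[Quick Read]: This paper addresses identity preservation when generating high-quality videos with dynamic motion. Existing video diffusion models perform well at text-to-video generation, but maintaining a consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. The paper proposes the Magic Mirror framework, whose solution has three core components: (1) a dual-branch facial feature extractor that captures both identity and structural features; (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration; and (3) a two-stage training strategy that combines synthetic identity pairs with video data. Experiments show that Magic Mirror strikes a good balance between identity consistency and natural motion and outperforms existing methods on multiple metrics while adding only a small number of parameters.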

Link: https://arxiv.org/abs/2501.03931
Authors: Yuechen Zhang,Yaoyang Liu,Bin Xia,Bohao Peng,Zexin Yan,Eric Lo,Jiaya Jia
Affiliations: CUHK; HKUST; SmartMore; CMU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: It is best viewed in Acrobat. Project Page: this https URL

Abstract:We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while requiring minimal parameters added. The code and model will be made publicly available at: this https URL

[CV-13] HYB-VITON: A Hybrid Approach to Virtual Try-On Combining Explicit and Implicit Warping ICASSP2025

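[Summary]: This paper addresses the trade-off in virtual try-on systems between preserving garment details and producing natural-looking results. Existing methods fall into two categories: explicit warping and implicit warping. Explicit warping preserves garment details but often produces unnatural images, while implicit warping produces natural results but preserves details poorly. The proposed HYB-VITON combines the strengths of both: a preprocessing pipeline and a new training option let it use the detail-preserving regions of explicitly warped garments while relying on implicit warping for natural image reconstruction. Experiments show that HYB-VITON preserves garment details more faithfully than recent diffusion-based methods and produces more realistic results than a state-of-the-art explicit warping method.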

Link: https://arxiv.org/abs/2501.03910
Authors: Kosuke Takemoto,Takafumi Koshinaka
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ICASSP 2025

Abstract:Virtual try-on systems have significant potential in e-commerce, allowing customers to visualize garments on themselves. Existing image-based methods fall into two categories: those that directly warp garment-images onto person-images (explicit warping), and those using cross-attention to reconstruct given garments (implicit warping). Explicit warping preserves garment details but often produces unrealistic output, while implicit warping achieves natural reconstruction but struggles with fine details. We propose HYB-VITON, a novel approach that combines the advantages of each method and includes both a preprocessing pipeline for warped garments and a novel training option. These components allow us to utilize beneficial regions of explicitly warped garments while leveraging the natural reconstruction of implicit warping. A series of experiments demonstrates that HYB-VITON preserves garment details more faithfully than recent diffusion-based methods, while producing more realistic results than a state-of-the-art explicit warping method.

[CV-14] Superpixel Boundary Correction for Weakly-Supervised Semantic Segmentation on Histopathology Images

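[Summary]: This paper addresses the low spatial resolution and unclear boundaries of Class Activation Map (CAM)-based weakly supervised semantic segmentation (WSSS) methods used in computational pathology for cancer diagnosis and subtyping. The authors propose a multi-level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and floodfill. Experiments on a breast cancer segmentation dataset report a mean intersection over union (mIoU) of 71.08%, a significant improvement in delineating tumor microenvironment boundaries.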

Link: https://arxiv.org/abs/2501.03891
Authors: Hongyi Wu,Hong Zhang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures

Abstract:With the rapid advancement of deep learning, computational pathology has made significant progress in cancer diagnosis and subtyping. Tissue segmentation is a core challenge, essential for prognosis and treatment decisions. Weakly supervised semantic segmentation (WSSS) reduces the annotation requirement by using image-level labels instead of pixel-level ones. However, Class Activation Map (CAM)-based methods still suffer from low spatial resolution and unclear boundaries. To address these issues, we propose a multi-level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and floodfill. Experimental results show that our method achieves great performance on a breast cancer segmentation dataset with an mIoU of 71.08%, significantly improving tumor microenvironment boundary delineation.
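
The abstract does not spell out the multi-level algorithm, so the following is only a minimal sketch of the general idea of snapping a coarse CAM-derived label map to superpixel boundaries. It assumes a single level and a simple majority vote per superpixel; the function name `snap_labels_to_superpixels` and the toy arrays are hypothetical, and the floodfill step is not shown.

```python
from collections import Counter, defaultdict


def snap_labels_to_superpixels(coarse_labels, superpixel_ids):
    """Assign every pixel the majority coarse label of its superpixel."""
    votes = defaultdict(Counter)
    for label_row, sp_row in zip(coarse_labels, superpixel_ids):
        for label, sp in zip(label_row, sp_row):
            votes[sp][label] += 1
    # Majority label per superpixel (the toy example below has no ties).
    majority = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[majority[sp] for sp in sp_row] for sp_row in superpixel_ids]


# Toy 3x4 example: a blurry coarse mask and a two-region superpixel map.
coarse = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
superpixels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
refined = snap_labels_to_superpixels(coarse, superpixels)
```

In this toy case the single mislabeled pixel inside superpixel 0 is overwritten by the region's majority label, so the refined mask follows the superpixel outline.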

[CV-15] CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds

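[Summary]: This paper addresses the lack of informational granularity and clarity in the visual and textual content of training datasets for existing 3D Large Multimodal Models (3D LMMs), which limits precise cross-modal understanding. It proposes CL3DOR (Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds). The key steps are to increase the point cloud density per object and to construct informative hard negative responses in the training dataset to penalize unwanted responses. To make use of the hard negatives, CL3DOR adds the odds ratio as an auxiliary contrastive learning term to the conventional language modeling loss. Experiments show that CL3DOR achieves state-of-the-art performance on 3D scene understanding and reasoning benchmarks.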

Link: https://arxiv.org/abs/2501.03879
Authors: Keonwoo Kim,Yeongjae Cho,Taebaek Hwang,Minsoo Jo,Sangdo Han
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent research has demonstrated that Large Language Models (LLMs) are not limited to text-only tasks but can also function as multimodal models across various modalities, including audio, images, and videos. In particular, research on 3D Large Multimodal Models (3D LMMs) is making notable strides, driven by the potential of processing higher-dimensional data like point clouds. However, upon closer examination, we find that the visual and textual content within each sample of existing training datasets lacks both high informational granularity and clarity, which serve as a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds, designed to ensure greater specificity and clarity in both visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To leverage hard negative responses, we incorporate the odds ratio as an auxiliary term for contrastive learning into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR’s key components through extensive experiments.
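
The abstract does not give the loss formula. As a rough illustration of how an odds-ratio term can be added to a language modeling loss, the sketch below assumes an ORPO-style form, where the odds of a response are p / (1 - p) and p is the exponential of its length-averaged log-probability. The names `log_odds`, `odds_ratio_loss`, and the `weight` value are assumptions for illustration, not the paper's definitions.

```python
import math


def log_odds(avg_log_prob):
    """log(p / (1 - p)) for p = exp(avg_log_prob); requires avg_log_prob < 0."""
    p = math.exp(avg_log_prob)
    return avg_log_prob - math.log(1.0 - p)


def odds_ratio_loss(pos_avg_log_prob, neg_avg_log_prob, weight=0.1):
    """Language modeling loss on the positive response plus an odds-ratio penalty."""
    lm_loss = -pos_avg_log_prob
    margin = log_odds(pos_avg_log_prob) - log_odds(neg_avg_log_prob)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    or_loss = math.log1p(math.exp(-margin))
    return lm_loss + weight * or_loss


# The penalty shrinks when the hard negative is much less likely than the positive.
better = odds_ratio_loss(math.log(0.8), math.log(0.2))
worse = odds_ratio_loss(math.log(0.8), math.log(0.8))
```

Under this assumed form, a model that assigns the hard negative the same likelihood as the positive response pays the full penalty, while a model that separates them pays almost none.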

[CV-16] ZDySS – Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

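[Summary]: This paper addresses spatio-temporal consistency in the stylization of dynamic scenes. Existing methods mostly target static scenes and usually require an optimization process for each style image, which limits their adaptability. The paper proposes ZDySS, a zero-shot stylization framework that generalizes to unseen style images at inference. Its key idea is to represent the scene with Gaussian splatting and associate each Gaussian with a learned feature vector, so that a feature map can be rendered for any view and timestamp. Applying style transfer to the learned feature vectors rather than to the rendered feature map improves spatio-temporal consistency across frames. In tests on real-world dynamic scenes, the method shows better performance and coherence than state-of-the-art baselines.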

Link: https://arxiv.org/abs/2501.03875
Authors: Abhishek Saroha,Florian Hofherr,Mariia Gladkova,Cecilia Curreli,Or Litany,Daniel Cremers
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Technion; Nvidia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

[CV-17] Neuromorphic Optical Tracking and Imaging of Randomly Moving Targets through Strongly Scattering Media

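[Summary]: This paper addresses tracking and optical imaging of randomly moving targets hidden by scattering media, a problem that matters to many applications requiring precise localization and identification. The key to the solution is an end-to-end neuromorphic optical engineering and computational approach that combines an event-detecting camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronous spike trains, which separates target-specific information from the dominant uninformative background. The spike data is then fed into a deep spiking neural network (SNN) engine, where two parallel, interconnected modules perform target tracking and image reconstruction over discrete time steps. Experiments show that the method tracks and reconstructs targets moving randomly in dense turbid media, as well as spatially stationary but optically dynamic objects, with high computational efficiency and low power consumption.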

Link: https://arxiv.org/abs/2501.03874
Authors: Ning Zhang,Timothy Shea,Arto Nurmikko
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 22 pages, 6 figures

Abstract:Tracking and acquiring simultaneous optical images of randomly moving targets obscured by scattering media remains a challenging problem of importance to many applications that require precise object localization and identification. In this work we develop an end-to-end neuromorphic optical engineering and computational approach to demonstrate how to track and image normally invisible objects by combining an event detecting camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronized spike trains - a first step in isolating object-specific information from the dominant uninformative background. Spiking data is fed into a deep spiking neural network (SNN) engine where object tracking and image reconstruction are performed by two separate yet interconnected modules running in parallel in discrete time steps over the event duration. Through benchtop experiments we demonstrate tracking and imaging randomly moving objects in dense turbid media as well as image reconstruction of spatially stationary but optically dynamic objects. Standardized character sets serve as representative proxies for geometrically complex objects, underscoring the method’s generality. The results highlight the advantages of a fully neuromorphic approach in meeting a major imaging technology with high computational efficiency and low power consumption.
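
To make the "pixel-wise spike train" step concrete, here is a minimal sketch of the commonly used idealized event-camera model, in which a pixel emits a signed event each time its log intensity moves by a fixed contrast threshold. This is an assumption about the sensor model, not a description of the authors' hardware or pipeline; `pixel_events` and the sample values are hypothetical.

```python
import math


def pixel_events(intensities, threshold=0.2):
    """Return (time_index, polarity) events for one pixel's intensity samples."""
    events = []
    reference = math.log(intensities[0])
    for t, value in enumerate(intensities[1:], start=1):
        change = math.log(value) - reference
        # Emit one event per threshold crossing since the last reference level.
        while abs(change) >= threshold:
            polarity = 1 if change > 0 else -1
            events.append((t, polarity))
            reference += polarity * threshold
            change = math.log(value) - reference
    return events


# A pixel that brightens and then dims produces positive, then negative events.
spikes = pixel_events([1.0, 2.0, 0.9], threshold=0.25)
# A constant pixel (static background) produces no events at all.
silent = pixel_events([1.0, 1.0, 1.0], threshold=0.25)
```

Because unchanged pixels stay silent, a static or slowly varying background contributes few events, which is one reason this representation suits sparse, spike-based processing.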

[CV-18] Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

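[Summary]: This paper addresses precise control over the video generation process, such as camera manipulation or content editing, when generating high-quality videos. Existing methods typically handle only a single control type and lack the flexibility to meet diverse control demands. The proposed solution, Diffusion as Shader (DaS), relies on 3D control signals to support multiple video control tasks. Unlike prior methods limited to 2D control signals, DaS uses 3D tracking videos as control inputs, which makes the video diffusion process 3D-aware. DaS can therefore perform a wide range of control tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation, simply by manipulating the 3D tracking videos. The 3D tracking videos also link frames effectively, which significantly improves the temporal consistency of the generated videos.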

Link: https://arxiv.org/abs/2501.03847
Authors: Zekai Gu,Rui Yan,Jiahao Lu,Peng Li,Zhiyang Dou,Chenyang Si,Zhen Dong,Qifeng Liu,Cheng Lin,Ziwei Liu,Wenping Wang,Yuan Liu
Affiliations: Hong Kong University of Science and Technology; Zhejiang University; The University of Hong Kong; Nanyang Technological University; Wuhan University; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL Codes: this https URL

Abstract:Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.

[CV-19] LM-Net: A Light-weight and Multi-scale Network for Medical Image Segmentation

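[Summary]: This paper addresses the limitations of current medical image segmentation methods in deeply exploring multi-scale information and in effectively combining local detail textures with global contextual semantics, which lead to over-segmentation, under-segmentation, and blurred segmentation boundaries. It proposes a novel lightweight multi-scale architecture, LM-Net, that combines the advantages of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to improve segmentation accuracy. LM-Net uses a lightweight multi-branch module to capture multi-scale features at the same level, and introduces two modules, the Local Feature Transformer (LFT) and the Global Feature Transformer (GFT), to capture local detail textures and global semantics at different levels. The LFT captures local detail textures with local window self-attention, while the GFT captures global contextual semantics with global self-attention. Combining these modules makes local and global representations complementary and alleviates blurred boundaries in medical image segmentation. Experiments show that LM-Net achieves state-of-the-art performance on three publicly available datasets with low computational complexity, demonstrating its effectiveness and adaptability across medical image segmentation tasks.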

Link: https://arxiv.org/abs/2501.03838
Authors: Zhenkun Lu,Chaoyin She,Wei Wang,Qinghua Huang
Affiliations: College of Electronic Information, Guangxi Minzu University; Department of Medical Ultrasonics, Institute of Diagnostic and Interventional Ultrasound, The First Affiliated Hospital of Sun Yat-Sen University; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current medical image segmentation approaches have limitations in deeply exploring multi-scale information and effectively combining local detail textures with global contextual semantic information. This results in over-segmentation, under-segmentation, and blurred segmentation boundaries. To tackle these challenges, we explore multi-scale feature representations from different perspectives, proposing a novel, lightweight, and multi-scale architecture (LM-Net) that integrates advantages of both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to enhance segmentation accuracy. LM-Net employs a lightweight multi-branch module to capture multi-scale features at the same level. Furthermore, we introduce two modules to concurrently capture local detail textures and global semantics with multi-scale features at different levels: the Local Feature Transformer (LFT) and Global Feature Transformer (GFT). The LFT integrates local window self-attention to capture local detail textures, while the GFT leverages global self-attention to capture global contextual semantics. By combining these modules, our model achieves complementarity between local and global representations, alleviating the problem of blurred segmentation boundaries in medical image segmentation. To evaluate the feasibility of LM-Net, extensive experiments have been conducted on three publicly available datasets with different modalities. Our proposed model achieves state-of-the-art results, surpassing previous methods, while only requiring 4.66G FLOPs and 5.4M parameters. These state-of-the-art results on three datasets with different modalities demonstrate the effectiveness and adaptability of our proposed LM-Net for various medical image segmentation tasks.

[CV-20] MeshConv3D: Efficient convolution and pooling operators for triangular 3D meshes

Link: https://arxiv.org/abs/2501.03830
Authors: Germain Bregeon,Marius Preda,Radu Ispas,Titus Zaharia
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

[CV-21] MADation: Face Morphing Attack Detection with Foundation Models WACV2025

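[Summary]: This paper addresses the security of face recognition algorithms against morphing attacks. Although face recognition has improved considerably in recent years, the same technical advances can be used to build efficient attacks that threaten secure deployment. Morphing attack detection (MAD) systems aim to detect this specific type of attack at an early stage, before it reaches critical verification processes. The paper proposes a solution based on foundation models (FMs): it adapts CLIP architectures with LoRA weights while simultaneously training a classification header, resulting in a framework named MADation. MADation surpasses the authors' alternative FM and transformer-based frameworks, is competitive with current MAD solutions, and outperforms them in several evaluation scenarios. Its effectiveness comes from retaining the knowledge acquired during FM pre-training while adapting to the specifics of the MAD task.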

Link: https://arxiv.org/abs/2501.03800
Authors: Eduarda Caldeira,Guray Ozgur,Tahar Chettaoui,Marija Ivanovska,Fadi Boutros,Vitomir Struc,Naser Damer
Affiliations: Fraunhofer IGD, Darmstadt, Germany; University of Ljubljana, Ljubljana, Slovenia; TU Darmstadt, Darmstadt, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Accepted at WACV 2025 workshops

Abstract:Despite the considerable performance improvements of face recognition algorithms in recent years, the same scientific advances responsible for this progress can also be used to create efficient ways to attack them, posing a threat to their secure deployment. Morphing attack detection (MAD) systems aim to detect a specific type of threat, morphing attacks, at an early stage, preventing them from being considered for verification in critical processes. Foundation models (FM) learn from extensive amounts of unlabeled data, achieving remarkable zero-shot generalization to unseen domains. Although this generalization capacity might be weak when dealing with domain-specific downstream tasks such as MAD, FMs can easily adapt to these settings while retaining the built-in knowledge acquired during pre-training. In this work, we recognize the potential of FMs to perform well in the MAD task when properly adapted to its specificities. To this end, we adapt FM CLIP architectures with LoRA weights while simultaneously training a classification header. The proposed framework, MADation, surpasses our alternative FM and transformer-based frameworks and constitutes the first adaptation of FMs to the MAD task. MADation presents competitive results with current MAD solutions in the literature and even surpasses them in several evaluation scenarios. To encourage reproducibility and facilitate further research in MAD, we publicly release the implementation of MADation at https://github.com/gurayozgur/MADation
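
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update applied to one frozen linear layer: the output is W x plus a scaled product B (A x), and only A and B are trained. It is a generic illustration in plain Python, not code from the MADation repository; `lora_linear`, `matvec`, and all matrix values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, frozen_w, lora_a, lora_b, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the LoRA rank."""
    rank = len(lora_a)  # A has shape (r, in_features)
    base = matvec(frozen_w, x)  # frozen pre-trained path
    update = matvec(lora_b, matvec(lora_a, x))  # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


x = [1, 2, 3]
frozen_w = [[1, 0, 0], [0, 1, 0]]  # 2 x 3 frozen weight
lora_a = [[1, 1, 1]]               # rank r = 1
lora_b = [[0.5], [-0.5]]           # 2 x 1

adapted = lora_linear(x, frozen_w, lora_a, lora_b, alpha=2)
# With B set to zero (the usual initialization) the layer matches the frozen one.
untouched = lora_linear(x, frozen_w, lora_a, [[0.0], [0.0]], alpha=2)
```

The zero-initialized case illustrates why adaptation can start from the pre-trained behavior and drift only as far as the task requires.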

[CV-22] KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration ICASSP2025

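[Summary]: This paper addresses two main problems in zero-shot anomaly detection (ZSAD). First, existing methods rely on manually crafted fixed textual descriptions or anomaly prompts, which is time-consuming and prone to semantic ambiguity. Second, existing vision-language models such as CLIP perform poorly at pixel-level anomaly segmentation because they focus on global semantics rather than local details. The proposed KAnoCLIP framework uses Knowledge-Driven Prompt Learning (KnPL) to combine general knowledge from a large language model (GPT-3.5) with fine-grained, image-specific knowledge from a visual question answering system (Llama3) and to produce learnable anomaly prompts, which removes the need for fixed text prompts and improves generalization. KAnoCLIP also includes the CLIP-VV visual encoder, Bi-Directional Cross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and a Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, which improves pixel-level anomaly detection.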

Link: https://arxiv.org/abs/2501.03786
Authors: Chengyuan Li,Suyang Zhou,Jieping Kong,Lei Qi,Hui Xue
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025

Abstract:Zero-shot anomaly detection (ZSAD) identifies anomalies without needing training samples from the target dataset, essential for scenarios with privacy concerns or limited data. Vision-language models like CLIP show potential in ZSAD but have limitations: relying on manually crafted fixed textual descriptions or anomaly prompts is time-consuming and prone to semantic ambiguity, and CLIP struggles with pixel-level anomaly segmentation, focusing more on global semantics than local details. To address these limitations, we introduce KAnoCLIP, a novel ZSAD framework that leverages vision-language models. KAnoCLIP combines general knowledge from a Large Language Model (GPT-3.5) and fine-grained, image-specific knowledge from a Visual Question Answering system (Llama3) via Knowledge-Driven Prompt Learning (KnPL). KnPL uses a knowledge-driven (KD) loss function to create learnable anomaly prompts, removing the need for fixed text prompts and enhancing generalization. KAnoCLIP includes the CLIP visual encoder with V-V attention (CLIP-VV), Bi-Directional Cross-Attention for Multi-Level Cross-Modal Interaction (Bi-CMCI), and Conv-Adapter. These components preserve local visual semantics, improve local cross-modal fusion, and align global visual features with textual information, enhancing pixel-level anomaly detection. KAnoCLIP achieves state-of-the-art performance in ZSAD across 12 industrial and medical datasets, demonstrating superior generalization compared to existing methods.

[CV-23] Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection

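[Summary]: This paper addresses the difficulty of detecting high aspect ratio objects in remote sensing object detection. Recent remote sensing detectors typically extract features with square large-kernel convolutions, which work less well for such objects. The paper proposes a new network architecture, Strip R-CNN, whose core idea is to use sequential orthogonal large strip convolutions to better capture spatial information. It also decouples the detection heads and adds strip convolutions to the localization head to improve object localization. Experiments show that Strip R-CNN clearly improves on previous work across several benchmarks (DOTA, FAIR1M, HRSC2016, and DIOR); on DOTA-v1.0 the 30M model reaches 82.75% mAP (mean Average Precision), a new state of the art.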

Link: https://arxiv.org/abs/2501.03775
Authors: Xinbin Yuan,ZhaoHui Zheng,Yuxuan Li,Xialei Liu,Li Liu,Xiang Li,Qibin Hou,Ming-Ming Cheng
Affiliations: VCIP, School of Computer Science, NKU (Nankai University); NKIARI, Futian, Shenzhen; NUTD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While witnessed with rapid development, remote sensing object detection remains challenging for detecting high aspect ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing object detection and can detect objects of various aspect ratios well. Based on large strip convolutions, we build a new network architecture called Strip R-CNN, which is simple, efficient, and powerful. Unlike recent remote sensing object detectors that leverage large-kernel convolutions with square shapes, our Strip R-CNN takes advantage of sequential orthogonal large strip convolutions to capture spatial information. In addition, we enhance the localization capability of remote-sensing object detectors by decoupling the detection heads and equipping the localization head with strip convolutions to better localize the target objects. Extensive experiments on several benchmarks, e.g., DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN can largely improve previous works. Notably, our 30M model achieves 82.75% mAP on DOTA-v1.0, setting a new state-of-the-art record. Code is available at this https URL.
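
As a rough illustration of what "sequential orthogonal strip convolutions" means, the sketch below applies a 1 x k kernel along rows and then a k x 1 kernel along columns of a single-channel image, using zero padding and plain Python lists. It is not the Strip R-CNN implementation: the function names are hypothetical, and the all-ones kernels stand in for weights that would be learned.

```python
def strip_conv_horizontal(image, kernel):
    """Cross-correlate each row with a 1 x k kernel (zero padding, same size)."""
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for i, w in enumerate(kernel):
                cc = c + i - half
                if 0 <= cc < width:
                    out[r][c] += w * image[r][cc]
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def strip_conv_vertical(image, kernel):
    """Apply a k x 1 kernel along columns by reusing the row-wise version."""
    return transpose(strip_conv_horizontal(transpose(image), kernel))


def sequential_strip_conv(image, horizontal_kernel, vertical_kernel):
    """Horizontal strip convolution followed by a vertical one."""
    return strip_conv_vertical(strip_conv_horizontal(image, horizontal_kernel), vertical_kernel)


# A single bright pixel shows the combined receptive field of the two strips.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
response = sequential_strip_conv(impulse, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

The two strips together cover a k x k neighborhood with 2k weights instead of k squared, and using different lengths for the two kernels gives an elongated receptive field, which is the intuition for elongated objects.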

[CV-24] AutoFish: Dataset and Benchmark for Fine-grained Analysis of Fish WACV’25

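[Summary]: This paper addresses automated fish documentation in support of sustainable fisheries management and efforts against overfishing. It presents AutoFish, a publicly available dataset designed for fine-grained fish analysis. The dataset contains 1,500 images of 454 specimens of visually similar fish placed in various constellations on a white conveyor belt and annotated with instance segmentation masks, IDs, and length measurements. The data was collected in a controlled environment with an RGB camera; annotation involved manual point annotations, initial segmentation masks proposed by the Segment Anything Model (SAM), and subsequent manual correction of the masks. The paper establishes baseline instance segmentation results with two variations of the Mask2Former architecture, with the best model reaching an mAP of 89.15%. It also presents two baseline length estimation methods; the best is a custom MobileNetV2-based regression model with a mean absolute error (MAE) of 0.62 cm on images without occlusion and 1.38 cm on images with occlusion. The core contribution is a high-quality public dataset, paired with deep learning baselines for instance segmentation and length estimation.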

Link: https://arxiv.org/abs/2501.03767
Authors: Stefan Hein Bengtson,Daniel Lehotský,Vasiliki Ismiroglou,Niels Madsen,Thomas B. Moeslund,Malte Pedersen
Affiliations: Visual Analysis and Perception Lab, Aalborg University, Denmark; Pioneer Centre for AI, Copenhagen, Denmark; Section of Biology and Environmental Science, Aalborg University, Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In the 3rd Workshop on Maritime Computer Vision (MaCVi) at WACV’25

Abstract:Automated fish documentation processes are in the near future expected to play an essential role in sustainable fisheries management and for addressing challenges of overfishing. In this paper, we present a novel and publicly available dataset named AutoFish designed for fine-grained fish analysis. The dataset comprises 1,500 images of 454 specimens of visually similar fish placed in various constellations on a white conveyor belt and annotated with instance segmentation masks, IDs, and length measurements. The data was collected in a controlled environment using an RGB camera. The annotation procedure involved manual point annotations, initial segmentation masks proposed by the Segment Anything Model (SAM), and subsequent manual correction of the masks. We establish baseline instance segmentation results using two variations of the Mask2Former architecture, with the best performing model reaching an mAP of 89.15%. Additionally, we present two baseline length estimation methods, the best performing being a custom MobileNetV2-based regression model reaching an MAE of 0.62cm in images with no occlusion and 1.38cm in images with occlusion. Link to project page: this https URL.
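
The length-estimation results above are reported as mean absolute error (MAE). As a reminder of how that metric is computed, here is a minimal sketch; the function name and the length values are made up for illustration and are not taken from AutoFish.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - ground truth| over all samples."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical fish lengths in centimeters.
predicted_cm = [30.5, 25.0, 41.0]
actual_cm = [30.0, 26.0, 41.0]
mae_cm = mean_absolute_error(predicted_cm, actual_cm)
```

Computing the metric separately over occluded and non-occluded images, as the paper reports, only requires splitting the sample lists before calling the function.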

[CV-25] Image Segmentation: Inducing graph-based learning

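[Summary]: This paper explores the potential of graph neural networks (GNNs) for semantic segmentation across diverse image modalities, in particular fisheye images with strong geometric distortion and medical images with intricate boundaries. It proposes a novel GNN-based U-Net architecture (UNet-GNN) and evaluates its effectiveness on three datasets: PascalVOC, WoodScape, and ISIC2016. Unlike conventional convolutional neural networks (CNNs) and the transformer-based SwinUNet, GNNs explicitly model relationships between image regions by constructing and operating on a graph representation of the image features. This lets the model capture long-range dependencies and complex spatial relationships, which the authors hypothesize is especially useful for the geometric distortions in fisheye imagery and for fine boundaries in medical images. The key idea is to use the graph-structured modeling ability of GNNs to improve segmentation accuracy, with applications such as autonomous driving and medical image analysis.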

Link: https://arxiv.org/abs/2501.03765
Authors: Aryan Singh,Pepijn Van de Ven,Ciarán Eising,Patrick Denny
Affiliations: Data-Driven Computer Engineering Group, Dept. of Electronic and Computer Engineering, University of Limerick, Ireland
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:This study explores the potential of graph neural networks (GNNs) to enhance semantic segmentation across diverse image modalities. We evaluate the effectiveness of a novel GNN-based U-Net architecture on three distinct datasets: PascalVOC, a standard benchmark for natural image segmentation, WoodScape, a challenging dataset of fisheye images commonly used in autonomous driving, introducing significant geometric distortions; and ISIC2016, a dataset of dermoscopic images for skin lesion segmentation. We compare our proposed UNet-GNN model against established convolutional neural networks (CNNs) based segmentation models, including U-Net and U-Net++, as well as the transformer-based SwinUNet. Unlike these methods, which primarily rely on local convolutional operations or global self-attention, GNNs explicitly model relationships between image regions by constructing and operating on a graph representation of the image features. This approach allows the model to capture long-range dependencies and complex spatial relationships, which we hypothesize will be particularly beneficial for handling geometric distortions present in fisheye imagery and capturing intricate boundaries in medical images. Our analysis demonstrates the versatility of GNNs in addressing diverse segmentation challenges and highlights their potential to improve segmentation accuracy in various applications, including autonomous driving and medical image analysis.

[CV-26] Realistic Test-Time Adaptation of Vision-Language Models

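[Summary]: This paper addresses limitations of existing transductive learning and test-time adaptation (TTA) methods for Vision-Language Models (VLMs). These methods often assume a favorable test distribution, for example that all classes appear in the test set, and in practice they can reduce the model's zero-shot robustness. The paper introduces a more realistic evaluation framework that includes (i) a variable number of effective classes within a single batch and (ii) non-i.i.d. batches of test samples in online adaptation settings. Through comprehensive evaluations, comparisons, and ablation studies, it shows how existing methods systematically compromise the model's initial zero-shot robustness in these realistic scenarios. To address this, the paper proposes StatA, a versatile method that handles a wide range of deployment scenarios, including a variable number of effective classes at test time. The key to StatA is a novel regularization term designed specifically for VLMs, which acts as a statistical anchor that preserves the initial text-encoder knowledge, particularly in low-data regimes.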

Link: https://arxiv.org/abs/2501.03729
Authors: Maxime Zanella,Clément Fuchs,Christophe De Vleeschouwer,Ismail Ben Ayed
Affiliations: UCLouvain, Belgium; UMons, Belgium; ÉTS Montreal, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models’ initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples’ distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at this https URL.
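
To illustrate the first evaluation condition, the sketch below builds a test batch in which only a random subset of classes is present. It is a minimal, hypothetical sampler written for this digest, not the paper's benchmark code; `sample_batch` and its parameters are assumptions, and sampling is done with replacement for simplicity.

```python
import random


def sample_batch(labels, batch_size, num_effective_classes, rng):
    """Draw sample indices whose labels come from a random subset of classes."""
    classes = sorted(set(labels))
    active = set(rng.sample(classes, num_effective_classes))
    pool = [i for i, label in enumerate(labels) if label in active]
    # Sampling with replacement keeps the sketch short.
    return [rng.choice(pool) for _ in range(batch_size)]


# Toy dataset: 200 samples spread evenly over 10 classes.
labels = [i % 10 for i in range(200)]
batch = sample_batch(labels, batch_size=16, num_effective_classes=3, rng=random.Random(0))
classes_in_batch = {labels[i] for i in batch}
```

A method that assumes every class appears in each batch would be evaluated here on batches that contain at most three of the ten classes, which is the kind of mismatch the paper's framework is meant to expose.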

[CV-27] Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein

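[Summary]: This paper addresses accurate segmentation of pulmonary structures in medical image analysis, particularly when labeled data is limited. Deep learning-based segmentation has made significant progress, but most methods require large amounts of labeled training data. The paper proposes the Language-guided self-adaptive Cross-Attention Fusion Framework, which exploits the generalization ability of pre-trained vision-language foundation models such as CLIP on downstream tasks like segmentation to reach high performance with a small amount of labeled data. The key is to use pre-trained CLIP as a strong feature extractor for segmenting 3D CT scans and to fuse text and image representations adaptively through a specially designed adapter module. On pulmonary artery-vein segmentation, the method outperforms other state-of-the-art methods by a large margin.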

Link: https://arxiv.org/abs/2501.03722
Authors: Xiaotong Guo,Deqian Yang,Dan Wang,Haochen Zhao,Yuan Li,Zhilin Sui,Tao Zhou,Lijun Zhang,Yanda Meng
Affiliations: Department of Thoracic Surgery, National Clinical Research Center for Cancer/Cancer Hospital Shenzhen Hospital; Key Laboratory of System Software (Chinese Academy of Sciences) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China; School of Intelligent Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China; School of Computer Science and Engineering, Beihang University, Beijing, China; Guangzhou Jiayi Software Technology Co., Ltd.; R&D Center, Guangxi Huayi Artificial Intelligence Medical Technology Co., Ltd; Department of Computer Science, University of Exeter, Exeter, UK; Department of Cardiovascular & Metabolic Medicine, University of Liverpool, Liverpool, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages,3 figures

Abstract:Accurate segmentation of pulmonary structures is crucial in clinical diagnosis, disease study, and treatment planning. Significant progress has been made in deep learning-based segmentation techniques, but most require much labeled data for training. Consequently, developing precise segmentation methods that demand fewer labeled datasets is paramount in medical image analysis. The emergence of pre-trained vision-language foundation models, such as CLIP, recently opened the door for universal computer vision tasks. Exploiting the generalization ability of these pre-trained foundation models on downstream tasks, such as segmentation, leads to unexpected performance with a relatively small amount of labeled data. However, exploring these models for pulmonary artery-vein segmentation is still limited. This paper proposes a novel framework called Language-guided self-adaptive Cross-Attention Fusion Framework. Our method adopts pre-trained CLIP as a strong feature extractor for generating the segmentation of 3D CT scans, while adaptively aggregating the cross-modality of text and image representations. We propose a specially designed adapter module to fine-tune pre-trained CLIP with a self-adaptive learning strategy to effectively fuse the two modalities of embeddings. We extensively validate our method on a local dataset, which is the largest pulmonary artery-vein CT dataset to date and consists of 718 labeled data in total. The experiments show that our method outperformed other state-of-the-art methods by a large margin. Our data and code will be made publicly available upon acceptance.

[CV-28] Materialist: Physically Based Editing Using Single-Image Inverse Rendering

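[Summary]: This paper addresses image editing based on single-view inverse physically based rendering. Inverse rendering methods built on the rendering equation usually need multiple views; this work combines a learning-based approach with progressive differentiable rendering so that a single image is enough for high-quality editing. The method first uses neural networks to predict initial material properties, then uses progressive differentiable rendering to optimize the environment map and refine the material properties so that the rendered result closely matches the input image. Compared with single-view methods that rely on neural renderers, it achieves more realistic light-material interactions, accurate shadows, and global illumination. It also supports physically based material editing, object insertion, and relighting, and proposes a material transparency editing method that does not require full scene geometry. Compared with methods based on Stable Diffusion, it offers stronger interpretability and more realistic light refraction.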

链接: https://arxiv.org/abs/2501.03717
作者: Lezhong Wang,Duc Minh Tran,Ruiqi Cui,Thomson TG,Manmohan Chandraker,Jeppe Revall Frisvad
机构: Technical University of Denmark(丹麦技术大学); University of California, San Diego(加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: code will be available at this http URL

点击查看摘要

Abstract:To perform image editing based on single-view, inverse physically based rendering, we present a method combining a learning-based approach with progressive differentiable rendering. Given an image, our method leverages neural networks to predict initial material properties. Progressive differentiable rendering is then used to optimize the environment map and refine the material properties with the goal of closely matching the rendered result to the input image. We require only a single image while other inverse rendering methods based on the rendering equation require multiple views. In comparison to single-view methods that rely on neural renderers, our approach achieves more realistic light-material interactions, accurate shadows, and global illumination. Furthermore, with optimized material properties and illumination, our method enables a variety of tasks, including physically based material editing, object insertion, and relighting. We also propose a method for material transparency editing that operates effectively without requiring full scene geometry. Compared with methods based on Stable Diffusion, our approach offers stronger interpretability and more realistic light refraction based on empirical results.

[CV-29] MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

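[Quick read]: This paper addresses the storage demands and complex-motion representation challenges of 3D Gaussian Splatting (3DGS) in dynamic scenes. Existing methods deliver strong rendering quality and speed but suffer from high storage requirements and struggle to represent complex real-world motion. The paper proposes the MoDec-GS framework, which captures dynamic motion through Global-to-Local Motion Decomposition (GLMD). The key components are: 1) Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), which extend the static Scaffold representation to dynamic video reconstruction; 2) Global Anchor Deformation (GAD), which efficiently represents the global dynamics of complex motion; 3) Local Gaussian Deformation (LGD), which finely adjusts local motion; and 4) Temporal Interval Adjustment (TIA), which automatically controls the temporal coverage of each Local CS during training to optimize the assignment of temporal segments. Experiments show that MoDec-GS reduces model size by 70% on average compared with existing dynamic 3D Gaussian methods while maintaining or even improving rendering quality.

Link: https://arxiv.org/abs/2501.03714
Authors: Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim
Affiliations: Electronics and Telecommunications Research Institute; Korea Advanced Institute of Science and Technology; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The last two authors are co-corresponding authors. Please visit our project page at this https URL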

Abstract:3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.

[CV-30] AuxDepthNet: Real-Time Monocular 3D Object Detection with Depth-Sensitive Features

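[Quick read]: This paper addresses the challenge that monocular 3D object detection faces because single-view images lack explicit depth information. Existing methods usually depend on external depth estimators or expensive sensors, which increases computational complexity and hinders real-time performance. The paper proposes the AuxDepthNet framework, whose key innovation is two core modules: the Auxiliary Depth Feature (ADF) module and the Depth Position Mapping (DPM) module. ADF implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, while DPM embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. In addition, AuxDepthNet uses the DepthFusion Transformer architecture to globally integrate visual and depth-sensitive features through depth-guided interactions, which keeps detection robust and efficient. Experiments show that AuxDepthNet achieves state-of-the-art performance on the KITTI dataset.

Link: https://arxiv.org/abs/2501.03700
Authors: Ruochen Zhang, Hyeung-Sik Choi, Dongwook Jung, Phan Huy Nam Anh, Sang-Ki Jeong, Zihao Zhu
Affiliations: Department of Mechanical Engineering, National Korea Maritime and Ocean University, Busan 49112, Republic of Korea; Maritime ICT and Mobility Research Department, Korea Institute of Ocean Science and Technology, Busan 49111, Republic of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: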

Abstract:Monocular 3D object detection is a challenging task in autonomous systems due to the lack of explicit depth information in single-view images. Existing methods often depend on external depth estimators or expensive sensors, which increase computational complexity and hinder real-time performance. To overcome these limitations, we propose AuxDepthNet, an efficient framework for real-time monocular 3D object detection that eliminates the reliance on external depth maps or pre-trained depth models. AuxDepthNet introduces two key components: the Auxiliary Depth Feature (ADF) module, which implicitly learns depth-sensitive features to improve spatial reasoning and computational efficiency, and the Depth Position Mapping (DPM) module, which embeds depth positional information directly into the detection process to enable accurate object localization and 3D bounding box regression. Leveraging the DepthFusion Transformer architecture, AuxDepthNet globally integrates visual and depth-sensitive features through depth-guided interactions, ensuring robust and efficient detection. Extensive experiments on the KITTI dataset show that AuxDepthNet achieves state-of-the-art performance, with $\text{AP}_{3D}$ scores of 24.72% (Easy), 18.63% (Moderate), and 15.31% (Hard), and $\text{AP}_{\text{BEV}}$ scores of 34.11% (Easy), 25.18% (Moderate), and 21.90% (Hard) at an IoU threshold of 0.7.

[CV-31] Motion-Aware Generative Frame Interpolation

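[Quick read]: This paper addresses the fact that generative frame interpolation in complex scenes relies on the generative model to independently infer the correspondences between input frames, an ability that is inadequately developed during pre-training. It proposes a new framework, Motion-aware Generative frame interpolation (MoG), which significantly strengthens the model's motion awareness by introducing explicit motion guidance. The solution rests on two points: first, the intermediate flow from flow-based interpolation models is identified as effective motion guidance; second, the representations of the input frames are warped with this guidance to produce guidance-based representations of the intermediate frames, which are then integrated into the generative model at both the latent and feature levels. Experiments show that MoG significantly outperforms existing methods on both real-world and animation datasets, achieving higher video quality and fidelity.

Link: https://arxiv.org/abs/2501.03699
Authors: Guozhen Zhang, Yuhan Zhu, Yutao Cui, Xiaotong Zhao, Kai Ma, Limin Wang
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Platform and Content Group (PCG), Tencent; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: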

Abstract:Generative frame interpolation, empowered by large-scale pre-trained video generation models, has demonstrated remarkable advantages in complex scenes. However, existing methods heavily rely on the generative model to independently infer the correspondences between input frames, an ability that is inadequately developed during pre-training. In this work, we propose a novel framework, termed Motion-aware Generative frame interpolation (MoG), to significantly enhance the model’s motion awareness by integrating explicit motion guidance. Specifically, we investigate two key questions: what can serve as effective motion guidance, and how we can seamlessly embed this guidance into the generative model. For the first question, we reveal that the intermediate flow from flow-based interpolation models could efficiently provide task-oriented motion guidance. Regarding the second, we first obtain guidance-based representations of intermediate frames by warping input frames’ representations using guidance, and then integrate them into the model at both latent and feature levels. To demonstrate the versatility of our method, we train MoG on both real-world and animation datasets. Comprehensive evaluations show that our MoG significantly outperforms the existing methods in both domains, achieving superior video quality and improved fidelity.

[CV-32] SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

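[Quick read]: This paper addresses the fact that multi-image reasoning tasks remain under-explored in the open-source community, which stems from two challenges: building datasets with multiple correlated images and complex reasoning instructions is resource-intensive and quality is hard to maintain, and robust evaluation benchmarks for multi-image tasks are lacking. The paper proposes SMIR (Synthetic Multi-Image Reasoning), comprising an efficient synthetic data-generation pipeline for multi-image reasoning and a high-quality dataset. The pipeline extracts highly correlated images using multimodal embeddings that combine visual and descriptive information, and uses open-source large language models (LLMs) to generate high-quality instructions. It was used to generate 160K synthetic training samples, offering a cost-effective alternative. The paper also presents SMIR-BENCH, a multi-image reasoning benchmark of 200 diverse examples across 7 complex tasks, which is multi-turn and uses a vision-language model (VLM) judge to evaluate free-form responses, assessing model expressiveness and reasoning capability across modalities. Experiments show that models fine-tuned on the SMIR dataset outperform baseline models on multi-image reasoning tasks by up to 8%.

Link: https://arxiv.org/abs/2501.03675
Authors: Andrew Li, Rahul Thapa, Rahul Chalamala, Qingyang Wu, Kezhen Chen, James Zou
Affiliations: Together AI; University of California, Berkeley; Stanford University; Caltech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: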

Abstract:Vision-Language Models (VLMs) have shown strong performance in understanding single images, aided by numerous high-quality instruction datasets. However, multi-image reasoning tasks are still under-explored in the open-source community due to two main challenges: (1) scaling datasets with multiple correlated images and complex reasoning instructions is resource-intensive and maintaining quality is difficult, and (2) there is a lack of robust evaluation benchmarks for multi-image tasks. To address these issues, we introduce SMIR, an efficient synthetic data-generation pipeline for multi-image reasoning, and a high-quality dataset generated using this pipeline. Our pipeline efficiently extracts highly correlated images using multimodal embeddings, combining visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this pipeline, we generated 160K synthetic training samples, offering a cost-effective alternative to expensive closed-source solutions. Additionally, we present SMIR-BENCH, a novel multi-image reasoning evaluation benchmark comprising 200 diverse examples across 7 complex multi-image reasoning tasks. SMIR-BENCH is multi-turn and utilizes a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of the SMIR dataset by fine-tuning several open-source VLMs and evaluating their performance on SMIR-BENCH. Our results show that models trained on our dataset outperform baseline models in multi-image reasoning tasks by up to 8%, with a much more scalable data pipeline.
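
The abstract describes grouping highly correlated images through multimodal embeddings that combine visual and descriptive information. A minimal sketch of that idea follows. It assumes an equal-weight average of the two embeddings and cosine similarity as the correlation measure; the abstract specifies neither, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combine(visual, caption, weight=0.5):
    """Blend a visual and a caption embedding (assumed weighted average)."""
    return [weight * v + (1.0 - weight) * c for v, c in zip(visual, caption)]

def most_correlated(query_name, embeddings, k=1):
    """Return the k image names whose embeddings are closest to the query image."""
    query = embeddings[query_name]
    others = [name for name in embeddings if name != query_name]
    others.sort(key=lambda name: cosine_similarity(query, embeddings[name]), reverse=True)
    return others[:k]

# Toy two-dimensional embeddings, made up for illustration only.
embeddings = {
    "cat_photo": combine([1.0, 0.0], [0.8, 0.2]),
    "cat_sketch": combine([0.9, 0.1], [0.7, 0.3]),
    "car_photo": combine([0.0, 1.0], [0.2, 0.8]),
}
group = most_correlated("cat_photo", embeddings, k=1)
```

Here `group` holds the single image most similar to `cat_photo`; a real pipeline would obtain the vectors from an image encoder and a text encoder rather than from hand-written lists.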

[CV-33] Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression

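[Quick read]: This paper addresses two main challenges in Action Quality Assessment (AQA): first, athletes move rapidly and the resulting changes in visual appearance are subtle, which makes fine-grained pose differences hard to capture; second, existing methods usually segment the video into fixed frames, which disrupts the temporal continuity of sub-actions and causes prediction errors. The paper proposes a hierarchically pose-guided multi-stage contrastive regression method. The key components are: 1) a multi-scale dynamic visual-skeleton encoder that captures fine-grained spatio-temporal visual and skeletal features; 2) a procedure segmentation network that separates different sub-actions and obtains segmented features; 3) a multi-modal fusion module that takes the segmented visual and skeletal features as physics structural priors to guide the model in learning refined activity similarities and variances; and 4) a multi-stage contrastive learning regression approach that learns discriminative representations and outputs the prediction. The paper also introduces a newly annotated FineDiving-Pose dataset to improve on the currently low-quality human pose labels. Experiments on the FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of the method.

Link: https://arxiv.org/abs/2501.03674
Authors: Mengshi Qi, Hao Ye, Jiaxuan Peng, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: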

Abstract:Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each with a different duration. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions, resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at this https URL.

[CV-34] Local Compositional Complexity: How to Detect a Human-readable Message

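[Quick read]: This paper addresses the lack of a rigorous and computable definition of data complexity in the natural sciences and related areas. It proposes a framework for measuring complexity based on the structure of the data: the shortest description of the data is divided into a structured portion and an unstructured portion, and the size of the structured portion is taken as the complexity score. By proposing local compositionality as the specific structure, the paper then derives a more precise and computable definition geared towards human communication. Experiments show that the method can distinguish meaningful signals from noise or repetitive signals, and that it could potentially help determine whether an extraterrestrial signal contains a message.

Link: https://arxiv.org/abs/2501.03664
Authors: Louis Mahon
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: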

Abstract:Data complexity is an important concept in the natural sciences and related areas, but lacks a rigorous and computable definition. In this paper, we focus on a particular sense of complexity that is high if the data is structured in a way that could serve to communicate a message. In this sense, human speech, written language, drawings, diagrams and photographs are high complexity, whereas data that is close to uniform throughout or populated by random values is low complexity. We describe a general framework for measuring data complexity based on dividing the shortest description of the data into a structured and an unstructured portion, and taking the size of the former as the complexity score. We outline an application of this framework in statistical mechanics that may allow a more objective characterisation of the macrostate and entropy of a physical system. Then, we derive a more precise and computable definition geared towards human communication, by proposing local compositionality as an appropriate specific structure. We demonstrate experimentally that this method can distinguish meaningful signals from noise or repetitive signals in auditory, visual and text domains, and could potentially help determine whether an extra-terrestrial signal contained a message.

[CV-35] DehazeGS: Seeing Through Fog with 3D Gaussian Splatting

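[Quick read]: This paper addresses the degradation of image quality caused by scattering and attenuation in foggy scenes, which in turn lowers the reconstruction and rendering quality of novel view synthesis. Existing dehazing reconstruction algorithms based on neural radiance fields (NeRF) can handle this problem, but their reliance on deep fully connected neural networks and per-ray sampling leads to high computational cost, and they struggle to recover fine details of foggy scenes. The proposed solution, DehazeGS, builds on 3D Gaussian Splatting, which explicitly models point clouds as 3D Gaussians, and explains the formation of foggy images through a physically accurate forward rendering process. The method decomposes and renders a fog-free background from multi-view foggy images, jointly learns the atmospheric light and the scattering coefficient during training, and at inference removes the effects of scattering and attenuation on the Gaussians before projecting them directly onto a 2D plane to obtain a clear view. Experiments show that DehazeGS achieves state-of-the-art rendering quality and computational efficiency.

Link: https://arxiv.org/abs/2501.03659
Authors: Jinze Yu, Yiqun Wang, Zhengda Lu, Jianwei Guo, Yong Li, Hongxing Qin, Xiaopeng Zhang
Affiliations: College of Computer Science, Chongqing University; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Artificial Intelligence, Beijing Normal University; MAIS, Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures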

Abstract:Current novel view synthesis tasks primarily rely on high-quality and clear images. However, in foggy scenes, scattering and attenuation can significantly degrade the reconstruction and rendering quality. Although NeRF-based dehazing reconstruction algorithms have been developed, their use of deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Moreover, NeRF’s implicit representation struggles to recover fine details from hazy scenes. In contrast, recent advancements in 3D Gaussian Splatting achieve high-quality 3D scene reconstruction by explicitly modeling point clouds into 3D Gaussians. In this paper, we propose leveraging the explicit Gaussian representation to explain the foggy image formation process through a physically accurate forward rendering process. We introduce DehazeGS, a method capable of decomposing and rendering a fog-free background from participating media using only multi-view foggy images as input. We model the transmission within each Gaussian distribution to simulate the formation of fog. During this process, we jointly learn the atmospheric light and scattering coefficient while optimizing the Gaussian representation of the hazy scene. In the inference stage, we eliminate the effects of scattering and attenuation on the Gaussians and directly project them onto a 2D plane to obtain a clear view. Experiments on both synthetic and real-world foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance in terms of both rendering quality and computational efficiency.
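
The transmission, atmospheric light, and scattering coefficient named in the abstract are the quantities of the standard atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d). The sketch below applies that model to a single pixel intensity. It assumes this standard model and a known depth; it is not the authors' per-Gaussian formulation, and the function names and numbers are hypothetical.

```python
import math

def transmission(beta, depth):
    """Fraction of scene radiance that survives the fog: t = exp(-beta * depth)."""
    return math.exp(-beta * depth)

def add_fog(clear, depth, beta, airlight):
    """Forward model: hazy = clear * t + airlight * (1 - t)."""
    t = transmission(beta, depth)
    return clear * t + airlight * (1.0 - t)

def remove_fog(hazy, depth, beta, airlight, t_min=1e-3):
    """Invert the forward model; t is clamped to avoid division by a near-zero value."""
    t = max(transmission(beta, depth), t_min)
    return (hazy - airlight * (1.0 - t)) / t

# Illustrative values only: a dark pixel seen through 5 units of fog.
hazy = add_fog(clear=0.2, depth=5.0, beta=0.3, airlight=0.9)
restored = remove_fog(hazy, depth=5.0, beta=0.3, airlight=0.9)
```

In this toy setting `beta` and `airlight` are given; the abstract states that DehazeGS learns both jointly with the scene representation.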

[CV-36] Advancing the Understanding of Fine-Grained 3D Forest Structures using Digital Cousins and Simulation-to-Reality: Methods and Datasets

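[Quick read]: This paper addresses the lack of large-scale annotated datasets in forest resource monitoring and ecosystem research, which has limited the widespread use of advanced intelligent techniques in the field. It proposes a fully automated synthetic data generation and processing framework based on the concepts of Digital Cousins and Simulation-to-Reality (Sim2Real). The framework is versatile and scalable to any size and platform. With it, the authors created the Boreal3D dataset, described as the world's largest forest point cloud dataset: 1000 highly realistic and structurally diverse forest plots across four different platforms, totaling 48,403 trees and over 35.3 billion points. Each point is labeled with semantic, instance, and viewpoint information, and each tree is described by structural parameters such as diameter, crown width, leaf area, and total volume. Experiments show that, with certain strategies, models pre-trained on the synthetic data perform significantly better when applied to real forest datasets; in particular, fine-tuning with only 20% of the real-world data yields performance comparable to models trained entirely on real data. The framework and dataset provide an important resource for research on large-scale 3D forest scene understanding and structural parameter estimation.

Link: https://arxiv.org/abs/2501.03637
Authors: Jing Liu, Duanchu Wang, Haoran Gong, Chongyu Wang, Jihua Zhu, Di Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: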

Abstract:Understanding and analyzing the spatial semantics and structure of forests is essential for accurate forest resource monitoring and ecosystem research. However, the lack of large-scale and annotated datasets has limited the widespread use of advanced intelligent techniques in this field. To address this challenge, a fully automated synthetic data generation and processing framework based on the concepts of Digital Cousins and Simulation-to-Reality (Sim2Real) is proposed, offering versatility and scalability to any size and platform. Using this process, we created Boreal3D, the world’s largest forest point cloud dataset. It includes 1000 highly realistic and structurally diverse forest plots across four different platforms, totaling 48,403 trees and over 35.3 billion points. Each point is labeled with semantic, instance, and viewpoint information, while each tree is described with structural parameters such as diameter, crown width, leaf area, and total volume. We designed and conducted extensive experiments to evaluate the potential of Boreal3D in advancing fine-grained 3D forest structure analysis in real-world applications. The results demonstrate that with certain strategies, models pre-trained on synthetic data can significantly improve performance when applied to real forest datasets. In particular, the findings reveal that fine-tuning with only 20% of real-world data enables the model to achieve performance comparable to models trained exclusively on entire real-world data, highlighting the value and potential of our proposed framework. The Boreal3D dataset, and more broadly, the synthetic data augmentation framework, is poised to become a critical resource for advancing research in large-scale 3D forest scene understanding and structural parameter estimation.

[CV-37] Exploring Optimal Latent Trajectory for Zero-shot Image Editing

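[Quick read]: This paper addresses two core demands of text-driven image editing: editability and fidelity. Current cutting-edge methods usually follow an "inversion-then-editing" pipeline, in which the source image is inverted to an approximately Gaussian noise z_T and sampling is then performed using the target prompt. The authors argue that near-Gaussian noise is a poor pivot for editing because it has lost almost all structural fidelity. A pilot experiment shows that some intermediate-inverted latents achieve a better trade-off between editability and fidelity. Based on this, the authors propose a new editing paradigm, ZZEdit, which gently strengthens target guidance on a latent that is sufficient for editing while still preserving structure. Specifically, ZZEdit locates the editing pivot by searching for the first point on the inversion trajectory whose response to the target prompt is larger than its response to the source prompt. It then applies a ZigZag process that performs mild target guidance at this pivot, alternating denoising and inversion iteratively to approach the target while holding fidelity. Finally, to keep the number of inversion and denoising steps equal, a pure sampling process is run under the target prompt. Experiments show that ZZEdit has clear advantages over the "inversion-then-editing" pipeline in diverse image editing scenarios.

Link: https://arxiv.org/abs/2501.03631
Authors: Maomao Li, Yu Li, Yunfei Liu, Dong Xu
Affiliations: The University of Hong Kong; International Digital Economy Academy (IDEA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages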

Abstract:Editability and fidelity are two essential demands for text-driven image editing, which expects that the editing area should align with the target prompt and the rest should remain unchanged, respectively. The current cutting-edge editing methods usually obey an “inversion-then-editing” pipeline, where the source image is first inverted to an approximate Gaussian noise z_T, based on which a sampling process is conducted using the target prompt. Nevertheless, we argue that it is not a good choice to use a near-Gaussian noise as a pivot for further editing since it has lost almost all structural fidelity. We verify this by a pilot experiment, discovering that some intermediate-inverted latents can achieve a better trade-off between editability and fidelity than the fully-inverted z_T. Based on this, we propose a novel editing paradigm dubbed ZZEdit, which gently strengthens the target guidance on a sufficient-for-editing while structure-preserving latent. Specifically, we locate such an editing pivot by searching the first point on the inversion trajectory which has larger response levels toward the target prompt than the source one. Then, we propose a ZigZag process to perform mild target guiding on this pivot, which fulfills denoising and inversion iteratively, approaching the target while still holding fidelity. Afterwards, to achieve the same number of inversion and denoising steps, we perform a pure sampling process under the target prompt. Extensive experiments highlight the effectiveness of our ZZEdit in diverse image editing scenarios compared with the “inversion-then-editing” pipeline.
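
The pivot search described in the abstract (take the first point on the inversion trajectory whose response to the target prompt exceeds its response to the source prompt) can be sketched as a linear scan. The abstract does not say how the response levels are measured, so the two score lists below are hypothetical inputs, and the function name is illustrative.

```python
def find_editing_pivot(source_responses, target_responses):
    """Return the index of the first inversion step whose target-prompt response
    exceeds its source-prompt response, or None if no step qualifies."""
    for step, (src, tgt) in enumerate(zip(source_responses, target_responses)):
        if tgt > src:
            return step
    return None

# Made-up response levels along an inversion trajectory of four steps.
source = [0.9, 0.7, 0.5, 0.3]
target = [0.1, 0.4, 0.6, 0.8]
pivot = find_editing_pivot(source, target)
```

With these illustrative scores the scan stops at the third step (index 2), the first one where the target response is the larger of the two.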

[CV-38] MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer

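[Quick read]: This paper addresses the high training cost and long inference time of virtual try-on methods based on diffusion models. Existing methods usually need an extra reference network or an additional image encoder to process multiple conditional image inputs, which makes training expensive, and they require more than 25 inference steps, which makes inference slow. The proposed MC-VTON uses the intrinsic backbone of the diffusion transformer (DiT) to simplify the network and its input conditions: it needs only the masked person image and the garment image, with no extra reference network or image encoder. In addition, MC-VTON applies distillation diffusion to cut inference to 8 steps, significantly reducing inference time and parameter overhead. Experiments show that MC-VTON outperforms existing baselines in detail fidelity, network simplicity, parameter efficiency, and number of inference steps.

Link: https://arxiv.org/abs/2501.03630
Authors: Junsheng Luan, Guangyuan Li, Lei Zhao, Wei Xing
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: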

Abstract:Virtual try-on methods based on diffusion models achieve realistic try-on effects. They use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, bringing a long inference time. In this work, with the development of diffusion transformer (DiT), we rethink the necessity of reference network or image encoder, then propose MC-VTON, enabling DiT to integrate minimal conditional try-on inputs by utilizing its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder. We also remove unnecessary conditions like the long prompt, pose estimation, human parsing, and depth map. We require only the masked person image and the garment image. (3) Parameter-efficient training. To process the try-on task, we fine-tune the FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply distillation diffusion on MC-VTON and only need 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
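
The two parameter figures in the abstract can be cross-checked against each other: dividing each count of additional parameters by its stated share of the backbone should give roughly the same backbone size. The short calculation below does only that arithmetic, using the numbers quoted in the abstract; the function name is illustrative.

```python
def implied_backbone_params(extra_params_millions, share_percent):
    """Backbone size (in millions) implied by 'extra parameters = share of backbone'."""
    return extra_params_millions / (share_percent / 100.0)

# Figures quoted in the abstract.
base = implied_backbone_params(39.7, 0.33)       # fine-tuning setup
distilled = implied_backbone_params(86.8, 0.72)  # 8-step distilled setup
relative_gap = abs(base - distilled) / base
```

Both ratios land at about 12 billion parameters and differ by well under 1%, so the two percentages are mutually consistent up to rounding.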

[CV-39] CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion for Improved Segmentation of Low Quality Medical Images

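[Quick read]: This paper addresses the tendency of existing hybrid CNN-Transformer models for low-quality medical image segmentation to focus on spatial features when integrating convolutional neural networks (CNNs) and Transformers while overlooking channel features. Effective channel feature extraction is essential for capturing contextual information and improving representation capability. The authors propose a hybrid CNN-Transformer model, CFFormer, with two key modules: the Cross Feature Channel Attention (CFCA) module and the X-Spatial Feature Fusion (XFF) module. CFCA filters and facilitates interactions between the channel features from the CNN encoder and the Transformer encoder, while XFF reduces the significant semantic differences in spatial features to enable smooth and cohesive spatial feature fusion. Experiments on eight datasets covering five modalities show that the model performs especially well on datasets with blurry boundaries and low contrast and outperforms current state-of-the-art methods.

Link: https://arxiv.org/abs/2501.03629
Authors: Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Daokun Zhang, Ruili Wang, Rong Qu, Guoping Qiu
Affiliations: School of Computer Science, University of Nottingham Ningbo China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Mathematical and Computational Sciences, Massey University; University of Nottingham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The article consists of 15 pages, including 10 figures and 7 tables. The code will be made open-source once the article is accepted by the journal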

Abstract:Hybrid CNN-Transformer models are designed to combine the advantages of Convolutional Neural Networks (CNNs) and Transformers to efficiently model both local information and long-range dependencies. However, most research tends to focus on integrating the spatial features of CNNs and Transformers, while overlooking the critical importance of channel features. This is particularly significant for model performance in low-quality medical image segmentation. Effective channel feature extraction can significantly enhance the model’s ability to capture contextual information and improve its representation capabilities. To address this issue, we propose a hybrid CNN-Transformer model, CFFormer, and introduce two modules: the Cross Feature Channel Attention (CFCA) module and the X-Spatial Feature Fusion (XFF) module. The model incorporates dual encoders, with the CNN encoder focusing on capturing local features and the Transformer encoder modeling global features. The CFCA module filters and facilitates interactions between the channel features from the two encoders, while the XFF module effectively reduces the significant semantic information differences in spatial features, enabling a smooth and cohesive spatial feature fusion. We evaluate our model across eight datasets covering five modalities to test its generalization capability. Experimental results demonstrate that our model outperforms current state-of-the-art (SOTA) methods, with particularly superior performance on datasets characterized by blurry boundaries and low contrast.

[CV-40] Deep Learning-based Compression Detection for explainable Face Image Quality Assessment ICPR

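[Quick read]: This paper addresses face image quality assessment, specifically the negative effect on face recognition performance of the compression artefacts introduced by the JPEG and JPEG 2000 algorithms. To provide explainable and actionable feedback, it proposes a deep-neural-network approach for detecting these artefacts. The key steps are: first, artefact-free face images are compressed with JPEG and JPEG 2000, and the PSNR (peak signal-to-noise ratio) and SSIM (structural similarity) metrics are used to generate training labels; second, a single neural network is trained to detect JPEG and JPEG 2000 artefacts, respectively. Experiments show that networks trained with PSNR labels reach detection error rates of 2-3%. Moreover, discarding face images with severe compression artefacts significantly reduces the error rates of open-source and commercial face recognition systems. To minimize resource consumption, the method is based on the EfficientNetV2 architecture and is integrated into the OFIQ software.

Link: https://arxiv.org/abs/2501.03619
Authors: Laurin Jonientz, Johannes Merkle, Christian Rathgeb, Benjamin Tams, Georg Merz
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2nd Workshop on Fairness in Biometric Systems (FAIRBIO) at International Conference on Pattern Recognition (ICPR) 2024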

Abstract:The assessment of face image quality is crucial to ensure reliable face recognition. In order to provide data subjects and operators with explainable and actionable feedback regarding captured face images, relevant quality components have to be measured. Quality components that are known to negatively impact the utility of face images include JPEG and JPEG 2000 compression artefacts, among others. Compression can result in a loss of important image details which may impair the recognition performance. In this work, deep neural networks are trained to detect the compression artefacts in face images. For this purpose, artefact-free facial images are compressed with the JPEG and JPEG 2000 compression algorithms. Subsequently, the PSNR and SSIM metrics are employed to obtain training labels based on which neural networks are trained using a single network to detect JPEG and JPEG 2000 artefacts, respectively. The evaluation of the proposed method shows promising results: in terms of detection accuracy, error rates of 2-3% are obtained when utilizing PSNR labels during training. In addition, we show that error rates of different open-source and commercial face recognition systems can be significantly reduced by discarding face images exhibiting severe compression artefacts. To minimize resource consumption, EfficientNetV2 serves as basis for the presented algorithm, which is available as part of the OFIQ software.
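
The abstract says training labels are obtained from PSNR and SSIM between the artefact-free image and its compressed version. The sketch below computes PSNR, defined as 10·log10(MAX² / MSE), over flat pixel lists and turns it into a binary label. The thresholding step and the 30 dB cut-off are assumptions made for illustration, since the abstract does not describe how the metric values are mapped to labels.

```python
import math

def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(degraded):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def compression_label(reference, degraded, threshold_db=30.0):
    """Return 1 if PSNR falls below the (hypothetical) threshold, else 0."""
    return 1 if psnr(reference, degraded) < threshold_db else 0

# Toy four-pixel "images", made up for illustration.
reference = [100, 120, 140, 160]
mild = [101, 119, 141, 159]    # every pixel off by 1
heavy = [80, 140, 120, 180]    # every pixel off by 20
labels = [compression_label(reference, mild), compression_label(reference, heavy)]
```

The mildly distorted copy stays above the threshold and is labelled 0, while the heavily distorted copy falls below it and is labelled 1.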

[CV-41] BTMTrack: Robust RGB-T Tracking via Dual-template Bridging and Temporal-Modal Candidate Elimination

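[Quick read]: This paper addresses the shortcomings of existing RGB-T (RGB and thermal infrared) tracking methods in effectively integrating temporal information and performing efficient cross-modal interaction, which limit their adaptability to dynamic targets, particularly in challenging scenarios such as low illumination and adverse weather. The proposed BTMTrack framework is built around a dual-template backbone network and the Temporal-Modal Candidate Elimination (TMCE) strategy. The dual-template backbone integrates temporal information, while TMCE evaluates temporal and modal correlations to focus the model on target-relevant tokens, reducing computational overhead and avoiding irrelevant background noise. The paper also proposes the Temporal Dual Template Bridging (TDTB) module, which enables precise cross-modal fusion through dynamically filtered tokens and further strengthens the interaction between the templates and the search region. Experiments show that the method achieves state-of-the-art performance on the LasHeR test set and competitive results on the RGBT210 and RGBT234 datasets.

Link: https://arxiv.org/abs/2501.03616
Authors: Zhongxuan Zhang, Bi Zeng, Xinyu Ni, Yimin Du
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: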

Abstract:RGB-T tracking leverages the complementary strengths of RGB and thermal infrared (TIR) modalities to address challenging scenarios such as low illumination and adverse weather. However, existing methods often fail to effectively integrate temporal information and perform efficient cross-modal interactions, which constrain their adaptability to dynamic targets. In this paper, we propose BTMTrack, a novel framework for RGB-T tracking. The core of our approach lies in the dual-template backbone network and the Temporal-Modal Candidate Elimination (TMCE) strategy. The dual-template backbone effectively integrates temporal information, while the TMCE strategy focuses the model on target-relevant tokens by evaluating temporal and modal correlations, reducing computational overhead and avoiding irrelevant background noise. Building upon this foundation, we propose the Temporal Dual Template Bridging (TDTB) module, which facilitates precise cross-modal fusion through dynamically filtered tokens. This approach further strengthens the interaction between templates and the search region. Extensive experiments conducted on three benchmark datasets demonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art performance, with a 72.3% precision rate on the LasHeR test set and competitive results on RGBT210 and RGBT234 datasets.
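
The candidate elimination idea in the abstract (score search-region tokens by temporal and modal correlation, then keep only the target-relevant ones) can be illustrated with a simple top-k filter. This is a sketch under assumptions: summing the two scores and using a fixed keep ratio are illustrative choices, not details given in the abstract, and the function name and scores are hypothetical.

```python
def eliminate_candidates(temporal_scores, modal_scores, keep_ratio=0.5):
    """Return the sorted indices of the tokens to keep.

    Each token gets a combined score (assumed here to be the plain sum of its
    temporal and modal correlation); the lowest-scoring tokens are dropped.
    """
    combined = [t + m for t, m in zip(temporal_scores, modal_scores)]
    keep = max(1, int(len(combined) * keep_ratio))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])

# Made-up correlation scores for four search-region tokens.
kept = eliminate_candidates([0.9, 0.1, 0.4, 0.8], [0.7, 0.2, 0.1, 0.9], keep_ratio=0.5)
```

With these scores, tokens 0 and 3 survive and the two weakest tokens are removed, which is where the reduction in computation comes from.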

[CV-42] VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation

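[Quick read]: This paper tackles bimanual dexterous manipulation, which remains hard because of the high degrees of freedom (DoFs) of each hand and the need to coordinate them. Existing single-hand techniques often use human demonstrations to guide reinforcement learning (RL), but they fail to generalize to complex bimanual tasks involving multiple sub-skills. The proposed VTAO-BiManip framework combines visual-tactile-action pretraining with object understanding to facilitate curriculum RL for human-like bimanual manipulation. The key innovation is the incorporation of hand motion data, which guides dual-hand coordination more effectively than binary tactile feedback. The pretraining model predicts future actions as well as object pose and size from masked multimodal inputs, which facilitates cross-modal regularization. A two-stage curriculum RL approach stabilizes training. Evaluated on a bottle-cap unscrewing task in both simulated and real-world environments, the method achieves a success rate more than 20% higher than existing visual-tactile pretraining methods.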

Link: https://arxiv.org/abs/2501.03606
Authors: Zhengnan Sun,Zhaotai Shi,Jiayin Chen,Qingtao Liu,Yu Cui,Qi Ye,Jiming Chen
Affiliations: College of Control Science and Engineering, Zhejiang University; State Key Laboratory of Industrial Control Technology, Zhejiang University; Key Lab of CS&AUS of Zhejiang Province
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Bimanual dexterous manipulation remains a significant challenge in robotics due to the high DoFs of each hand and their coordination. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we introduce VTAO-BiManip, a novel framework that combines visual-tactile-action pretraining with object understanding to facilitate curriculum RL and enable human-like bimanual manipulation. We improve prior learning by incorporating hand motion data, providing more effective guidance for dual-hand coordination than binary tactile feedback. Our pretraining model predicts future actions as well as object pose and size using masked multimodal inputs, facilitating cross-modal regularization. To address the multi-skill learning challenge, we introduce a two-stage curriculum RL approach to stabilize training. We evaluate our method on a bottle-cap unscrewing task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pretraining methods by over 20%.

[CV-43] ConcealGS: Concealing Invisible Copyright Information in 3D Gaussian Splatting

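[Quick read]: This paper addresses the technical challenge of embedding implicit information in the 3D Gaussian Splatting (3D-GS) format. Traditional visual data (such as images and videos) and NeRF-based formats already have mature copyright protection techniques, but steganographic techniques for the emerging 3D-GS format have yet to be fully explored. The authors propose ConcealGS, which introduces a knowledge distillation and gradient optimization strategy based on 3D-GS to overcome the limitations of NeRF-based models, improving both the robustness of the implicit information and the quality of 3D reconstruction. Experiments show that ConcealGS successfully recovers the implicit information while having almost no impact on rendering quality, offering a new approach for embedding invisible and recoverable information into 3D models.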

Link: https://arxiv.org/abs/2501.03605
Authors: Yifeng Yang,Hengyu Liu,Chenxin Li,Yining Sun,Wuyang Li,Yifan Liu,Yiyang Lin,Yixuan Yuan,Nanyang Ye
Affiliations: Shanghai Jiao Tong University; The Chinese University of Hong Kong; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:

Abstract:With the rapid development of 3D reconstruction technology, the widespread distribution of 3D data has become a future trend. While traditional visual data (such as images and videos) and NeRF-based formats already have mature techniques for copyright protection, steganographic techniques for the emerging 3D Gaussian Splatting (3D-GS) format have yet to be fully explored. To address this, we propose ConcealGS, an innovative method for embedding implicit information into 3D-GS. By introducing the knowledge distillation and gradient optimization strategy based on 3D-GS, ConcealGS overcomes the limitations of NeRF-based models and enhances the robustness of implicit information and the quality of 3D reconstruction. We evaluate ConcealGS in various potential application scenarios, and experimental results have demonstrated that ConcealGS not only successfully recovers implicit information but also has almost no impact on rendering quality, providing a new approach for embedding invisible and recoverable information into 3D models in the future.

[CV-44] BASIC: Semi-supervised Multi-organ Segmentation with Balanced Subclass Regularization and Semantic-conflict Penalty

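[Quick read]: This paper addresses the class-imbalance problem in multi-organ segmentation (MoS) caused by large variations in organ size, a problem that is especially pronounced in the semi-supervised learning (SSL) setting. It proposes a semi-supervised network called BASIC (BAlanced Subclass regularIzation and semantic-Conflict penalty mechanism). The solution has two main parts. First, an auxiliary subclass segmentation (SCS) task is built on balanced subclasses generated in advance, so that unbiased information is mined for the main MoS task through multi-task learning. Second, within a mean teacher framework, a balanced subclass regularization uses the teacher predictions of the SCS task to supervise the student predictions of the MoS task, transferring unbiased knowledge and alleviating the effect of class imbalance. In addition, a semantic-conflict penalty mechanism penalizes more heavily the SCS predictions that conflict with their parent classes, providing a more accurate constraint on the MoS predictions. Experiments on two public datasets (WORD and MICCAI FLARE 2022) show that BASIC outperforms other state-of-the-art methods.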

Link: https://arxiv.org/abs/2501.03580
Authors: Zhenghao Feng,Lu Wen,Yuanyuan Xu,Binyu Yan,Xi Wu,Jiliu Zhou,Yan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised learning (SSL) has shown notable potential in relieving the heavy demand of dense prediction tasks on large-scale well-annotated datasets, especially for the challenging multi-organ segmentation (MoS). However, the prevailing class-imbalance problem in MoS caused by the substantial variations in organ size exacerbates the learning difficulty of the SSL network. To address this issue, in this paper, we propose an innovative semi-supervised network with BAlanced Subclass regularIzation and semantic-Conflict penalty mechanism (BASIC) to effectively learn the unbiased knowledge for semi-supervised MoS. Concretely, we construct a novel auxiliary subclass segmentation (SCS) task based on priorly generated balanced subclasses, thus deeply excavating the unbiased information for the main MoS task with the fashion of multi-task learning. Additionally, based on a mean teacher framework, we elaborately design a balanced subclass regularization to utilize the teacher predictions of SCS task to supervise the student predictions of MoS task, thus effectively transferring unbiased knowledge to the MoS subnetwork and alleviating the influence of the class-imbalance problem. Considering the similar semantic information inside the subclasses and their corresponding original classes (i.e., parent classes), we devise a semantic-conflict penalty mechanism to give heavier punishments to the conflicting SCS predictions with wrong parent classes and provide a more accurate constraint to the MoS predictions. Extensive experiments conducted on two publicly available datasets, i.e., the WORD dataset and the MICCAI FLARE 2022 dataset, have verified the superior performance of our proposed BASIC compared to other state-of-the-art methods.

[CV-45] Cosmos World Foundation Model Platform for Physical AI

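[Quick read]: This paper addresses the problem of training Physical AI for real-world use, specifically how to build digitally both a policy model (a digital twin of the agent itself) and a world model (a digital twin of the world). The proposed solution is the Cosmos World Foundation Model Platform, which helps developers build customized world models for their Physical AI setups. The key point is that the platform provides a general-purpose world foundation model that can be fine-tuned into customized world models for downstream applications. The platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training the pre-trained models, and video tokenizers. The platform is open-source and the models are open-weight with permissive licenses, with the aim of helping Physical AI builders solve critical problems in society.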

Link: https://arxiv.org/abs/2501.03575
Authors: NVIDIA:Niket Agarwal,Arslan Ali,Maciej Bala,Yogesh Balaji,Erik Barker,Tiffany Cai,Prithvijit Chattopadhyay,Yongxin Chen,Yin Cui,Yifan Ding,Daniel Dworakowski,Jiaojiao Fan,Michele Fenzi,Francesco Ferroni,Sanja Fidler,Dieter Fox,Songwei Ge,Yunhao Ge,Jinwei Gu,Siddharth Gururani,Ethan He,Jiahui Huang,Jacob Huffman,Pooya Jannaty,Jingyi Jin,Seung Wook Kim,Gergely Klár,Grace Lam,Shiyi Lan,Laura Leal-Taixe,Anqi Li,Zhaoshuo Li,Chen-Hsuan Lin,Tsung-Yi Lin,Huan Ling,Ming-Yu Liu,Xian Liu,Alice Luo,Qianli Ma,Hanzi Mao,Kaichun Mo,Arsalan Mousavian,Seungjun Nah,Sriharsha Niverty,David Page,Despoina Paschalidou,Zeeshan Patel,Lindsey Pavao,Morteza Ramezanali,Fitsum Reda,Xiaowei Ren,Vasanth Rao Naik Sabavat,Ed Schmerling,Stella Shi,Bartosz Stefaniak,Shitao Tang,Lyne Tchapmi,Przemek Tredak,Wei-Cheng Tseng,Jibin Varghese,Hao Wang,Haoxiang Wang,Heng Wang,Ting-Chun Wang,Fangyin Wei,Xinyue Wei,Jay Zhangjie Wu,Jiashu Xu,Wei Yang,Lin Yen-Chen,Xiaohui Zeng,Yu Zeng,Jing Zhang,Qinsheng Zhang,Yuxuan Zhang,Qingqing Zhao,Artur Zolkowski
Affiliations: NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via this https URL.

[CV-46] Evaluating Image Caption via Cycle-consistent Text-to-Image Generation

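[Quick read]: This paper addresses the reliance of image captioning evaluation on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. Most existing reference-free metrics evaluate captions against images across modalities, but contrastive learning-based multi-modal systems generally exhibit a modality gap in their representations, which undermines the reliability of cross-modal metrics such as CLIPScore. To address this, the paper proposes CAMScore, a cyclic reference-free automatic evaluation metric. CAMScore uses a text-to-image model to generate images from captions and then evaluates the generated images against the original images, thereby circumventing the modality gap. CAMScore also includes a three-level evaluation framework covering pixel-level, semantic-level, and objective-level perspectives to provide finer-grained information for a more comprehensive evaluation. Experiments on multiple benchmark datasets show that CAMScore correlates better with human judgments than existing reference-based and reference-free metrics.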

Link: https://arxiv.org/abs/2501.03567
Authors: Tianyu Cui,Jinbin Bai,Guohua Wang,Qingguo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Ye Shi
Affiliations: ShanghaiTech University; AI Business, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that the modality gap generally exists in the representation of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent the aforementioned modality gap, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and objective-level perspectives. Extensive experiment results across multiple benchmark datasets show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework.

[CV-47] Bridged Semantic Alignment for Zero-shot 3D Medical Image Diagnosis

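[Quick read]: This paper addresses the dependence of supervised learning methods for automatic medical image diagnosis on large amounts of manual annotation, which is especially limiting for rare abnormality types. Existing Vision-Language Alignment (VLA) methods show promise for zero-shot learning, but after alignment their visual and textual embeddings still form two well-separated clusters, leaving a wide semantic gap. To close this gap, the paper proposes the Bridged Semantic Alignment (BrgSA) framework. First, a large language model summarizes medical reports to extract high-level semantic information. Second, a Cross-Modal Knowledge Interaction (CMKI) module uses a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the visual and textual modalities, narrowing the gap, and improving alignment. Experiments show that BrgSA achieves state-of-the-art performance on public benchmark datasets and on a custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.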

Link: https://arxiv.org/abs/2501.03565
Authors: Haoran Lai,Zihang Jiang,Qingsong Yao,Rongsheng Wang,Zhiyang He,Xiaodong Tao,Wei Wei,Weifu Lv,S.Kevin Zhou
Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China; Stanford University; Medical Business Department, iFlytek Co.Ltd; The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China; Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China; Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, USTC; State Key Laboratory of Precision and Intelligent Chemistry, University of Science and Technology of China; Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D medical images such as computed tomography (CT) are widely used in clinical practice, offering great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically discover that the visual and textual embeddings produced by existing VLA methods still form two well-separated clusters after alignment, presenting a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction (CMKI) module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities and also use two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performance on both public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.

[CV-48] PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

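[Quick read]: This paper addresses the misuse of text-to-image (T2I) models to generate not-safe-for-work (NSFW) content, which raises serious ethical concerns. It proposes PromptGuard, a content moderation technique inspired by the system prompt mechanism in large language models (LLMs). The key idea is to optimize a safety soft prompt that acts as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering inference efficiency or requiring proxy models. Experiments on three datasets show that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. It is 7.8 times faster than prior content moderation methods and reaches an optimal unsafe ratio as low as 5.84%.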

Link: https://arxiv.org/abs/2501.03544
Authors: Lingzhi Yuan,Xinfeng Li,Chejian Xu,Guanhong Tao,Xiaojun Jia,Yihao Huang,Wei Dong,Yang Liu,XiaoFeng Wang,Bo Li
Affiliations: University of Chicago; Nanyang Technological University; University of Illinois at Urbana–Champaign; The University of Utah; Indiana University Bloomington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 16 pages, 8 figures, 10 tables

Abstract:Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. Extensive experiments across three datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 7.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, with an optimal unsafe ratio as low as 5.84%.

[CV-49] Anomaly Triplet-Net: Progress Recognition Model Using Deep Metric Learning Considering Occlusion for Manual Assembly Work

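[Quick read]: This paper addresses progress recognition for product assembly in a factory, particularly in the presence of occlusion. The key to the solution is a progress recognition method based on deep metric learning. First, a deep learning-based object detection method detects the target assembly product in images from a fixed-point camera installed in the factory, and the detected region is cropped. A classification method based on deep metric learning is then applied to the cropped image to estimate the assembly progress as a rough progress step. To account for occlusion, the paper proposes a progress estimation model called Anomaly Triplet-Net, which adds anomaly samples to the Triplet Loss. In experiments, progress estimation with Anomaly Triplet-Net achieves an 82.9% success rate, and the full sequence of detection, cropping, and progress estimation is confirmed to be effective.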

Link: https://arxiv.org/abs/2501.03533
Authors: Takumi Kitsukawa,Kazuma Miura,Shigeki Yumoto,Sarthak Pathak,Alessandro Moro,Kazunori Umeda
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been peer-reviewed, revised, and published in Advanced Robotics

Abstract:In this paper, a progress recognition method that considers occlusion using deep metric learning is proposed to visualize the product assembly process in a factory. First, the target assembly product is detected from images acquired from a fixed-point camera installed in the factory using a deep learning-based object detection method. Next, the detection area is cropped from the image. Finally, by using a classification method based on deep metric learning on the cropped image, the progress of the product assembly work is estimated as a rough progress step. As a specific progress estimation model, we propose an Anomaly Triplet-Net that adds anomaly samples to Triplet Loss for progress estimation considering occlusion. In experiments, an 82.9% success rate is achieved for the progress estimation method using Anomaly Triplet-Net. We also experimented with the practicality of the sequence of detection, cropping, and progress estimation, and confirmed the effectiveness of the overall system. Journal reference: Advanced Robotics (2024). Related DOI: https://doi.org/10.1080/01691864.2024.2422968
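
For readers unfamiliar with the Triplet Loss that Anomaly Triplet-Net builds on, here is a minimal sketch of the standard formulation, max(0, d(a, p) - d(a, n) + margin). The abstract does not say how the anomaly samples enter the loss, so this sketch covers only the baseline loss; the function names and the margin value are illustrative assumptions.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Standard triplet loss: pull the positive closer than the negative by `margin`.

    Baseline only; the paper's handling of anomaly (occluded) samples is
    not reproduced here because the abstract does not specify it.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # The negative is already far from the anchor, so the loss is zero.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    # The negative is too close, so the loss is positive.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
```

In a progress-estimation setting, the anchor and positive would be embeddings of images from the same assembly step and the negative an embedding from a different step.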

[CV-50] TexHOI: Reconstructing Textures of 3D Unknown Objects in Monocular Hand-Object Interaction Scenes ICCV

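[Quick read]: This paper addresses the difficulty of reconstructing 3D models of dynamic, real-world objects with high-fidelity textures from monocular frame sequences. The difficulty stems from shadows, indirect illumination, and inaccurate object-pose estimation caused by occluding hand-object interactions. The authors propose a method that predicts the hand's impact on environmental visibility and on indirect illumination of the object's surface albedo. The method first learns the geometry and low-fidelity texture of the object, hand, and background through composite rendering of radiance fields, while optimizing the hand and object poses for accurate object-pose estimation. It then refines physics-based rendering parameters (roughness, specularity, albedo, hand visibility, skin color reflections, and environmental illumination) to produce precise albedo and accurate hand illumination and shadow regions. The approach surpasses state-of-the-art methods in texture reconstruction and, to the best of the authors' knowledge, is the first to account for hand-object interactions in object texture reconstruction.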

Link: https://arxiv.org/abs/2501.03525
Authors: Alakh Aggarwal,Ningna Wang,Xiaohu Guo
Affiliations: Institution1; Institution2
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was accepted at ICCVM 2025 and will appear in the proceedings of IEEE TVCG as part of the conference

Abstract:Reconstructing 3D models of dynamic, real-world objects with high-fidelity textures from monocular frame sequences has been a challenging problem in recent years. This difficulty stems from factors such as shadows, indirect illumination, and inaccurate object-pose estimations due to occluding hand-object interactions. To address these challenges, we propose a novel approach that predicts the hand’s impact on environmental visibility and indirect illumination on the object’s surface albedo. Our method first learns the geometry and low-fidelity texture of the object, hand, and background through composite rendering of radiance fields. Simultaneously, we optimize the hand and object poses to achieve accurate object-pose estimations. We then refine physics-based rendering parameters - including roughness, specularity, albedo, hand visibility, skin color reflections, and environmental illumination - to produce precise albedo, and accurate hand illumination and shadow regions. Our approach surpasses state-of-the-art methods in texture reconstruction and, to the best of our knowledge, is the first to account for hand-object interactions in object texture reconstruction.

[CV-51] An Empirical Study of Accuracy-Robustness Tradeoff and Training Efficiency in Self-Supervised Learning

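[Quick read]: This paper addresses the inefficiency of self-supervised learning (SSL) under adversarial training, in particular the large number of epochs needed to converge. It revisits the robust EMP-SSL framework, which relies on multi-crop sampling and accelerates learning by increasing the number of crops per image. Unlike traditional contrastive learning, robust EMP-SSL combines an invariance term with regularization and reduces the number of training epochs, improving time efficiency. The paper further extends this approach with free adversarial training, introducing Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL), which reduces training time while improving both clean accuracy and adversarial robustness. The results show that robust crop-based EMP-SSL converges faster and balances clean accuracy and adversarial robustness better than multi-crop embedding aggregation, and they point to the practical potential of CF-AMC-SSL.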

Link: https://arxiv.org/abs/2501.03507
Authors: Fatemeh Ghofrani,Pooyan Jamshidi
Affiliations: College of Engineering and Computing, University of South Carolina
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies. Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at this https URL.

[CV-52] Can Deep Learning Trigger Alerts from Mobile-Captured Images?

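[Quick read]: This paper addresses how to use mobile camera image data for real-time air quality assessment and recommendation. The key to the solution is a regression-based Convolutional Neural Network (CNN) model tailored to air quality prediction by exploiting the inherent relationship between output parameters. The model obtains a Mean Squared Error (MSE) of 0.0077 and 0.0112 for 2 and 5 pollutants respectively, outperforming existing models. The paper also examines the common practice of augmenting the dataset to introduce more variation during training and finds minimal accuracy differences between the original and augmented datasets. Finally, it implements a real-time, user-friendly dashboard that dynamically displays the Air Quality Index (AQI) and pollutant values derived from captured mobile camera images and, taking users' health conditions into account, recommends whether a location is suitable. Overall, the work contributes to the verification of data augmentation techniques, CNN-based regression modelling for air quality prediction, and user-centric air quality monitoring through mobile technology.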

Link: https://arxiv.org/abs/2501.03499
Authors: Pritisha Sarkar,Duranta Durbaar Vishal Saha,Mousumi Saha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Our research presents a comprehensive approach to leveraging mobile camera image data for real-time air quality assessment and recommendation. We develop a regression-based Convolutional Neural Network model and tailor it explicitly for air quality prediction by exploiting the inherent relationship between output parameters. As a result, the Mean Squared Error of 0.0077 and 0.0112 obtained for 2 and 5 pollutants respectively outperforms existing models. Furthermore, we aim to verify the common practice of augmenting the original dataset with a view to introducing more variation in the training phase. It is one of our most significant contributions that our experimental results demonstrate minimal accuracy differences between the original and augmented datasets. Finally, a real-time, user-friendly dashboard is implemented which dynamically displays the Air Quality Index and pollutant values derived from captured mobile camera images. Users’ health conditions are considered to recommend whether a location is suitable based on current air quality metrics. Overall, this research contributes to verification of data augmentation techniques, CNN-based regression modelling for air quality prediction, and user-centric air quality monitoring through mobile technology. The proposed system offers practical solutions for individuals to make informed environmental health and well-being decisions.
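
As a reminder of what the reported Mean Squared Error measures for a multi-output regressor, the sketch below averages squared errors over every sample and every pollutant. Whether the paper averages in exactly this way is an assumption, and the function name and toy values are illustrative only.

```python
from typing import Sequence


def mean_squared_error(
    y_true: Sequence[Sequence[float]],
    y_pred: Sequence[Sequence[float]],
) -> float:
    """MSE averaged over all samples and all output dimensions.

    Each row is one image; each column is one predicted pollutant value.
    The averaging convention is an assumption, not taken from the paper.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    count = 0
    for t_row, p_row in zip(y_true, y_pred):
        if len(t_row) != len(p_row) or len(t_row) == 0:
            raise ValueError("each row must be non-empty and of equal length")
        for t, p in zip(t_row, p_row):
            total += (t - p) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Two images, two pollutants each (toy values).
    print(mean_squared_error([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]]))
```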

[CV-53] Textualize Visual Prompt for Image Editing via Diffusion Bridge AAAI2025

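[Quick read]: This paper addresses the limited scalability and generalization of current visual prompt methods for image editing. Existing methods rely on a pretrained text-guided image-to-image generative model that must be retrained on triplets of text, before image, and after image, which restricts the flexibility and applicability of editing. The paper proposes a framework built on any single text-to-image model, with no reliance on an explicit image-to-image model, which improves generalizability and scalability. The key ideas are: constructing a diffusion bridge with the probability-flow ordinary differential equation (ODE) to transfer between the distributions of the before and after images under text guidance; optimizing text embeddings through the bridge so that the editing transformation conveyed by the visual prompt is adaptively textualized; and introducing differential attention control so that the text embedding captures only the delicate editing transformation and generalizes to editing various images. Experiments on real images show competitive results in generalization, contextual coherence, and fidelity.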

Link: https://arxiv.org/abs/2501.03495
Authors: Pengcheng Xu,Qingnan Fan,Fei Kou,Shuai Qin,Hong Gu,Ruoyu Zhao,Charles Ling,Boyu Wang
Affiliations: 1. University of Western Ontario; 2. Tencent; 3. University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:A visual prompt, a pair of before-and-after edited images, can convey imagery transformations that are hard to describe in words and is well suited to image editing. However, current visual prompt methods rely on a pretrained text-guided image-to-image generative model that requires a triplet of text, before, and after images for retraining over a text-to-image model. Crafting such triplets and retraining limit the scalability and generalization of editing. In this paper, we present a framework based on any single text-to-image model without reliance on an explicit image-to-image model, thus enhancing generalizability and scalability. Specifically, by leveraging the probability-flow ordinary differential equation, we construct a diffusion bridge to transfer the distribution between before-and-after images under the text guidance. By optimizing the text via the bridge, the framework adaptively textualizes the editing transformation conveyed by visual prompts into text embeddings without other models. Meanwhile, we introduce differential attention control during text optimization, which disentangles the text embedding from the invariance of the before-and-after images and makes it solely capture the delicate transformation and generalize to edit various images. Experiments on real images validate competitive results on the generalization, contextual coherence, and high fidelity for delicate editing with just one image pair as the visual prompt.

[CV-54] SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

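[Quick read]: This paper addresses how to preserve subject fidelity in subject-driven text-to-image generation. Existing methods usually learn a subject representation and incorporate it into the prompt embedding to guide generation, but they struggle to keep the subject faithful. The paper proposes SceneBooth, a framework that fixes the input subject image and generates its background instead. SceneBooth has two key components: a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating scene layouts that align with the text caption, object phrases, and the subject's visual information. The latter integrates two adapters, ControlNet and Gated Self-Attention, into a latent diffusion model to generate a background that harmonizes with the subject under the guidance of the scene layout and text description. In this way the subject's appearance is preserved accurately in the output. Quantitative and qualitative experiments show that SceneBooth significantly outperforms baseline methods in subject preservation, image harmonization, and overall quality.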

Link: https://arxiv.org/abs/2501.03490
Authors: Shang Chai,Zihang Lin,Min Zhou,Xubin Li,Liansheng Zhuang,Houqiang Li
Affiliations: University of Science and Technology of China; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to the demand for personalizing image generation, subject-driven text-to-image generation, which creates novel renditions of an input subject based on text prompts, has received growing research interest. Existing methods often learn subject representation and incorporate it into the prompt embedding to guide image generation, but they struggle with preserving subject fidelity. To solve this issue, this paper proposes a novel framework named SceneBooth for subject-preserved text-to-image generation, which takes a subject image, object phrases, and text prompts as input. Instead of learning the subject representation and generating a subject, our SceneBooth fixes the given subject image and generates its background image guided by the text prompts. To this end, our SceneBooth introduces two key components, i.e., a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating appropriate scene layouts that align with text captions, object phrases, and subject visual information. The latter integrates two adapters (ControlNet and Gated Self-Attention) into the latent diffusion model to generate a background that harmonizes with the subject guided by scene layouts and text descriptions. In this manner, our SceneBooth ensures accurate preservation of the subject’s appearance in the output. Quantitative and qualitative experimental results demonstrate that SceneBooth significantly outperforms baseline methods in terms of subject preservation, image harmonization and overall quality.

[CV-55] VOILA: Complexity-Aware Universal Segmentation of CT images by Voxel Interacting with Language AAAI2025

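[Quick read]: This paper addresses two main problems in universal segmentation of CT images: the imbalance in information density between 3D images and text prompts, and the difficulty that the standard fully connected layer segmentation approach has with multiple classes, along with its poor generalizability. It proposes the VOxel Interacting with LAnguage method (VOILA). The key elements are: 1) aligning voxels and language in a shared representation space and classifying voxels by cosine similarity; 2) a Voxel-Language Interaction framework that mitigates the class imbalance caused by foreground-background discrepancies and variations in target volume; and 3) a Complexity-Aware Sampling method that focuses on hard-to-segment regions by generating pseudo-heatmaps from a trainable Gaussian mixture distribution. Experiments show that VOILA improves performance with fewer parameters and lower computational cost, and that it generalizes well across datasets without additional fine-tuning.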

Link: https://arxiv.org/abs/2501.03482
Authors: Zishuo Wan,Yu Gao,Wanyuan Pang,Dawei Ding
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Satisfactory progress has been achieved recently in universal segmentation of CT images. Following the success of vision-language methods, there is a growing trend towards utilizing text prompts and contrastive learning to develop universal segmentation models. However, there exists a significant imbalance in information density between 3D images and text prompts. Moreover, the standard fully connected layer segmentation approach faces significant challenges in handling multiple classes and exhibits poor generalizability. To address these challenges, we propose the VOxel Interacting with LAnguage method (VOILA) for universal CT image segmentation. Initially, we align voxels and language into a shared representation space and classify voxels on the basis of cosine similarity. Subsequently, we develop the Voxel-Language Interaction framework to mitigate the impact of class imbalance caused by foreground-background discrepancies and variations in target volumes. Furthermore, a Complexity-Aware Sampling method is proposed to focus on regions that are hard to segment, achieved by generating pseudo-heatmaps from a trainable Gaussian mixture distribution. Our results indicate that the proposed VOILA is capable of achieving improved performance with reduced parameters and computational cost during training. Furthermore, it demonstrates significant generalizability across diverse datasets without additional fine-tuning.
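
The step of classifying voxels by cosine similarity in a shared space can be illustrated with a small sketch. It assumes each voxel and each class prompt has already been embedded as a plain vector; the helper names, the class labels, and the toy vectors are hypothetical and do not come from the VOILA code.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_voxel(
    voxel_embedding: Sequence[float],
    class_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the class whose text embedding is most similar to the voxel embedding."""
    if not class_embeddings:
        raise ValueError("class_embeddings must not be empty")
    return max(
        class_embeddings,
        key=lambda name: cosine_similarity(voxel_embedding, class_embeddings[name]),
    )


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for text-encoder outputs.
    classes = {"liver": [1.0, 0.0], "kidney": [0.0, 1.0]}
    print(classify_voxel([0.9, 0.1], classes))
```

Because the class set is just a dictionary of text embeddings, new classes can be added without changing an output layer, which is one reason a similarity-based classifier may generalize better than a fixed fully connected head.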
zh
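
The cosine-similarity classification step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `classify_voxels`, the array shapes, and the toy data are assumptions made for illustration only, and no encoder or training is involved.

```python
import numpy as np

def classify_voxels(voxel_feats, text_embeds):
    """Assign each voxel to the class whose text embedding is most similar.

    voxel_feats: (D, H, W, C) per-voxel features (hypothetical shape).
    text_embeds: (K, C) array, one embedding per class prompt.
    Returns a (D, H, W) array of class indices.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + 1e-8)
    sims = v @ t.T  # (D, H, W, K) cosine similarities
    return sims.argmax(axis=-1)

rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 16))
# Toy volume: every voxel is a slightly noisy copy of class 2's embedding.
voxel_feats = text_embeds[2] + 0.01 * rng.normal(size=(2, 4, 4, 16))
labels = classify_voxels(voxel_feats, text_embeds)
```

Because the decision is a nearest-embedding lookup, adding a class only requires a new text embedding rather than a new output channel, which is one plausible reading of why this design scales to many classes.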

[CV-56] Hyperbolic Binary Neural Network

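[Quick Read]: This paper addresses the constrained optimization problem that arises when training Binary Neural Networks (BNNs). A BNN quantizes weights and activations to 1 bit and is typically optimized in the binarized space, whereas general neural networks are optimized without constraints in a continuous space. To tackle this, the paper proposes the Hyperbolic Binary Neural Network (HBNN), built on the framework of hyperbolic geometry. The key components are: 1) using the Riemannian exponential map to transform the constrained problem in hyperbolic space into an unconstrained one in Euclidean space; 2) the Exponential Parametrization Cluster (EPC) method, which shrinks the segment domain based on a diffeomorphism and thereby increases the probability of weight flips, maximizing the information gain in BNNs. Experiments on CIFAR10, CIFAR100, and ImageNet show that HBNN outperforms state-of-the-art methods.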

Link: https://arxiv.org/abs/2501.03471
Authors: Jun Chen, Jingyang Xiang, Tianxin Huang, Xiangrui Zhao, Yong Liu
Affiliations: National Special Education Resource Center for Children with Autism, Zhejiang Normal University; Institute of Cyber-Systems and Control, Zhejiang University; School of Computer Science and Technology, Zhejiang Normal University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Binary Neural Network (BNN) converts full-precision weights and activations into their extreme 1-bit counterparts, making it particularly suitable for deployment on lightweight mobile devices. While binary neural networks are typically formulated as a constrained optimization problem and optimized in the binarized space, general neural networks are formulated as an unconstrained optimization problem and optimized in the continuous space. This paper introduces the Hyperbolic Binary Neural Network (HBNN) by leveraging the framework of hyperbolic geometry to optimize the constrained problem. Specifically, we transform the constrained problem in hyperbolic space into an unconstrained one in Euclidean space using the Riemannian exponential map. On the other hand, we also propose the Exponential Parametrization Cluster (EPC) method, which, compared to the Riemannian exponential map, shrinks the segment domain based on a diffeomorphism. This approach increases the probability of weight flips, thereby maximizing the information gain in BNNs. Experimental results on CIFAR10, CIFAR100, and ImageNet classification datasets with VGGsmall, ResNet18, and ResNet34 models illustrate the superior performance of our HBNN over state-of-the-art methods.

[CV-57] Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning

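[Quick Read]: This paper addresses self-supervised learning (SSL) for image representation learning, in particular how to use large-scale unannotated datasets to improve downstream tasks. It proposes a novel SSL approach called Information-Maximized Soft Variable Discretization (IMSVD). The key idea is to softly discretize each variable in the latent space so that its probability distribution can be estimated over training batches, which lets information measures guide the learning process directly. Motivated by the MultiView assumption, the paper proposes an information-theoretic objective function for learning transform-invariant, non-trivial, and redundancy-minimized representation features. It also derives a joint cross-entropy loss function for self-supervised image representation learning that is theoretically superior to existing methods at reducing feature redundancy. Notably, the non-contrastive IMSVD method statistically performs contrastive learning, and it shows advantages in both accuracy and efficiency on multiple downstream tasks. Thanks to variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level, and the method has the potential to be adapted to other learning paradigms.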

Link: https://arxiv.org/abs/2501.03469
Authors: Chuang Niu, Wenjun Xia, Hongming Shan, Ge Wang
Affiliations: Department of Biomedical Engineering, Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute; Institute of Science and Technology for Brain-inspired Intelligence and MOE Frontiers Center for Brain Science and Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University; Shanghai Center for Brain Science and Brain-inspired Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today’s vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint cross-entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at this https URL.
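
To make the idea of soft discretization concrete, here is a minimal sketch under stated assumptions. It assumes each latent variable is represented by `K` logits that a softmax turns into a soft one-hot code, and that a variable's bin distribution is estimated by averaging codes over the batch. The shapes and function names are hypothetical, and the paper's actual objective and loss are not reproduced here.

```python
import numpy as np

def soft_discretize(logits):
    """Softmax over the last axis: (B, V, K) logits -> soft one-hot codes."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def batch_entropy(codes):
    """Entropy (in nats) of each variable's bin distribution over the batch."""
    p = codes.mean(axis=0)  # (V, K) marginal distribution per variable
    return -(p * np.log(p + 1e-12)).sum(axis=-1)  # shape (V,)

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 8, 10))  # batch of 256, 8 variables, 10 bins
codes = soft_discretize(logits)
entropy = batch_entropy(codes)
```

Once each variable has an explicit batch-level distribution like `p`, quantities such as entropy can be written as differentiable functions of the network outputs, which is what allows information measures to act as training signals.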

[CV-58] ScaleMAI: Accelerating the Development of Trusted Datasets and AI Models

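[Quick Read]: This paper addresses the slow and inefficient construction of datasets in Medical AI (MAI) research. Conventionally, data creation and model development are treated as separate, sequential steps, which delays the benefits of AI. The proposed solution is ScaleMAI, an agent for AI-integrated data curation and annotation that improves data quality and AI performance together in a self-reinforcing cycle, reducing development time from years to months. Its key element is a progressive human-in-the-loop iteration process that produces a large, high-quality dataset and trains an AI model approaching expert-level proficiency. Taking pancreatic tumor detection as an example, ScaleMAI builds a dataset of 25,362 CT scans and, through iterative refinement, substantially improves tumor detection, segmentation, and classification.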

Link: https://arxiv.org/abs/2501.03410
Authors: Wenxuan Li, Pedro R. A. S. Bassi, Tianyu Lin, Yu-Cheng Chou, Xinze Zhou, Yucheng Tang, Fabian Isensee, Kang Wang, Qi Chen, Xiaowei Xu, Xiaoxi Chen, Lizhou Wu, Qilong Wu, Yannick Kirchhoff, Maximilian Rokuss, Saikat Roy, Yuxuan Zhao, Dexin Yu, Kai Ding, Constantin Ulrich, Klaus Maier-Hein, Yang Yang, Alan L. Yuille, Zongwei Zhou
Affiliations: Johns Hopkins University; University of Bologna; Italian Institute of Technology; NVIDIA; DKFZ; University of California, San Francisco; University of Chinese Academy of Sciences; Guangdong Provincial People’s Hospital; University of Illinois Urbana-Champaign; The First Affiliated Hospital of Shandong First Medical University; National University of Singapore; Qilu Hospital of Shandong University; Johns Hopkins Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Building trusted datasets is critical for transparent and responsible Medical AI (MAI) research, but creating even small, high-quality datasets can take years of effort from multidisciplinary teams. This process often delays AI benefits, as human-centric data creation and AI-centric model development are treated as separate, sequential steps. To overcome this, we propose ScaleMAI, an agent of AI-integrated data curation and annotation, allowing data quality and AI performance to improve in a self-reinforcing cycle and reducing development time from years to months. We adopt pancreatic tumor detection as an example. First, ScaleMAI progressively creates a dataset of 25,362 CT scans, including per-voxel annotations for benign/malignant tumors and 24 anatomical structures. Second, through progressive human-in-the-loop iterations, ScaleMAI provides Flagship AI Model that can approach the proficiency of expert annotators (30-year experience) in detecting pancreatic tumors. Flagship Model significantly outperforms models developed from smaller, fixed-quality datasets, with substantial gains in tumor detection (+14%), segmentation (+5%), and classification (72%) on three prestigious benchmarks. In summary, ScaleMAI transforms the speed, scale, and reliability of medical dataset creation, paving the way for a variety of impactful, data-driven applications.

[CV-59] Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

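[Quick Read]: This paper addresses the high storage requirements that 3D Gaussian Splatting faces in practical applications. Although the method offers high rendering quality and speed for 3D scene representation, its large data volume limits wider use. The paper therefore proposes an efficient compression technique that significantly reduces storage overhead through a compact representation. The key is a unified architecture that combines point cloud data and feature planes through a progressive tri-plane structure. The method uses 2D feature planes to obtain a continuous spatial representation and further optimizes them with entropy modeling in the frequency domain, designed specifically for standard video codecs. The paper also proposes channel-wise bit allocation to achieve a better trade-off between bitrate consumption and feature plane representation. Together, these techniques let the model exploit spatial correlations within the feature planes and improve rate-distortion performance with standard, non-differentiable video codecs. Experiments show that the method outperforms existing methods in data compactness while maintaining high rendering quality.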

Link: https://arxiv.org/abs/2501.03399
Authors: Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, Cornelius Hellge
Affiliations: Fraunhofer Heinrich-Hertz-Institute (HHI), Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: None

Abstract:3D Gaussian Splatting is a recognized method for 3D scene representation, known for its high rendering quality and speed. However, its substantial data requirements present challenges for practical applications. In this paper, we introduce an efficient compression technique that significantly reduces storage overhead by using compact representation. We propose a unified architecture that combines point cloud data and feature planes through a progressive tri-plane structure. Our method utilizes 2D feature planes, enabling continuous spatial representation. To further optimize these representations, we incorporate entropy modeling in the frequency domain, specifically designed for standard video codecs. We also propose channel-wise bit allocation to achieve a better trade-off between bitrate consumption and feature plane representation. Consequently, our model effectively leverages spatial correlations within the feature planes to enhance rate-distortion performance using standard, non-differentiable video codecs. Experimental results demonstrate that our method outperforms existing methods in data compactness while maintaining high rendering quality. Our project page is available at this https URL

[CV-60] DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Generative Learning on 3D Meshes

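[Quick Read]: This paper addresses the challenge of generating continuous signal distributions on 3D mesh surfaces, that is, signals residing on a curved manifold surface. Previous methods typically rely on unrolling 3D meshes into 2D or on field representations; the proposed DoubleDiffusion framework instead combines heat dissipation diffusion with denoising diffusion to perform generative learning directly on the 3D mesh surface. The key is to process features with the Laplace-Beltrami operator, which respects the mesh structure and enables geometry-aware signal diffusion. The approach can generate complex RGB signal distributions and achieves per-category, shape-conditioned texture generation across different geometries, opening a new direction for diffusion-based generative modeling on 3D surfaces.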

Link: https://arxiv.org/abs/2501.03397
Authors: Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li
Affiliations: The Australian National University; Tencent XR Vision Labs; King Abdullah University of Science and Technology; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:This paper proposes DoubleDiffusion, a novel framework that combines heat dissipation diffusion and denoising diffusion for direct generative learning on 3D mesh surfaces. Our approach addresses the challenges of generating continuous signal distributions residing on a curved manifold surface. Unlike previous methods that rely on unrolling 3D meshes into 2D or adopting field representations, DoubleDiffusion leverages the Laplace-Beltrami operator to process features respecting the mesh structure. This combination enables effective geometry-aware signal diffusion across the underlying geometry. As shown in the paper's teaser figure, we demonstrate that DoubleDiffusion has the ability to generate RGB signal distributions on complex 3D mesh surfaces and achieves per-category shape-conditioned texture generation across different shape geometry. Our work contributes a new direction in diffusion-based generative modeling on 3D surfaces, with potential applications in the field of 3D asset generation.
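
As a rough illustration of heat diffusion on a mesh, the sketch below applies one implicit Euler step of the heat equation to per-vertex RGB values. It makes a simplifying assumption that is not from the paper: a uniform graph Laplacian built from the triangle faces stands in for a properly discretized Laplace-Beltrami operator (for example, a cotangent Laplacian). Function names and the toy mesh are hypothetical.

```python
import numpy as np

def graph_laplacian(num_vertices, faces):
    """Uniform graph Laplacian L = D - A built from triangle faces."""
    A = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def heat_diffuse(signal, L, t):
    """One implicit Euler step of du/dt = -L u: solve (I + t L) u = signal."""
    return np.linalg.solve(np.eye(L.shape[0]) + t * L, signal)

# Toy mesh: a square split into two triangles, with one RGB value per vertex.
faces = [(0, 1, 2), (0, 2, 3)]
L = graph_laplacian(4, faces)
rgb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])
smoothed = heat_diffuse(rgb, L, t=0.5)
```

The step spreads each channel along mesh edges while preserving its mean over the vertices, so information moves according to surface connectivity rather than a 2D image grid.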

[CV-61] License Plate Images Generation with Diffusion Models

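[Quick Read]: This paper addresses the limited number of publicly available license plate recognition (LPR) datasets, a consequence of privacy regulations such as the General Data Protection Regulation (GDPR). To tackle this, it proposes synthetic data generation based on diffusion models to produce realistic license plate images. The key steps are to train a diffusion model to generate synthetic license plates and to validate the synthetic data experimentally on the LPR task. The results show that, despite an initial performance gap between models trained on synthetic data and models trained on real data, expanding the training set with pseudolabeled synthetic data improves LPR accuracy by 3% over the baseline. The paper also releases a dataset of 10,000 synthetic license plate images for further research.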

Link: https://arxiv.org/abs/2501.03374
Authors: Mariia Shpir, Nadiya Shvai, Amir Nakib
Affiliations: National University of Kyiv-Mohyla Academy; Cyclope.ai, VINCI Autoroutes; University Paris Est Créteil, Laboratoire LISSI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: None

Abstract:Despite the evident practical importance of license plate recognition (LPR), corresponding research is limited by the volume of publicly available datasets due to privacy regulations such as the General Data Protection Regulation (GDPR). To address this challenge, synthetic data generation has emerged as a promising approach. In this paper, we propose to synthesize realistic license plates (LPs) using diffusion models, inspired by recent advances in image and video generation. In our experiments a diffusion model was successfully trained on a Ukrainian LP dataset, and 1000 synthetic images were generated for detailed analysis. Through manual classification and annotation of the generated images, we performed a thorough study of the model output, such as success rate, character distributions, and type of failures. Our contributions include experimental validation of the efficacy of diffusion models for LP synthesis, along with insights into the characteristics of the generated data. Furthermore, we have prepared a synthetic dataset consisting of 10,000 LP images, publicly available at this https URL. Conducted experiments empirically confirm the usefulness of synthetic data for the LPR task. Despite the initial performance gap between the model trained with real and synthetic data, the expansion of the training data set with pseudolabeled synthetic data leads to an improvement in LPR accuracy by 3% compared to baseline.

[CV-62] FTA-FTL: A Fine-Tuned Aggregation Federated Transfer Learning Scheme for Lithology Microscopic Image Classification

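[Quick Read]: This paper addresses the challenges that data privacy and limited dataset size pose for lithology microscopic image classification. Lithology discrimination is a key activity in characterizing oil reservoirs, and processing lithology microscopic images is an important technique for studying fossils and minerals and for the geological assessment of shale oil exploration. However, collecting and producing large datasets is difficult, and because of data privacy concerns, individuals, organizations, and industry companies are often unwilling to share sensitive data.

The key to the solution is combining transfer learning with Federated Learning (FL). First, the paper performs lithology microscopic image classification on a small dataset using transfer learning and compares several pre-trained deep learning architectures. Second, it formulates the task as a Federated Transfer Learning (FTL) scheme and proposes a Fine-Tuned Aggregation strategy for Federated Learning (FTA-FTL). This strategy trains a highly accurate central model across multiple decentralized edge servers without transferring sensitive data, which preserves data privacy and enhances security. Experiments show that FTA-FTL achieves approximately the same results as a centralized implementation on this classification task.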

Link: https://arxiv.org/abs/2501.03349
Authors: Keyvan RahimiZadeh, Ahmad Taheri, Jan Baumbach, Esmael Makarian, Abbas Dehghani, Bahman Ravaei, Bahman Javadi, Amin Beheshti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Lithology discrimination is a crucial activity in characterizing oil reservoirs, and processing lithology microscopic images is an essential technique for investigating fossils and minerals and geological assessment of shale oil exploration. In this way, Deep Learning (DL) technique is a powerful approach for building robust classifier models. However, there is still a considerable challenge to collect and produce a large dataset. Transfer-learning and data augmentation techniques have emerged as popular approaches to tackle this problem. Furthermore, due to different reasons, especially data privacy, individuals, organizations, and industry companies often are not willing to share their sensitive data and information. Federated Learning (FL) has emerged to train a highly accurate central model across multiple decentralized edge servers without transferring sensitive data, preserving sensitive data, and enhancing security. This study involves two phases; the first phase is to conduct Lithology microscopic image classification on a small dataset using transfer learning. In doing so, various pre-trained DL model architectures are comprehensively compared for the classification task. In the second phase, we formulated the classification task to a Federated Transfer Learning (FTL) scheme and proposed a Fine-Tuned Aggregation strategy for Federated Learning (FTA-FTL). In order to perform a comprehensive experimental study, several metrics such as accuracy, f1 score, precision, specificity, sensitivity (recall), and confusion matrix are taken into account. The results are in excellent agreement and confirm the efficiency of the proposed scheme, and show that the proposed FTA-FTL algorithm is capable enough to achieve approximately the same results obtained by the centralized implementation for Lithology microscopic images classification task.
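
For readers unfamiliar with federated aggregation, the following sketch shows plain federated averaging (FedAvg), in which a server combines client parameters weighted by local dataset size. This is the generic baseline, not the paper's FTA-FTL strategy, whose details the abstract does not give. The parameter names (`head.weight`, `head.bias`) and client sizes are hypothetical.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of per-client parameter dicts (plain FedAvg)."""
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in keys
    }

# Two hypothetical clients that each fine-tuned the same small classifier head.
clients = [
    {"head.weight": np.array([1.0, 2.0]), "head.bias": np.array([0.0])},
    {"head.weight": np.array([3.0, 4.0]), "head.bias": np.array([1.0])},
]
global_weights = federated_average(clients, client_sizes=[100, 300])
```

Only parameters cross the network in this scheme; the microscopic images themselves stay on each client, which is the privacy property the summary emphasizes.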

[CV-63] Mobile Augmented Reality Framework with Fusional Localization and Pose Estimation

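[Quick Read]: This paper addresses localization and pose estimation for mobile augmented reality (AR) systems in indoor environments. GPS-based mobile AR systems perform poorly indoors, while earlier vision-based pose estimation methods must continuously track predefined markers within a short distance, which hurts the user experience. The paper proposes an effective indoor mobile AR framework whose key components are a fusional localization method and a new pose estimation implementation. By fusing multiple localization techniques, the framework raises the overall matching rate and thus the accuracy of the AR display. Experiments show that, with an average sampling grid length of 0.5 m, the framework achieves low average error distances (0.61-0.81 m) and accurate matching rates (77%-82%), outperforming approaches based purely on images or Wi-Fi signals.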

Link: https://arxiv.org/abs/2501.03336
Authors: Songlin Hou, Fangzhou Lin, Yunmei Huang, Zhe Peng, Bin Xiao
Affiliations: Department of Computer Science, Worcester Polytechnic Institute; Department of Robotics Engineering, Worcester Polytechnic Institute; Unprecedented-scale Data Analytics Center, Tohoku University; Department of Forestry and Natural Resources, Purdue University; Department of Computing, The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:As a novel way of presenting information, augmented reality (AR) enables people to interact with the physical world in a direct and intuitive way. While there are some mobile AR products implemented with specific hardware at a high cost, the software approaches of AR implementation on mobile platforms (such as smartphones, tablet PCs, etc.) are still far from practical use. GPS-based mobile AR systems usually perform poorly due to the inaccurate positioning in the indoor environment. Previous vision-based pose estimation methods need to continuously track predefined markers within a short distance, which greatly degrades user experience. This paper first conducts a comprehensive study of the state-of-the-art AR and localization systems on mobile platforms. Then, we propose an effective indoor mobile AR framework. In the framework, a fusional localization method and a new pose estimation implementation are developed to increase the overall matching rate and thus improve AR display accuracy. Experiments show that our framework has higher performance than approaches purely based on images or Wi-Fi signals. We achieve low average error distances (0.61-0.81m) and accurate matching rates (77%-82%) when the average sampling grid length is set to 0.5m.

[CV-64] CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets WACV

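[Quick Read]: This paper addresses two main challenges in cross-learning: inhomogeneous or insufficient training data, and a lack of resources for retraining large pretrained models. It proposes CM3T, a new model-agnostic plugin architecture that, drawing on adapters and prefix tuning, adapts transformer-based models to new or missing information. The key components are two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes more efficient because the backbone and other plugins do not need to be fine-tuned along with these additions. Experiments show that, with only 12.8% trainable parameters relative to the backbone for video input and only 22.3% trainable parameters for two additional modalities, CM3T matches or even exceeds state-of-the-art results.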

Link: https://arxiv.org/abs/2501.03332
Authors: Tanay Agrawal, Mohammed Guermal, Michal Balazia, Francois Bremond
Affiliations: INRIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Final paper accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, February, 2025. 10 pages

Abstract:Challenges in cross-learning involve inhomogeneous or even inadequate amount of training data and lack of resources for retraining large pretrained models. Inspired by transfer learning techniques in NLP, adapters and prefix tuning, this paper presents a new model-agnostic plugin architecture for cross-learning, called CM3T, that adapts transformer-based models to new or missing information. We introduce two adapter blocks: multi-head vision adapters for transfer learning and cross-attention adapters for multimodal learning. Training becomes substantially efficient as the backbone and other plugins do not need to be finetuned along with these additions. Comparative and ablation studies on three datasets Epic-Kitchens-100, MPIIGroupInteraction and UDIVA v0.5 show efficacy of this framework on different recording settings and tasks. With only 12.8% trainable parameters compared to the backbone to process video input and only 22.3% trainable parameters for two additional modalities, we achieve comparable and even better results than the state-of-the-art. CM3T has no specific requirements for training or pretraining and is a step towards bridging the gap between a general model and specific practical applications of video classification.
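
The adapter idea that CM3T borrows from NLP can be sketched as a small residual bottleneck inserted next to a frozen backbone. The code below is the generic bottleneck adapter, not CM3T's multi-head vision adapter or cross-attention adapter, whose internals the abstract does not give. The dimensions (768, 64) and the token shape are illustrative assumptions.

```python
import numpy as np

class BottleneckAdapter:
    """Residual bottleneck adapter: y = x + relu(x @ W_down) @ W_up."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        # Zero-initialised up-projection: the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        return x + np.maximum(x @ self.w_down, 0.0) @ self.w_up

    def num_params(self):
        return self.w_down.size + self.w_up.size

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(dim=768, bottleneck=64, rng=rng)
tokens = rng.normal(size=(4, 197, 768))  # (batch, tokens, dim) from a frozen backbone
out = adapter(tokens)
```

Only the two small projection matrices would be trained in such a setup, which is why adapter-style methods can report a trainable parameter count that is a small fraction of the backbone's.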

[CV-65] Plant Leaf Disease Detection and Classification Using Deep Learning: A Review and A Proposed System on Bangladesh's Perspective

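[Quick Read]: This paper addresses the detection and classification of plant diseases in agriculture, particularly in Bangladesh, where plant diseases seriously hamper agricultural production and, with it, poverty reduction and food security. Conventional detection relies on visual inspection and often identifies a disease only once it is advanced, by which point applying inorganic chemicals or pesticides is largely ineffective. The paper proposes a deep learning approach to leaf image classification that uses a convolutional neural network (CNN) to identify and classify plant diseases precisely. The key is to train and test the CNN on a dataset of 17,430 images collected from Kaggle, covering 14 classes across three crops (bell peppers, tomatoes, and potatoes). The model performs well at detecting and classifying the tested diseases and may be useful in crop disease management.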

Link: https://arxiv.org/abs/2501.03305
Authors: Md. Jalal Uddin Chowdhury, Zumana Islam Mou, Rezwana Afrin, Shafkat Kibria
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: None

Abstract:Agriculture is a crucial part of Bangladeshi people’s employment, GDP contribution, and, above all, livelihood. It plays a vital role in decreasing poverty and ensuring food security. Plant diseases are a serious stumbling block in agricultural production in Bangladesh. At times, humans can’t detect the disease from an infected leaf with the naked eye. Applying inorganic chemicals or pesticides to plants when it’s too late is in vain most of the time, wasting all the previous labor. The deep-learning technique of leaf-based image classification, which has shown impressive results, can make the work of recognizing and classifying diseases easier and more precise. In this paper, we mainly propose a better model for the detection of leaf diseases. Our work covers data on three different kinds of crops: bell peppers, tomatoes, and potatoes. For training and testing the proposed CNN model, the plant leaf disease dataset collected from Kaggle is used, which has 17,430 images. The images are labeled with 14 separate classes of damage. The developed CNN model performs efficiently and could successfully detect and classify the tested diseases. The proposed CNN model may have great potential in crop disease management.

[CV-66] OpenLKA: an open dataset of lane keeping assist from market autonomous vehicles

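[Quick Read]: This paper addresses the real-world performance and safety of Lane Keeping Assist (LKA) systems, whose operational characteristics and safety performance remain underexplored because of a lack of real-world testing and comprehensive data. To fill this gap, the authors extensively tested mainstream LKA systems in Tampa, Florida, and used an innovative method to collect a comprehensive dataset comprising full Controller Area Network (CAN) messages together with video, perception, and lateral trajectory data. The data come from a high-quality front-facing camera equipped with advanced vision detection and trajectory planning algorithms and cover challenging conditions such as complex road geometry, adverse weather, and degraded lane markings. A Vision Language Model (VLM) additionally annotates the videos with weather, lighting, and traffic features. Based on this dataset, the paper gives an empirical analysis of LKA operational characteristics and safety performance and makes recommendations, including improvements in road geometry and pavement maintenance and the development of more human-like LKA systems through VLM fine-tuning and Chain of Thought reasoning.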

Link: https://arxiv.org/abs/2501.03287
Authors: Yuhang Wang, Abdulaziz Alhuraish, Shengming Yuan, Shuyi Wang, Hao Zhou
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: None

Abstract:The Lane Keeping Assist (LKA) system has become a standard feature in recent car models. While marketed as providing auto-steering capabilities, the system’s operational characteristics and safety performance remain underexplored, primarily due to a lack of real-world testing and comprehensive data. To fill this gap, we extensively tested mainstream LKA systems from leading U.S. automakers in Tampa, Florida. Using an innovative method, we collected a comprehensive dataset that includes full Controller Area Network (CAN) messages with LKA attributes, as well as video, perception, and lateral trajectory data from a high-quality front-facing camera equipped with advanced vision detection and trajectory planning algorithms. Our tests spanned diverse, challenging conditions, including complex road geometry, adverse weather, degraded lane markings, and their combinations. A vision language model (VLM) further annotated the videos to capture weather, lighting, and traffic features. Based on this dataset, we present an empirical overview of LKA’s operational features and safety performance. Key findings indicate: (i) LKA is vulnerable to faint markings and low pavement contrast; (ii) it struggles in lane transitions (merges, diverges, intersections), often causing unintended departures or disengagements; (iii) steering torque limitations lead to frequent deviations on sharp turns, posing safety risks; and (iv) LKA systems consistently maintain rigid lane-centering, lacking adaptability on tight curves or near large vehicles such as trucks. We conclude by demonstrating how this dataset can guide both infrastructure planning and self-driving technology. In view of LKA’s limitations, we recommend improvements in road geometry and pavement maintenance. Additionally, we illustrate how the dataset supports the development of human-like LKA systems via VLM fine-tuning and Chain of Thought reasoning.

[CV-67] Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition ICML2024

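[Quick Read]: This paper addresses the difficulty that existing video understanding research has in achieving in-depth comprehension and reasoning on complex videos, which stems from two under-explored bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. It first introduces MotionEpic, a new video Multimodal Large Language Model (MLLM) that achieves fine-grained pixel-level spatial-temporal video grounding by integrating a video spatial-temporal scene graph (STSG) representation. Building on MotionEpic, it then develops the Video-of-Thought (VoT) reasoning framework. VoT inherits the core idea of Chain-of-Thought (CoT): it breaks a complex task into simpler, manageable sub-problems and addresses them step by step, from low-level pixel perception to high-level cognitive interpretation. Extensive experiments on various complex video QA benchmarks show that the framework markedly improves on the existing state of the art. According to the authors, this is the first successful application of CoT to human-level video reasoning, and it shows strong potential for a wider range of video understanding scenarios.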

Link: https://arxiv.org/abs/2501.03230
Authors: Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:Existing research of video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from a low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. Project is open at this https URL

[CV-68] Explainable AI model reveals disease-related mechanisms in single-cell RNA-seq data

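[Quick Read]: This paper addresses the poorly understood mechanisms and lack of effective treatment of neurodegenerative diseases (NDDs). Combining single-nucleus RNA sequencing (snRNA-seq) with neural network (NN) models, it proposes an explainable AI (XAI) method for identifying disease-related genes and explaining the mechanisms of disease progression. The key is to pair the NN model with SHAP (SHapley Additive exPlanations) and to use Gene Set Enrichment Analysis (GSEA) to compare differential gene expression analysis (DGE) with the SHAP-based approach, which gives a more complete picture of disease-related genes and pathways. The results show that DGE and SHAP identify both shared and distinct sets of altered genes and pathways, reinforcing the usefulness of XAI methods for a broader perspective on disease.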

Link: https://arxiv.org/abs/2501.03923
Authors: Mohammad Usman, Olga Varea, Petia Radeva, Josep Canals, Jordi Abante, Daniel Ortiz
Affiliations: Dept. of Mathematics & Computer Science, Universitat de Barcelona; Dept. of Biomedical Sciences, Faculty of Medicine and Health Sciences, Institute of Neuroscience, University of Barcelona; Creatio, Production and Validation Center of advanced therapies, Universitat de Barcelona; Institut d’Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS)
Categories: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: None

Abstract:Neurodegenerative diseases (NDDs) are complex and lack effective treatment due to their poorly understood mechanism. The increasingly used data analysis from Single nucleus RNA Sequencing (snRNA-seq) allows transcriptomic events to be explored at a single-cell level, yet faces challenges in interpreting the mechanisms underlying a disease. On the other hand, Neural Network (NN) models can handle complex data to offer insights but can be seen as black boxes with poor interpretability. In this context, explainable AI (XAI) emerges as a solution that could help to understand disease-associated mechanisms when combined with efficient NN models. However, limited research explores XAI in single-cell data. In this work, we implement a method for identifying disease-related genes and the mechanistic explanation of disease progression based on an NN model combined with SHAP. We analyze available Huntington’s disease (HD) data to identify both HD-altered genes and mechanisms by adding Gene Set Enrichment Analysis (GSEA) comparing two methods, differential gene expression analysis (DGE) and NN combined with SHAP approach. Our results show that DGE and SHAP approaches offer both common and differential sets of altered genes and pathways, reinforcing the usefulness of XAI methods for a broader perspective of disease.
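
SHAP attributions are based on Shapley values, which can be computed exactly for a model with only a few inputs. The sketch below enumerates all feature subsets in pure Python. It is an illustration of the underlying quantity, not the SHAP library or the authors' pipeline: the three-feature `model`, the input `x`, and the all-zero `baseline` are made up, and exact enumeration is infeasible for thousands of genes, where approximations are used instead.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                s = set(s)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy "expression -> disease score" model with an interaction between genes 0 and 1.
def model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[1] - 3.0 * z[2]

x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

The attributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is split evenly between the two genes involved, which is the kind of per-gene credit assignment that a univariate test such as DGE does not provide.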

[CV-69] SELMA3D challenge: Self-supervised learning for 3D light-sheet microscopy image segmentation

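[Quick Read]: This paper addresses the high sensitivity of deep learning segmentation models to domain shift in light-sheet microscopy and 3D imaging: accuracy drops significantly when a model is applied to data outside its training distribution. To address this, it proposes using self-supervised learning to train more generalizable segmentation models. The key resource is the large collection of light-sheet microscopy images provided by the SELMA3D challenge, which comprises 35 large 3D images from mouse and human brains and 315 annotated small patches for fine-tuning, preliminary testing, and final testing. The results from most participating teams indicate that self-supervised learning on large datasets improves the performance and generalization of segmentation models.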

Link: https://arxiv.org/abs/2501.03880
Authors: Ying Chen, Rami Al-Maskari, Izabela Horvath, Mayar Ali, Luciano Höher, Kaiyuan Yang, Zengming Lin, Zhiwei Zhai, Mengzhe Shen, Dejin Xun, Yi Wang, Tony Xu, Maged Goubran, Yunheng Wu, Ali Erturk, Johannes C. Paetzold
Affiliations: Institute for Tissue Engineering and Regenerative Medicine, Helmholtz Center Munich, German Research Center for Environmental Health, Neuherberg, Germany; TUM School of Computation, Information and Technology (CIT), Technical University of Munich, Munich, Germany; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Shanghai University of Finance and Economics, Shanghai, China; BGI Research, Shenzhen, China; National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, China; Sunnybrook Research Institute, University of Toronto, Toronto, Canada; Graduate School of Informatics, Nagoya University, Nagoya, Japan; Department of Computing, Imperial College London, United Kingdom
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 1st version

Abstract:Recent innovations in light sheet microscopy, paired with developments in tissue clearing techniques, enable the 3D imaging of large mammalian tissues with cellular resolution. Combined with the progress in large-scale data analysis, driven by deep learning, these innovations empower researchers to rapidly investigate the morphological and functional properties of diverse biological samples. Segmentation, a crucial preliminary step in the analysis process, can be automated using domain-specific deep learning models with expert-level performance. However, these models exhibit high sensitivity to domain shifts, leading to a significant drop in accuracy when applied to data outside their training distribution. To address this limitation, and inspired by the recent success of self-supervised learning in training generalizable models, we organized the SELMA3D Challenge during the MICCAI 2024 conference. SELMA3D provides a vast collection of light-sheet images from cleared mice and human brains, comprising 35 large 3D images-each with over 1000^3 voxels-and 315 annotated small patches for finetuning, preliminary testing and final testing. The dataset encompasses diverse biological structures, including vessel-like and spot-like structures. Five teams participated in all phases of the challenge, and their proposed methods are reviewed in this paper. Quantitative and qualitative results from most participating teams demonstrate that self-supervised learning on large datasets improves segmentation model performance and generalization. We will continue to support and extend SELMA3D as an inaugural MICCAI challenge focused on self-supervised learning for 3D microscopy image segmentation.

[CV-70] Semise: Semi-supervised learning for severity representation in medical image

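[Quick Read]: This paper tackles data scarcity in medical image analysis, in particular how to improve feature extraction when labeled data is limited. The key to the solution is SEMISE, a new method that combines self-supervised learning and supervised learning, using both labeled and augmented data to strengthen the encoder's ability to extract meaningful features. The combined approach yields more informative representations and improves downstream performance, with a 12% gain in classification and a 3% gain in segmentation. These results suggest that SEMISE has considerable potential in medical image analysis and can provide more accurate solutions for healthcare applications.

Link: https://arxiv.org/abs/2501.03848
Authors: Dung T. Tran, Hung Vu, Anh Tran, Hieu Pham, Hong Nguyen, Phong Nguyen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI)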

Abstract: This paper introduces SEMISE, a novel method for representation learning in medical imaging that combines self-supervised and supervised learning. By leveraging both labeled and augmented data, SEMISE addresses the challenge of data scarcity and enhances the encoder’s ability to extract meaningful features. This integrated approach leads to more informative representations, improving performance on downstream tasks. As a result, our approach achieved a 12% improvement in classification and a 3% improvement in segmentation, outperforming existing methods. These results demonstrate the potential of SEMISE to advance medical image analysis and offer more accurate solutions for healthcare applications, particularly in contexts where labeled data is limited.

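[CV-71] MedFocusCLIP: Improving few shot classification in medical datasets using pixel wise attention

[Quick Read]: This paper addresses the weak performance of existing Visual Prompt Tuning methods on fine-grained visual classification tasks such as medical image classification. Because these tasks exhibit large inter-class variance and small intra-class variance, conventional prompt tuning struggles to steer the model toward the key regions of an image. The paper proposes using the advanced segmentation capability of Segment Anything Model 2 (SAM2) as a visual prompting cue that helps the CLIP (Contrastive Language-Image Pretraining) visual encoder focus on relevant image regions. In few-shot, fine-grained classification settings the model can then concentrate on highly discriminative regions instead of being distracted by visually similar background features. Experiments show that the method significantly improves classification accuracy on several medical datasets (X-ray, CT, and MRI images) and provides interpretable explanations of classification performance.

Link: https://arxiv.org/abs/2501.03839
Authors: Aadya Arora, Vinay Namboodiri
Affiliations: Indian Institute Of Technology Gandhinagar; University Of Bath
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: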

Abstract: With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach to leveraging pretrained models for downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn an additional prompt to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where there is large inter-class variance and small intra-class variance. Hence, in this paper we propose to leverage the advanced segmentation capabilities of Segment Anything Model 2 (SAM2) as a visual prompting cue that helps the CLIP (Contrastive Language-Image Pretraining) visual encoder by guiding its attention to relevant regions in the image. This helps the model focus on highly discriminative regions without being distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report accuracies of (71%, 81%, 86%, 58%) for the proposed approach on the (COVID, lung-disease, brain-tumor, breast-cancer) datasets, against (66%, 70%, 68%, 29%) for a pretrained CLIP model after few-shot training. The proposed approach also yields interpretable explanations of the classification performance through the localization obtained from segmentation.

[CV-72] SCC-YOLO: An Improved Object Detector for Assisting in Brain Tumor Diagnosis

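[Quick Read]: This paper aims to improve the accuracy of brain tumor detection, specifically how to identify brain tumors in medical images more effectively. The key to the solution is a new SCC-YOLO architecture that integrates the SCConv (Spatial and Channel reconstruction Convolution) attention mechanism into YOLOv9 to reduce spatial and channel redundancy among features and thereby strengthen the learning of image features. Experiments show that SCC-YOLO improves mAP50 (mean Average Precision at an IoU threshold of 0.5) over YOLOv9 by 0.3% on the Br35H dataset and by 0.5% on the self-made Brain_Tumor_Dataset, reaching state-of-the-art performance in brain tumor detection.

Link: https://arxiv.org/abs/2501.03836
Authors: Runci Bai
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: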

Abstract: Brain tumors can result in neurological dysfunction, alterations in cognitive and psychological states, increased intracranial pressure, and the occurrence of seizures, thereby presenting a substantial risk to human life and health. The You Only Look Once (YOLO) series models have demonstrated superior accuracy in object detection for medical imaging. In this paper, we develop a novel SCC-YOLO architecture by integrating the SCConv attention mechanism into YOLOv9. The SCConv module reconstructs an efficient convolutional module by reducing spatial and channel redundancy among features, thereby enhancing the learning of image features. We investigate the impact of integrating different attention mechanisms with the YOLOv9 model on brain tumor image detection using both the Br35H dataset and our self-made dataset (Brain_Tumor_Dataset). Experimental results show that on the Br35H dataset, SCC-YOLO achieved a 0.3% improvement in mAP50 compared to YOLOv9, while on our self-made dataset, SCC-YOLO exhibited a 0.5% improvement over YOLOv9. SCC-YOLO has reached state-of-the-art performance in brain tumor detection. Source code is available at: this https URL
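
The mAP50 figures above count a detection as correct when its box overlaps the ground-truth box with an Intersection over Union (IoU) of at least 0.5. As a minimal sketch of that matching criterion (the function names `iou` and `is_true_positive` and the `(x1, y1, x2, y2)` box layout are illustrative assumptions, not taken from the SCC-YOLO code), the check could look like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(predicted_box, truth_box, threshold=0.5):
    """A detection counts toward mAP50 when IoU >= 0.5 (hypothetical helper)."""
    return iou(predicted_box, truth_box) >= threshold
```

Average precision is then computed from the ranked list of such true and false positives; that step is omitted here.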

[CV-73] Deep Sylvester Posterior Inference for Adaptive Compressed Sensing in Ultrasound Imaging

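[Quick Read]: This paper aims to reduce the number of scan lines in ultrasound imaging in order to raise frame rate, widen the field of view, and improve energy efficiency and data transfer speed. Existing methods typically use static subsampling schemes combined with sparsity-based or deep-learning-based recovery. The paper proposes an adaptive subsampling method that maximizes intrinsic information gain by using a Sylvester Normalizing Flow encoder to infer, in real time, an approximate Bayesian posterior under partial observation. Using the Bayesian posterior and a deep generative model for future observations, it selects the subsampling scheme that maximizes the mutual information between the subsampled observations and the next video frame. Evaluated on the EchoNet cardiac ultrasound video dataset, the method outperforms uniform and variable-density random sampling as well as equidistant scan lines, lowering mean absolute reconstruction error by 15%. Posterior inference and sampling-scheme generation take only 0.015 seconds (about 66 Hz), which is fast enough for real-time 2D ultrasound imaging.

Link: https://arxiv.org/abs/2501.03825
Authors: Simon W. Penninga, Hans van Gorp, Ruud J.G. van Sloun
Affiliations: Eindhoven University of Technology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: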

Abstract:Ultrasound images are commonly formed by sequential acquisition of beam-steered scan-lines. Minimizing the number of required scan-lines can significantly enhance frame rate, field of view, energy efficiency, and data transfer speeds. Existing approaches typically use static subsampling schemes in combination with sparsity-based or, more recently, deep-learning-based recovery. In this work, we introduce an adaptive subsampling method that maximizes intrinsic information gain in-situ, employing a Sylvester Normalizing Flow encoder to infer an approximate Bayesian posterior under partial observation in real-time. Using the Bayesian posterior and a deep generative model for future observations, we determine the subsampling scheme that maximizes the mutual information between the subsampled observations, and the next frame of the video. We evaluate our approach using the EchoNet cardiac ultrasound video dataset and demonstrate that our active sampling method outperforms competitive baselines, including uniform and variable-density random sampling, as well as equidistantly spaced scan-lines, improving mean absolute reconstruction error by 15%. Moreover, posterior inference and the sampling scheme generation are performed in just 0.015 seconds (66Hz), making it fast enough for real-time 2D ultrasound imaging applications.
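
To make the idea of choosing scan lines by information gain concrete, here is a deliberately simplified sketch. It does not implement the paper's mutual-information objective or its Sylvester flow. Instead it assumes that a handful of posterior samples of the next frame are already available and uses per-line variance across those samples as a stand-in for information gain. The function name `select_scan_lines` and the nested-list data layout are hypothetical.

```python
from statistics import pvariance


def select_scan_lines(posterior_samples, budget):
    """Pick the `budget` scan lines the posterior is least certain about.

    posterior_samples: list of sampled frames; each frame is a list of scan
    lines; each scan line is a list of pixel values.
    Variance across samples is used as a proxy for information gain
    (an assumption of this sketch, not the paper's criterion).
    """
    n_lines = len(posterior_samples[0])
    scores = []
    for line in range(n_lines):
        n_pixels = len(posterior_samples[0][line])
        # Sum the per-pixel variance over the whole scan line.
        score = sum(
            pvariance([sample[line][p] for sample in posterior_samples])
            for p in range(n_pixels)
        )
        scores.append(score)
    ranked = sorted(range(n_lines), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Three posterior samples of a frame with three scan lines of two pixels each.
samples = [
    [[0, 0], [1, 1], [5, 5]],
    [[0, 0], [3, 3], [5, 5]],
    [[0, 0], [2, 2], [5, 6]],
]
chosen = select_scan_lines(samples, budget=2)
```

Lines on which the samples agree carry little new information, so they are skipped; the samples disagree most on line 1, which is therefore acquired first.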

[CV-74] Re-Visible Dual-Domain Self-Supervised Deep Unfolding Network for MRI Reconstruction

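[Quick Read]: This paper addresses the long acquisition time that limits magnetic resonance imaging (MRI) in clinical use. Deep learning methods have been proposed to accelerate acquisition and perform well, but they rely on high-quality fully sampled datasets for supervised training, and such datasets are time-consuming and expensive to collect, which limits wider adoption. Self-supervised methods offer an alternative by learning from under-sampled data alone, but most existing methods feed further-partitioned under-sampled k-space data to the model, which discards valuable information, and they do not fully incorporate image priors, which degrades reconstruction quality.

The proposed solution has three key parts: 1) a novel re-visible dual-domain self-supervised deep unfolding network whose re-visible dual-domain loss uses all under-sampled k-space data during training, reducing the information loss caused by further partitioning; 2) a deep unfolding network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) that performs end-to-end reconstruction and uses imaging physics and image priors to guide the reconstruction; 3) a Spatial-Frequency Feature Extraction (SFFE) block that captures global and local feature representations, strengthening the model's ability to learn comprehensive image priors. Experiments show that the method significantly outperforms state-of-the-art approaches in reconstruction performance.

Link: https://arxiv.org/abs/2501.03737
Authors: Hao Zhang, Qi Wang, Jian Sun, Zhijie Wen, Jun Shi, Shihui Ying
Affiliations: Department of Mathematics, School of Science, Shanghai University; Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University; School of Mathematics and Statistics, Xi’an Jiaotong University; School of Communication and Information Engineering, Shanghai University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: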

Abstract: Magnetic Resonance Imaging (MRI) is widely used in clinical practice, but suffers from prolonged acquisition time. Although deep learning methods have been proposed to accelerate acquisition and demonstrate promising performance, they rely on high-quality fully-sampled datasets for training in a supervised manner. However, such datasets are time-consuming and expensive to collect, which constrains their broader applications. On the other hand, self-supervised methods offer an alternative by enabling learning from under-sampled data alone, but most existing methods rely on further partitioned under-sampled k-space data as the model’s input for training, resulting in a loss of valuable information. Additionally, their models have not fully incorporated image priors, leading to degraded reconstruction performance. In this paper, we propose a novel re-visible dual-domain self-supervised deep unfolding network to address these issues when only under-sampled datasets are available. Specifically, by incorporating a re-visible dual-domain loss, all under-sampled k-space data are utilized during training to mitigate information loss caused by further partitioning. This design enables the model to implicitly adapt to all under-sampled k-space data as input. Additionally, we design a deep unfolding network based on the Chambolle and Pock Proximal Point Algorithm (DUN-CP-PPA) to achieve end-to-end reconstruction, incorporating imaging physics and image priors to guide the reconstruction process. By employing a Spatial-Frequency Feature Extraction (SFFE) block to capture global and local feature representation, we enhance the model’s ability to learn comprehensive image priors. Experiments conducted on the fastMRI and IXI datasets demonstrate that our method significantly outperforms state-of-the-art approaches in terms of reconstruction performance.
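
The abstract contrasts the proposed loss with self-supervised schemes that further partition the acquired k-space samples. A rough sketch of such a partition shows why the network input then sees only part of the acquired data. It assumes samples are identified by k-space line index and split at random; the function `partition_sampled_lines` and its `input_fraction` parameter are hypothetical.

```python
import random


def partition_sampled_lines(sampled_lines, input_fraction=0.6, seed=0):
    """Split acquired k-space lines into a network-input set and a loss set.

    This mimics the conventional partition-based self-supervision that the
    paper argues loses information; it is not the paper's own loss.
    """
    rng = random.Random(seed)
    shuffled = list(sampled_lines)
    rng.shuffle(shuffled)
    n_input = int(round(input_fraction * len(shuffled)))
    input_lines = sorted(shuffled[:n_input])   # fed to the network
    loss_lines = sorted(shuffled[n_input:])    # held out for the loss only
    return input_lines, loss_lines


acquired = [0, 3, 5, 8, 9, 10, 11, 14, 17, 20]
input_lines, loss_lines = partition_sampled_lines(acquired)
```

With these settings only 6 of the 10 acquired lines ever reach the network input, which is the loss of information the re-visible dual-domain loss is meant to avoid.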

[CV-75] A Value Mapping Virtual Staining Framework for Large-scale Histological Imaging

[Quick Read]: This paper targets two main problems of virtual staining in large-scale image processing: boundary inconsistency and artifacts, and the complexity of adjusting loss functions and hyperparameters when converting between staining modalities. The authors propose a general virtual staining framework called the Value Mapping Generative Adversarial Network (VM-GAN). It introduces a loss function based on a value mapping constraint to keep virtual staining accurate across pathological modalities. The authors also propose a confidence-based tiling method to handle the boundary inconsistency caused by patch-wise processing. Experiments on datasets with different staining protocols show superior quantitative metrics and visual quality.

Link: https://arxiv.org/abs/2501.03592
Authors: Junjia Wang, Bo Xiong, You Zhou, Xun Cao, Zhan Ma
Affiliations: School of Electronic Science and Engineering, Nanjing University; National Engineering Research Center of Visual Technology, Peking University; School of Computer Science, Peking University; Medical School, Nanjing University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract:The emergence of virtual staining technology provides a rapid and efficient alternative for researchers in tissue pathology. It enables the utilization of unlabeled microscopic samples to generate virtual replicas of chemically stained histological slices, or facilitate the transformation of one staining type into another. The remarkable performance of generative networks, such as CycleGAN, offers an unsupervised learning approach for virtual coloring, overcoming the limitations of high-quality paired data required in supervised learning. Nevertheless, large-scale color transformation necessitates processing large field-of-view images in patches, often resulting in significant boundary inconsistency and artifacts. Additionally, the transformation between different colorized modalities typically needs further efforts to modify loss functions and tune hyperparameters for independent training of networks. In this study, we introduce a general virtual staining framework that is adaptable to various conditions. We propose a loss function based on the value mapping constraint to ensure the accuracy of virtual coloring between different pathological modalities, termed the Value Mapping Generative Adversarial Network (VM-GAN). Meanwhile, we present a confidence-based tiling method to address the challenge of boundary inconsistency arising from patch-wise processing. Experimental results on diverse data with varying staining protocols demonstrate that our method achieves superior quantitative indicators and improved visual perception.
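
The abstract does not spell out how the confidence-based tiling works. One common way to reduce seams between overlapping patches is to blend them with weights that are high at the patch centre and low at its edges, which the following one-dimensional sketch illustrates. The triangular weighting and the names `triangular_confidence` and `blend_tiles` are assumptions for illustration, not the paper's definition of confidence.

```python
def triangular_confidence(tile_len):
    """Weights that peak at the tile centre and fall off towards its edges."""
    centre = (tile_len - 1) / 2
    return [1.0 - abs(i - centre) / (centre + 1) for i in range(tile_len)]


def blend_tiles(tiles, starts, total_len):
    """Confidence-weighted average of overlapping 1D tiles.

    tiles: list of per-tile outputs; starts: offset of each tile in the
    full signal; total_len: length of the stitched result.
    """
    weighted_sum = [0.0] * total_len
    weight_total = [0.0] * total_len
    for tile, start in zip(tiles, starts):
        for i, (value, w) in enumerate(zip(tile, triangular_confidence(len(tile)))):
            weighted_sum[start + i] += w * value
            weight_total[start + i] += w
    return [s / w if w > 0 else 0.0 for s, w in zip(weighted_sum, weight_total)]


# Two tiles with different constant outputs that overlap on positions 2 and 3.
stitched = blend_tiles([[1, 1, 1, 1], [3, 3, 3, 3]], starts=[0, 2], total_len=6)
```

In the overlap the result moves gradually from the first tile's value to the second's instead of jumping at a hard boundary.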

[CV-76] Enhanced Tuberculosis Bacilli Detection using Attention-Residual U-Net and Ensemble Classification

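[Quick Read]: This paper addresses the low automation, inadequate segmentation performance, and limited classification accuracy of detecting tuberculosis bacilli in bright-field microscopic sputum smear images for tuberculosis (TB) diagnosis. The key to the solution is an efficient hybrid approach that combines deep learning for image segmentation with an ensemble model for classification. Specifically, an enhanced U-Net with attention blocks and residual connections precisely segments the sputum smear images so that Regions of Interest (ROIs) can be extracted. The ROIs are then classified by an ensemble of Support Vector Machine (SVM), Random Forest, and Extreme Gradient Boost (XGBoost) classifiers to identify bacilli accurately. Experiments show that the method outperforms existing methods in segmentation performance, classification accuracy, and degree of automation.

Link: https://arxiv.org/abs/2501.03539
Authors: Greeshma K, Vishnukumar S
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: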

Abstract:Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a critical global health issue, necessitating timely diagnosis and treatment. Current methods for detecting tuberculosis bacilli from bright field microscopic sputum smear images suffer from low automation, inadequate segmentation performance, and limited classification accuracy. This paper proposes an efficient hybrid approach that combines deep learning for segmentation and an ensemble model for classification. An enhanced U-Net model incorporating attention blocks and residual connections is introduced to precisely segment microscopic sputum smear images, facilitating the extraction of Regions of Interest (ROIs). These ROIs are subsequently classified using an ensemble classifier comprising Support Vector Machine (SVM), Random Forest, and Extreme Gradient Boost (XGBoost), resulting in an accurate identification of bacilli within the images. Experiments conducted on a newly created dataset, along with public datasets, demonstrate that the proposed model achieves superior segmentation performance, higher classification accuracy, and enhanced automation compared to existing methods.
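
The abstract names the three ensemble members but not how their outputs are combined. A minimal sketch, assuming hard majority voting over per-ROI labels, is shown below. The label lists stand in for the outputs of trained SVM, Random Forest, and XGBoost models, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(per_model_predictions):
    """Combine label predictions from several classifiers by majority vote.

    per_model_predictions: one list of labels per model, all the same length
    (one label per ROI). Voting is an assumption of this sketch.
    """
    n_samples = len(per_model_predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in per_model_predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Placeholder outputs for four ROIs (1 = bacillus, 0 = background).
svm_labels = [1, 0, 1, 0]
forest_labels = [1, 1, 0, 0]
xgboost_labels = [0, 1, 1, 0]
final_labels = majority_vote([svm_labels, forest_labels, xgboost_labels])
```

With three binary voters there are no ties, so each ROI receives the label chosen by at least two models.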

[CV-77] Efficient and Accurate Tuberculosis Diagnosis: Attention Residual U-Net and Vision Transformer Based Detection Framework

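[Quick Read]: This paper addresses the limited automation, inconsistent segmentation quality, and constrained classification precision of microscopy-based tuberculosis (TB) diagnosis. TB is an infectious disease caused by Mycobacterium tuberculosis; although preventable and curable, it remains a major global health threat in low- and middle-income countries. Microscopy offers a cost-effective route to early detection and treatment by directly visualizing Mycobacterium tuberculosis in sputum smear samples. However, the process is labor-intensive and weakly automated, which hurts diagnostic efficiency and reliability.

The proposed solution has two key steps. First, an improved U-Net with attention blocks and residual connections segments microscopic sputum smear images and extracts Regions of Interest (ROIs). Second, a customized Vision Transformer, TBViT, classifies the extracted ROIs to detect bacilli more precisely. Experiments show that the method significantly outperforms existing methods in segmentation performance, classification accuracy, and level of automation.

Link: https://arxiv.org/abs/2501.03538
Authors: Greeshma K, Vishnukumar S
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: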

Abstract: Tuberculosis (TB), an infectious disease caused by Mycobacterium tuberculosis, continues to be a major global health threat despite being preventable and curable. This burden is particularly high in low- and middle-income countries. Microscopy remains essential for diagnosing TB by enabling direct visualization of Mycobacterium tuberculosis in sputum smear samples, offering a cost-effective approach for early detection and effective treatment. Given the labour-intensive nature of microscopy, automating the detection of bacilli in microscopic images is crucial to improve both the expediency and reliability of TB diagnosis. The current methodologies for detecting tuberculosis bacilli in bright field microscopic sputum smear images are hindered by limited automation capabilities, inconsistent segmentation quality, and constrained classification precision. This paper proposes a two-stage deep learning methodology for tuberculosis bacilli detection, comprising bacilli segmentation followed by classification. In the initial phase, an advanced U-Net model employing attention blocks and residual connections is proposed to segment microscopic sputum smear images, enabling the extraction of Regions of Interest (ROIs). The extracted ROIs are then classified using a Vision Transformer, which we specifically customized as TBViT to enhance the precise detection of bacilli within the images. For the experiments, a newly developed dataset of microscopic sputum smear images derived from Ziehl-Neelsen-stained slides is used in conjunction with existing public datasets. The qualitative and quantitative evaluation of the experiments using various metrics demonstrates that the proposed model achieves significantly improved segmentation performance, higher classification accuracy, and a greater level of automation, surpassing existing methods.

[CV-78] FgC2F-UDiff: Frequency-guided and Coarse-to-fine Unified Diffusion Model for Multi-modality Missing MRI Synthesis

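[Quick Read]: This paper addresses missing modalities in multi-modality magnetic resonance imaging (MRI) for brain tumor diagnosis and treatment, which arise from scan-time limits, scan corruption, artifacts, motion, and contrast agent intolerance. It proposes a novel unified synthesis model, the Frequency-guided and Coarse-to-fine Unified Diffusion Model (FgC2F-UDiff). The solution has three key components. First, the Coarse-to-fine Unified Network (CUN) splits denoising into a coarse stage and a fine stage, exploiting the iterative denoising of diffusion models to refine synthesized images from global structure to detail. Second, the Frequency-guided Collaborative Strategy (FCS) uses appropriate frequency information as prior knowledge to guide learning of the highly non-linear mapping. Third, the Specific-acceleration Hybrid Mechanism (SHM) integrates specific mechanisms to accelerate the diffusion model and make many-to-many synthesis more feasible. Experiments on two datasets show superior performance, validated with metrics including PSNR, SSIM, LPIPS, and FID.

Link: https://arxiv.org/abs/2501.03526
Authors: Xiaojiao Xiao, Qinmin Vivian Hu, Guanghui Wang
Affiliations: IEEE Publication Technology Group
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: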

Abstract: Multi-modality magnetic resonance imaging (MRI) is essential for the diagnosis and treatment of brain tumors. However, missing modalities are commonly observed due to limitations in scan time, scan corruption, artifacts, motion, and contrast agent intolerance. Synthesis of missing MRI has been a means to address the limitations of modality insufficiency in clinical practice and research. However, there are still some challenges, such as poor generalization, inaccurate non-linear mapping, and slow processing speeds. To address the aforementioned issues, we propose a novel unified synthesis model, the Frequency-guided and Coarse-to-fine Unified Diffusion Model (FgC2F-UDiff), designed for multiple inputs and outputs. First, the Coarse-to-fine Unified Network (CUN) fully exploits the iterative denoising properties of diffusion models, from global to detail, by dividing the denoising process into two stages, coarse and fine, to enhance the fidelity of synthesized images. Second, the Frequency-guided Collaborative Strategy (FCS) harnesses appropriate frequency information as prior knowledge to guide the learning of a unified, highly non-linear mapping. Third, the Specific-acceleration Hybrid Mechanism (SHM) integrates specific mechanisms to accelerate the diffusion model and enhance the feasibility of many-to-many synthesis. Extensive experimental evaluations have demonstrated that our proposed FgC2F-UDiff model achieves superior performance on two datasets, validated through a comprehensive assessment that includes both qualitative observations and quantitative metrics, such as PSNR, SSIM, LPIPS, and FID.
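
Of the metrics listed, PSNR is the simplest to state: it is ten times the base-10 logarithm of the squared peak intensity divided by the mean squared error between the reference and the synthesized image. A small sketch on flattened pixel lists follows; the function name `psnr` and the default intensity range of [0, 1] are assumptions for illustration.

```python
import math


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


reference_pixels = [0.0, 0.5, 1.0]
synthesized_pixels = [0.1, 0.5, 0.9]
score = psnr(reference_pixels, synthesized_pixels)
```

Higher values mean the synthesized modality is closer to the reference. SSIM, LPIPS, and FID capture structural and perceptual similarity and need dedicated implementations.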

[CV-79] Salient Region Matching for Fully Automated MR-TRUS Registration

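[Quick Read]: This paper aims to automate the registration of magnetic resonance (MR) and transrectal ultrasound (TRUS) images for prostate cancer diagnosis, in order to improve the accuracy of targeted biopsy. The key to the solution is a salient region matching framework with three main steps: prostate segmentation, rigid alignment, and deformable registration. First, two segmentation networks segment the prostate in the MR and TRUS images respectively, and the predicted salient regions drive the rigid alignment. The rigidly aligned MR and TRUS images then initialize the deformable registration. The deformable registration network uses a dual-stream encoder with cross-modal spatial attention modules to promote multi-modality feature learning, and a salient region matching loss that accounts for both structure and intensity similarity within the prostate region. Experiments on a public MR-TRUS dataset show registration results that outperform several cutting-edge methods.

Link: https://arxiv.org/abs/2501.03510
Authors: Zetian Feng, Dong Ni, Yi Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: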

Abstract:Prostate cancer is a leading cause of cancer-related mortality in men. The registration of magnetic resonance (MR) and transrectal ultrasound (TRUS) can provide guidance for the targeted biopsy of prostate cancer. In this study, we propose a salient region matching framework for fully automated MR-TRUS registration. The framework consists of prostate segmentation, rigid alignment and deformable registration. Prostate segmentation is performed using two segmentation networks on MR and TRUS respectively, and the predicted salient regions are used for the rigid alignment. The rigidly-aligned MR and TRUS images serve as initialization for the deformable registration. The deformable registration network has a dual-stream encoder with cross-modal spatial attention modules to facilitate multi-modality feature learning, and a salient region matching loss to consider both structure and intensity similarity within the prostate region. Experiments on a public MR-TRUS dataset demonstrate that our method achieves satisfactory registration results, outperforming several cutting-edge methods. The code is publicly available at this https URL.
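
The abstract says the predicted salient regions are used for rigid alignment but does not describe the procedure. As a simplified illustration that covers translation only and ignores rotation, two binary prostate masks can be brought into rough correspondence by matching their centroids. The helper names `centroid` and `translation_between` and the 2D list-of-lists mask format are assumptions of this sketch.

```python
def centroid(mask):
    """Centroid (row, col) of the foreground pixels in a binary 2D mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def translation_between(moving_mask, fixed_mask):
    """Shift (d_row, d_col) that moves the moving mask's centroid onto the fixed one."""
    moving_r, moving_c = centroid(moving_mask)
    fixed_r, fixed_c = centroid(fixed_mask)
    return (fixed_r - moving_r, fixed_c - moving_c)


# Toy masks standing in for the segmented prostate in MR (moving) and TRUS (fixed).
mr_mask = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]
trus_mask = [[0, 0, 0], [0, 0, 0], [0, 1, 1]]
shift = translation_between(mr_mask, trus_mask)
```

A full rigid alignment would also estimate rotation, and the deformable stage then refines the result.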

[CV-80] DGSSA: Domain generalization with structural and stylistic augmentation for retinal vessel segmentation

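[Quick Read]: This paper addresses domain shift in retinal vessel image segmentation caused by differences in imaging devices and patient demographics, which makes traditional methods perform poorly on unseen domains. The key to the solution is DGSSA, a new method that improves generalization by combining structural and stylistic augmentation. Specifically, a space colonization algorithm generates diverse vessel-like structures, from which an improved Pix2Pix model generates pseudo-retinal images, letting the segmentation model learn a broader distribution of structures. In addition, PixMix provides random photometric augmentation and uncertainty perturbations are introduced, which enriches stylistic diversity and markedly improves adaptability to different imaging conditions. The method was rigorously evaluated on four challenging datasets (DRIVE, CHASEDB, HRF, and STARE) and achieves state-of-the-art performance exceeding existing methods, supporting its potential for clinical use in automated retinal vessel analysis.

Link: https://arxiv.org/abs/2501.03466
Authors: Bo Liu, Yudong Zhang, Shuihua Wang, Siyue Li, Jin Hong
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: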

Abstract: Retinal vascular morphology is crucial for diagnosing diseases such as diabetes, glaucoma, and hypertension, making accurate segmentation of retinal vessels essential for early intervention. Traditional segmentation methods assume that training and testing data share similar distributions, which can lead to poor performance on unseen domains due to domain shifts caused by variations in imaging devices and patient demographics. This paper presents a novel approach, DGSSA, for retinal vessel image segmentation that enhances model generalization by combining structural and style augmentation strategies. We utilize a space colonization algorithm to generate diverse vascular-like structures that closely mimic actual retinal vessels, which are then used to generate pseudo-retinal images with an improved Pix2Pix model, allowing the segmentation model to learn a broader range of structure distributions. Additionally, we utilize PixMix to implement random photometric augmentations and introduce uncertainty perturbations, thereby enriching stylistic diversity and significantly enhancing the model’s adaptability to varying imaging conditions. Our framework has been rigorously evaluated on four challenging datasets (DRIVE, CHASEDB, HRF, and STARE), demonstrating state-of-the-art performance that surpasses existing methods. This validates the effectiveness of our proposed approach, highlighting its potential for clinical application in automated retinal vessel analysis.

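[CV-81] Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation

[Quick Read]: This paper addresses medical report generation from X-ray images, where existing large language models fail to fully exploit the useful information in visual image regions and so produce reports that are fluent but do not accurately describe key diseases. It proposes a novel associative memory-enhanced X-ray report generation model that mimics how professional doctors write medical reports: it mines both global and local visual information and draws on historical report information to complete the current report. The key steps are as follows. First, a classification model and its activation maps mine the visual regions most associated with diseases and learn disease query tokens. Second, a visual Hopfield network builds memory associations for disease-related tokens, and a report Hopfield network retrieves report memory information. A large language model then generates high-quality reports, achieving state-of-the-art performance on multiple benchmark datasets.

Link: https://arxiv.org/abs/2501.03458
Authors: Xiao Wang, Fuling Wang, Haowen Wang, Bo Jiang, Chuanfu Li, Yaowei Wang, Yonghong Tian, Jin Tang
Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei 230601, China; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, Hefei 230601, China; School of Computer Science and Technology, Anhui University, Hefei 230601, China; First Affiliated Hospital of Anhui University of Chinese Medicine, Hefei 230022, China; Peng Cheng Laboratory, Shenzhen, China; Harbin Institute of Technology, Shenzhen, China; National Engineering Laboratory for Video Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: In Peer Review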

Abstract: X-ray image based medical report generation has achieved significant progress in recent years with the help of large language models. However, these models have not fully exploited the effective information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. It considers both the mining of global and local visual information and associates historical report information to better complete the writing of the current report. Specifically, given an X-ray image, we first utilize a classification model along with its activation maps to accomplish the mining of visual regions highly associated with diseases and the learning of disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The source code of this work is released on this https URL.

[CV-82] A Self-supervised Diffusion Bridge for MRI Reconstruction

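[Quick Read]: This paper addresses the reliance of traditional diffusion bridges (DBs) on high-quality reference images in image reconstruction, which limits their use where such references are unavailable. It proposes SelfDB, a self-supervised method that trains directly on the available noisy measurements without any high-quality reference images. The key idea is to sub-sample the available measurements two additional times and train a neural network to reverse the corresponding degradation process, using the available measurements as training targets. Validation on compressed sensing MRI shows that SelfDB outperforms denoising diffusion models.

Link: https://arxiv.org/abs/2501.03430
Authors: Harry Gao, Weijie Gan, Yuyang Hu, Hongyu An, Ulugbek S. Kamilov
Affiliations: Washington University in St. Louis, MO, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: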

Abstract:Diffusion bridges (DBs) are a class of diffusion models that enable faster sampling by interpolating between two paired image distributions. Training traditional DBs for image reconstruction requires high-quality reference images, which limits their applicability to settings where such references are unavailable. We propose SelfDB as a novel self-supervised method for training DBs directly on available noisy measurements without any high-quality reference images. SelfDB formulates the diffusion process by further sub-sampling the available measurements two additional times and training a neural network to reverse the corresponding degradation process by using the available measurements as the training targets. We validate SelfDB on compressed sensing MRI, showing its superior performance compared to the denoising diffusion models.
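
SelfDB forms its diffusion process by sub-sampling the available measurements two further times. A minimal sketch of producing such nested subsets is given below. It assumes the measurements are indexed by k-space line and that each stage keeps a random fraction of the previous one. The function `nested_subsample` and its `keep_fraction` parameter are hypothetical, and the abstract does not specify the paper's actual sampling pattern.

```python
import random


def nested_subsample(measured_lines, keep_fraction=0.5, stages=2, seed=0):
    """Return [all measured lines, first sub-sample, second sub-sample, ...].

    Each level is a random subset of the level before it, so the levels form
    a chain of increasingly degraded versions of the same acquisition.
    """
    rng = random.Random(seed)
    levels = [sorted(measured_lines)]
    for _ in range(stages):
        previous = levels[-1]
        n_keep = max(1, int(len(previous) * keep_fraction))
        levels.append(sorted(rng.sample(previous, n_keep)))
    return levels


# 16 acquired lines, sub-sampled twice.
levels = nested_subsample(range(0, 32, 2))
```

Under this reading, `levels[0]` plays the role of the training target, while `levels[1]` and `levels[2]` are the progressively degraded inputs the network learns to reverse.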

[CV-83] Quantum Feature-Empowered Deep Classification for Fast Mangrove Mapping

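[Quick Read]: This paper addresses mangrove mapping (MM), a classification task in environmental monitoring. Traditional index-based methods treat pixels as spatially independent; convolutional neural networks (CNNs) improve classification by exploiting spatial continuity, but still leave room for improvement. The paper proposes introducing quantum features to further strengthen CNN classification. Specifically, a CNN computes affine-mapping features, whereas a quantum neural network (QNN) provides unitary-computing features, offering a new perspective for the final decision (classification). The paper designs an entangled spatial-spectral quantum feature extraction module and, to ensure that the quantum features contribute genuinely new information (unaffected by conventional CNN features), uses a separate network track made up solely of quantum neurons. The extracted pure quantum information is then fused with the conventional features to make the final decision. The key to the solution is the proposed quantum-empowered deep network (QEDNet), whose lightweight design ensures that the improvement comes from the cooperation of CNN and QNN rather than from added parameters.

Link: https://arxiv.org/abs/2501.03360
Authors: Chia-Hsiang Lin, Po-Wei Tang, Alfredo R. Huete
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This work has been accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS)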

Abstract:A mangrove mapping (MM) algorithm is an essential classification tool for environmental monitoring. The recent literature shows that compared with other index-based MM methods that treat pixels as spatially independent, convolutional neural networks (CNNs) are crucial for leveraging spatial continuity information, leading to improved classification performance. In this work, we go a step further to show that quantum features provide radically new information for CNN to further upgrade the classification results. Simply speaking, CNN computes affine-mapping features, while quantum neural network (QNN) offers unitary-computing features, thereby offering a fresh perspective in the final decision-making (classification). To address the challenging MM problem, we design an entangled spatial-spectral quantum feature extraction module. Notably, to ensure that the quantum features contribute genuinely novel information (unaffected by traditional CNN features), we design a separate network track consisting solely of quantum neurons with built-in interpretability. The extracted pure quantum information is then fused with traditional feature information to jointly make the final decision. The proposed quantum-empowered deep network (QEDNet) is very lightweight, so the improvement does come from the cooperation between CNN and QNN (rather than parameter augmentation). Extensive experiments will be conducted to demonstrate the superiority of QEDNet.

Artificial Intelligence

[AI-0] Synthetic Data Privacy Metrics

Link: https://arxiv.org/abs/2501.03941
Authors: Amy Steier, Lipika Ramaswamy, Andre Manoel, Alexa Haushalter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures

Abstract:Recent advancements in generative AI have made it possible to create synthetic datasets that can be as accurate as real-world data for training AI models, powering statistical insights, and fostering collaboration with sensitive datasets while offering strong privacy guarantees. Effectively measuring the empirical privacy of synthetic data is an important step in the process. However, while there is a multitude of new privacy metrics being published every day, there currently is no standardization. In this paper, we review the pros and cons of popular metrics that include simulations of adversarial attacks. We also review current best practices for amending generative models to enhance the privacy of the data they create (e.g. differential privacy).

[AI-1] Exploring the Potential of Large Language Models in Public Transportation: San Antonio Case Study AAAI2025

Link: https://arxiv.org/abs/2501.03904
Authors: Ramya Jonnala, Gongbo Liang, Jeong Yang, Izzat Alsmadi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This work is accepted to AAAI 2025 Workshop on AI for Urban Planning. arXiv admin note: substantial text overlap with arXiv:2407.11003

Abstract:The integration of large language models (LLMs) into public transit systems presents a transformative opportunity to enhance urban mobility. This study explores the potential of LLMs to revolutionize public transportation management within the context of San Antonio’s transit system. Leveraging the capabilities of LLMs in natural language processing and data analysis, we investigate their capabilities to optimize route planning, reduce wait times, and provide personalized travel assistance. By utilizing the General Transit Feed Specification (GTFS) and other relevant data, this research aims to demonstrate how LLMs can potentially improve resource allocation, elevate passenger satisfaction, and inform data-driven decision-making in transit operations. A comparative analysis of different ChatGPT models was conducted to assess their ability to understand transportation information, retrieve relevant data, and provide comprehensive responses. Findings from this study suggest that while LLMs hold immense promise for public transit, careful engineering and fine-tuning are essential to realizing their full potential. San Antonio serves as a case study to inform the development of LLM-powered transit systems in other urban environments.

[AI-2] Explainable Reinforcement Learning via Temporal Policy Decomposition

Link: https://arxiv.org/abs/2501.03902
Authors: Franco Ruggeri, Alessio Russo, Rafia Inam, Karl Henrik Johansson
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 4 figures

Abstract:We investigate the explainability of Reinforcement Learning (RL) policies from a temporal perspective, focusing on the sequence of future outcomes associated with individual actions. In RL, value functions compress information about rewards collected across multiple trajectories and over an infinite horizon, allowing a compact form of knowledge representation. However, this compression obscures the temporal details inherent in sequential decision-making, presenting a key challenge for interpretability. We present Temporal Policy Decomposition (TPD), a novel explainability approach that explains individual RL actions in terms of their Expected Future Outcome (EFO). These explanations decompose generalized value functions into a sequence of EFOs, one for each time step up to a prediction horizon of interest, revealing insights into when specific outcomes are expected to occur. We leverage fixed-horizon temporal difference learning to devise an off-policy method for learning EFOs for both optimal and suboptimal actions, enabling contrastive explanations consisting of EFOs for different state-action pairs. Our experiments demonstrate that TPD generates accurate explanations that (i) clarify the policy’s future strategy and anticipated trajectory for a given action and (ii) improve understanding of the reward composition, facilitating fine-tuning of the reward function to align with human expectations.

[AI-3] Neural DNF-MT: A Neuro-symbolic Approach for Learning Interpretable and Editable Policies AAMAS2025

Link: https://arxiv.org/abs/2501.03888
Authors: Kexin Gu Baugh, Luke Dickens, Alessandra Russo
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: AAMAS 2025

Abstract:Although deep reinforcement learning has been shown to be effective, the model’s black-box nature presents barriers to direct policy interpretation. To address this problem, we propose a neuro-symbolic approach called neural DNF-MT for end-to-end policy learning. The differentiable nature of the neural DNF-MT model enables the use of deep actor-critic algorithms for training. At the same time, its architecture is designed so that trained models can be directly translated into interpretable policies expressed as standard (bivalent or probabilistic) logic programs. Moreover, additional layers can be included to extract abstract features from complex observations, acting as a form of predicate invention. The logic representations are highly interpretable, and we show how the bivalent representations of deterministic policies can be edited and incorporated back into a neural model, facilitating manual intervention and adaptation of learned policies. We evaluate our approach on a range of tasks requiring learning deterministic or stochastic behaviours from various forms of observations. Our empirical results show that our neural DNF-MT model performs at the level of competing black-box methods whilst providing interpretable policies.

[AI-4] Three-dimensional attention Transformer for state evaluation in real-time strategy games

Link: https://arxiv.org/abs/2501.03832
Authors: Yanqing Ye, Weilong Yang, Kai Qiu, Jie Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:Situation assessment in Real-Time Strategy (RTS) games is crucial for understanding decision-making in complex adversarial environments. However, existing methods remain limited in processing multi-dimensional feature information and temporal dependencies. Here we propose a tri-dimensional Space-Time-Feature Transformer (TSTF Transformer) architecture, which efficiently models battlefield situations through three independent but cascaded modules: spatial attention, temporal attention, and feature attention. On a dataset comprising 3,150 adversarial experiments, the 8-layer TSTF Transformer demonstrates superior performance: achieving 58.7% accuracy in the early game (~4% progress), significantly outperforming the conventional Timesformer’s 41.8%; reaching 97.6% accuracy in the mid-game (~40% progress) while maintaining low performance variation (standard deviation 0.114). Meanwhile, this architecture requires fewer parameters (4.75M) compared to the baseline model (5.54M). Our study not only provides new insights into situation assessment in RTS games but also presents an innovative paradigm for Transformer-based multi-dimensional temporal modeling.

[AI-5] Online Reinforcement Learning-Based Dynamic Adaptive Evaluation Function for Real-Time Strategy Tasks

Link: https://arxiv.org/abs/2501.03824
Authors: Weilong Yang, Jie Zhang, Xunyun Liu, Yanqing Ye
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 9 figures

Abstract:Effective evaluation of real-time strategy tasks requires adaptive mechanisms to cope with dynamic and unpredictable environments. This study proposes a method to improve evaluation functions for real-time responsiveness to battlefield situation changes, utilizing an online reinforcement learning-based dynamic weight adjustment mechanism within the real-time strategy game. Building on traditional static evaluation functions, the method employs gradient descent in online reinforcement learning to update weights dynamically, incorporating weight decay techniques to ensure stability. Additionally, the AdamW optimizer is integrated to adjust the learning rate and decay rate of online reinforcement learning in real time, further reducing the dependency on manual parameter tuning. Round-robin competition experiments demonstrate that this method significantly enhances the application effectiveness of the Lanchester combat model evaluation function, Simple evaluation function, and Simple Sqrt evaluation function in planning algorithms including IDABCD, IDRTMinimax, and Portfolio AI. The method achieves a notable improvement in scores, with the enhancement becoming more pronounced as the map size increases. Furthermore, the increase in evaluation function computation time induced by this method is kept below 6% for all evaluation functions and planning algorithms. The proposed dynamic adaptive evaluation function demonstrates a promising approach for real-time strategy task evaluation.
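
The idea of updating evaluation-function weights online with gradient descent plus weight decay can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' method: the linear evaluation function, the squared-error objective, the names `evaluate` and `online_update`, and the hyperparameter values are hypothetical, and plain gradient descent with decoupled weight decay stands in for the AdamW optimizer mentioned in the abstract.

```python
def evaluate(weights, features):
    """Hypothetical linear evaluation function: weighted sum of state features."""
    return sum(w * f for w, f in zip(weights, features))


def online_update(weights, features, target, lr=0.05, weight_decay=0.01):
    """One online gradient step on 0.5 * (evaluate - target)^2.

    The decay term shrinks each weight toward zero independently of the
    gradient (decoupled weight decay), which helps keep the weights bounded.
    """
    error = evaluate(weights, features) - target
    return [
        w - lr * error * f - lr * weight_decay * w
        for w, f in zip(weights, features)
    ]


# Usage: repeatedly observe the same (features, outcome) pair.
weights = [0.0, 0.0]
features, outcome = [1.0, 2.0], 1.0
for _ in range(200):
    weights = online_update(weights, features, outcome)
print(evaluate(weights, features))
```

With weight decay the prediction settles slightly below the target, which is the usual trade-off between fit and stability.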

[AI-6] Self-Adaptive ERP: Embedding NLP into Petri-Net creation and Model Matching

Link: https://arxiv.org/abs/2501.03795
Authors: Ahmed Maged, Gamal Kassem
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Enterprise Resource Planning (ERP) consultants play a vital role in customizing systems to meet specific business needs by processing large amounts of data and adapting functionalities. However, the process is resource-intensive, time-consuming, and requires continuous adjustments as business demands evolve. This research introduces a Self-Adaptive ERP Framework that automates customization using enterprise process models and system usage analysis. It leverages Artificial Intelligence (AI) Natural Language Processing (NLP) for Petri nets to transform business processes into adaptable models, addressing both structural and functional matching. The framework, built using Design Science Research (DSR) and a Systematic Literature Review (SLR), reduces reliance on manual adjustments, improving ERP customization efficiency and accuracy while minimizing the need for consultants.

[AI-7] Neural Deconstruction Search for Vehicle Routing Problems

Link: https://arxiv.org/abs/2501.03715
Authors: André Hottung, Paula Wong-Chung, Kevin Tierney
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Autoregressive construction approaches generate solutions to vehicle routing problems in a step-by-step fashion, leading to high-quality solutions that are nearing the performance achieved by handcrafted, operations research techniques. In this work, we challenge the conventional paradigm of sequential solution construction and introduce an iterative search framework where solutions are instead deconstructed by a neural policy. Throughout the search, the neural policy collaborates with a simple greedy insertion algorithm to rebuild the deconstructed solutions. Our approach surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.
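
The deconstruct-then-rebuild loop described in the abstract can be sketched in a heavily simplified setting. This is an illustration under stated assumptions, not the paper's algorithm: the problem is reduced to a single closed tour without capacities or depots, the neural deconstruction policy is replaced by uniformly random removal, and the names `tour_length`, `greedy_insert`, and `deconstruct_and_rebuild` are hypothetical.

```python
import math
import random


def tour_length(tour, coords):
    """Length of the closed tour visiting coords in the given order."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))


def greedy_insert(partial, removed, coords):
    """Reinsert each removed node at the position that adds the least length."""
    tour = list(partial)
    for node in removed:
        best_pos, best_delta = 0, math.inf
        for i in range(len(tour)):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            delta = (math.dist(coords[a], coords[node])
                     + math.dist(coords[node], coords[b])
                     - math.dist(coords[a], coords[b]))
            if delta < best_delta:
                best_pos, best_delta = i + 1, delta
        tour.insert(best_pos, node)
    return tour


def deconstruct_and_rebuild(coords, iterations=200, k=3, seed=0):
    """Iteratively remove k nodes (randomly here) and rebuild greedily."""
    rng = random.Random(seed)
    best = list(range(len(coords)))
    best_len = tour_length(best, coords)
    for _ in range(iterations):
        removed = rng.sample(best, k)  # stand-in for a learned policy
        partial = [n for n in best if n not in removed]
        candidate = greedy_insert(partial, removed, coords)
        cand_len = tour_length(candidate, coords)
        if cand_len < best_len:  # keep only improving solutions
            best, best_len = candidate, cand_len
    return best, best_len


# Usage on 12 random points in the unit square.
rng = random.Random(1)
coords = [(rng.random(), rng.random()) for _ in range(12)]
tour, length = deconstruct_and_rebuild(coords)
print(tour, length)
```

Accepting only improving candidates keeps the sketch short; a practical search would likely use a more permissive acceptance rule.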

[AI-8] Exploring Molecule Generation Using Latent Space Graph Diffusion

Link: https://arxiv.org/abs/2501.03696
Authors: Prashanth Pombala, Gerrit Grossmann, Verena Wolf
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generating molecular graphs is a challenging task due to their discrete nature and the competitive objectives involved. Diffusion models have emerged as SOTA approaches in data generation across various modalities. For molecular graphs, graph neural networks (GNNs) as a diffusion backbone have achieved impressive results. Latent space diffusion, where diffusion occurs in a low-dimensional space via an autoencoder, has demonstrated computational efficiency. However, the literature on latent space diffusion for molecular graphs is scarce, and no commonly accepted best practices exist. In this work, we explore different approaches and hyperparameters, contrasting generative flow models (denoising diffusion, flow matching, heat dissipation) and architectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high sensitivity to the choice of approach and design decisions. Code is made available at this http URL.

[AI-9] MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation

Link: https://arxiv.org/abs/2501.03689
Authors: Haojie Wei, Jun Yuan, Rui Zhang, Quanyu Dai, Yueguo Chen
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Therefore, existing methods have tried to perform these two tasks simultaneously, so as to leverage the mutually beneficial relationship between both tasks. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and joint learning optimization. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework and can use variant models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which addresses the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate the great generality of MAJL in adapting to different model architectures.

[AI-10] SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks

Link: https://arxiv.org/abs/2501.03676
Authors: Zheng Chun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, 4 tables

Abstract:In this work, we build upon the offline reinforcement learning algorithm TD7, which incorporates State-Action Learned Embeddings (SALE) and LAP, and propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC. The ensemble Q-networks effectively address the challenge of out-of-distribution actions by introducing penalties that guide the actor network to focus on in-distribution actions. Meanwhile, the gradient diversity penalty encourages diverse Q-value gradients, further suppressing overestimation for out-of-distribution actions. Additionally, our method retains an adjustable behavior cloning (BC) term that directs the actor network toward dataset actions during early training stages, while gradually reducing its influence as the precision of the Q-ensemble improves. These enhancements work synergistically to improve training stability and accuracy. Experimental results on the D4RL MuJoCo benchmarks demonstrate that our algorithm achieves superior convergence speed, stability, and performance compared to existing methods.

[AI-11] Effective and Efficient Mixed Precision Quantization of Speech Foundation Models ICASSP2025

Link: https://arxiv.org/abs/2501.03643
Authors: Haoning Xu, Zhaoqing Li, Zengrui Jin, Huimeng Wang, Youjun Chen, Guinan Li, Mengzhe Geng, Shujie Hu, Jiajun Deng, Xunying Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: To appear at IEEE ICASSP 2025

Abstract:This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage. Experiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increased the lossless compression ratio by factors up to 1.7x and 1.9x over the respective uniform-precision and two-stage mixed-precision quantized baselines that perform precision learning and model parameter quantization in separate and disjointed stages, while incurring no statistically significant word error rate (WER) increase over the 32-bit full-precision models. The system compression time of wav2vec2.0-base and HuBERT-large models is reduced by up to 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both produce lower WERs. The best-performing 3.5-bit mixed-precision quantized HuBERT-large model produces a lossless compression ratio of 8.6x over the 32-bit full-precision system.

[AI-12] MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction

Link: https://arxiv.org/abs/2501.03635
Authors: Mei Wu, Yiqian Lin, Tianfan Jiang, Wenchao Weng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

Abstract:In recent years, traffic flow prediction has played a crucial role in the management of intelligent transportation systems. However, traditional forecasting methods often model non-Euclidean low-dimensional traffic data as a simple graph with single-type nodes and edges, failing to capture similar trends among nodes of the same type. To address this limitation, this paper proposes MHGNet, a novel framework for modeling spatiotemporal multi-heterogeneous graphs. Within this framework, the STD Module decouples single-pattern traffic data into multi-pattern traffic data through feature mappings of timestamp embedding matrices and node embedding matrices. Subsequently, the Node Clusterer leverages the Euclidean distance between nodes and different types of limit points to perform clustering with O(N) time complexity. The nodes within each cluster undergo residual subgraph convolution within the spatiotemporal fusion subgraphs generated by the DSTGG Module, followed by processing in the SIE Module for node repositioning and redistribution of weights. To validate the effectiveness of MHGNet, this paper conducts extensive ablation studies and quantitative evaluations on four widely used benchmarks, demonstrating its superior performance.

[AI-13] RecKG: Knowledge Graph for Recommender Systems

Link: https://arxiv.org/abs/2501.03598
Authors: Junhyuk Kwon, Seokho Ahn, Young-Duk Seo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by The 39th ACM/SIGAPP Symposium On Applied Computing (SAC) 2024

Abstract:Knowledge graphs have proven successful in integrating heterogeneous data across various domains. However, there remains a noticeable dearth of research on their seamless integration among heterogeneous recommender systems, despite knowledge graph-based recommender systems garnering extensive research attention. This study aims to fill this gap by proposing RecKG, a standardized knowledge graph for recommender systems. RecKG ensures the consistent representation of entities across different datasets, accommodating diverse attribute types for effective data integration. Through a meticulous examination of various recommender system datasets, we select attributes for RecKG, ensuring standardized formatting through consistent naming conventions. By these characteristics, RecKG can seamlessly integrate heterogeneous data sources, enabling the discovery of additional semantic information within the integrated knowledge graph. We apply RecKG to standardize real-world datasets, subsequently developing an application for RecKG using a graph database. Finally, we validate RecKG’s achievement in interoperability through a qualitative evaluation between RecKG and other studies.

[AI-14] STContext: A Multifaceted Dataset for Developing Context-aware Spatio-temporal Crowd Mobility Prediction Models

Link: https://arxiv.org/abs/2501.03583
Authors: Liyue Chen, Jiangyi Fang, Tengfei Liu, Fangyuan Gao, Leye Wang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at this https URL.

[AI-15] Applying Large Language Models in Knowledge Graph-based Enterprise Modeling: Challenges and Opportunities

Link: https://arxiv.org/abs/2501.03566
Authors: Benedikt Reitemeyer, Hans-Georg Fill
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:The role of large language models (LLMs) in enterprise modeling has recently started to shift from academic research to that of industrial applications. Thereby, LLMs represent a further building block for the machine-supported generation of enterprise models. In this paper we employ a knowledge graph-based approach for enterprise modeling and investigate the potential benefits of LLMs in this context. In addition, the findings of an expert survey and ChatGPT-4o-based experiments demonstrate that LLM-based model generations exhibit minimal variability, yet remain constrained to specific tasks, with reliability declining for more intricate tasks. The survey results further suggest that the supervision and intervention of human modeling experts are essential to ensure the accuracy and integrity of the generated models.

[AI-16] Rethinking Adversarial Attacks in Reinforcement Learning from Policy Distribution Perspective

Link: https://arxiv.org/abs/2501.03562
Authors: Tianyang Duan, Zongyuan Zhang, Zheng Lin, Yue Gao, Ling Xiong, Yong Cui, Hongbin Liang, Xianhao Chen, Heming Cui, Dong Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, 2 tables

Abstract:Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in real-world applications. Adversarial attacks are an effective method for evaluating the robustness of DRL agents. However, existing attack methods targeting individual sampled actions have limited impact on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, which leverages the entire policy distribution rather than relying on individual samples. We utilize the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experimental results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks, achieving an average 22.03% higher reward drop compared to the best baseline.
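
For readers unfamiliar with the Bhattacharyya distance named in the abstract, a minimal sketch of the standard definition follows. It shows the discrete form, D_B = -ln(sum_i sqrt(p_i * q_i)), and the closed form for two univariate Gaussians. How DAPGD actually uses the distance inside its attack is not specified here; the function names are hypothetical and the univariate Gaussian policy is an assumption chosen for illustration.

```python
import math


def bhattacharyya_discrete(p, q):
    """Bhattacharyya distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support size")
    # Bhattacharyya coefficient: 1 for identical, 0 for disjoint distributions.
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if bc <= 0.0:
        return math.inf
    return -math.log(min(bc, 1.0))  # clamp guards against rounding above 1


def bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form Bhattacharyya distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    spread_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + spread_term


# Usage: two action distributions that differ only in their mean.
print(bhattacharyya_gaussian(0.0, 1.0, 2.0, 1.0))
print(bhattacharyya_discrete([0.5, 0.5], [0.9, 0.1]))
```

Unlike a comparison of single sampled actions, both forms depend on the full shape of the two distributions, which is the property the abstract highlights.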

[AI-17] Deep Learning within Tabular Data: Foundations, Challenges, Advances and Future Directions

Link: https://arxiv.org/abs/2501.03540
Authors: Weijieying Ren, Tianxiang Zhao, Yuqing Huang, Vasant Honavar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation learning, structured around three foundational design elements: training data, neural architectures, and learning objectives. Unlike prior surveys that focus primarily on either architecture design or learning strategies, we adopt a holistic perspective that emphasizes the universality and robustness of representation learning methods across diverse downstream tasks. We examine recent advances in data augmentation and generation, specialized neural network architectures tailored to tabular data, and innovative learning objectives that enhance representation quality. Additionally, we highlight the growing influence of self-supervised learning and the adaptation of transformer-based foundation models for tabular data. Our review is based on a systematic literature search using rigorous inclusion criteria, encompassing 127 papers published since 2020 in top-tier conferences and journals. Through detailed analysis and comparison, we identify emerging trends, critical gaps, and promising directions for future research, aiming to guide the development of more generalizable and effective tabular data representation methods.

[AI-18] SenseRAG: Constructing Environmental Knowledge Bases with Proactive Querying for LLM-Based Autonomous Driving WACV

Link: https://arxiv.org/abs/2501.03535
Authors: Xuewen Luo, Fan Ding, Fengze Yang, Yang Zhou, Junnyong Loo, Hwa Hui Tew, Chenxi Liu
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: This paper has been accepted for presentation at WACV Workshop LLMAD 2025

Abstract:This study addresses the critical need for enhanced situational awareness in autonomous driving (AD) by leveraging the contextual reasoning capabilities of large language models (LLMs). Unlike traditional perception systems that rely on rigid, label-based annotations, it integrates real-time, multimodal sensor data into a unified, LLMs-readable knowledge base, enabling LLMs to dynamically understand and respond to complex driving environments. To overcome the inherent latency and modality limitations of LLMs, a proactive Retrieval-Augmented Generation (RAG) is designed for AD, combined with a chain-of-thought prompting mechanism, ensuring rapid and context-rich understanding. Experimental results using real-world Vehicle-to-everything (V2X) datasets demonstrate significant improvements in perception and prediction performance, highlighting the potential of this framework to enhance safety, adaptability, and decision-making in next-generation AD systems.

[AI-19] Vocal Tract Length Warped Features for Spoken Keyword Spotting

Link: https://arxiv.org/abs/2501.03523
Authors: Achintya kr. Sarkar, Priyanka Dwivedi, Zheng-Hua Tan
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:In this paper, we propose several methods that incorporate vocal tract length (VTL) warped features for spoken keyword spotting (KWS). The first method, VTL-independent KWS, involves training a single deep neural network (DNN) that utilizes VTL features with various warping factors. During training, a specific VTL feature is randomly selected per epoch, allowing the exploration of VTL variations. During testing, the VTL features with different warping factors of a test utterance are scored against the DNN and combined with equal weight. The second method scores the conventional features of a test utterance (without VTL warping) against the DNN. The third method, VTL-concatenation KWS, concatenates VTL warped features to form high-dimensional features for KWS. Evaluations carried out on the English Google Command dataset demonstrate that the proposed methods improve the accuracy of KWS.

[AI-20] Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment AAAI2025

Link: https://arxiv.org/abs/2501.03486
Authors: Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, George K. Atia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, Accepted in AAAI 2025

Abstract:The alignment of large language models (LLMs) with human values is critical as these models become increasingly integrated into various societal and decision-making processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters, but these approaches are often computationally expensive and impractical when models are frozen or inaccessible for parameter modification. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment. While the existing literature has shown empirical promise of prompt optimization, its theoretical underpinning remains under-explored. We address this gap by formulating prompt optimization as an optimization problem and try to provide theoretical insights into the optimality of such a framework. To analyze the performance of the prompt optimization, we study theoretical suboptimality bounds and provide insights in terms of how prompt optimization depends upon the given prompter and target model. We also provide empirical validation through experiments on various datasets, demonstrating that prompt optimization can effectively align LLMs, even when parameter fine-tuning is not feasible.

[AI-21] LHGNN: Local-Higher Order Graph Neural Networks For Audio Classification and Tagging

Link: https://arxiv.org/abs/2501.03464
Authors: Shubhr Singh, Emmanouil Benetos, Huy Phan, Dan Stowell
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, this work introduces the Local-Higher Order Graph Neural Network (LHGNN), a graph-based model that enhances feature understanding by integrating local neighbourhood information with higher-order data from Fuzzy C-Means clusters, thereby capturing a broader spectrum of audio relationships. Evaluation of the model on three publicly available audio datasets shows that it outperforms Transformer-based models across all benchmarks while operating with substantially fewer parameters. Moreover, LHGNN demonstrates a distinct advantage in scenarios lacking ImageNet pretraining, establishing its effectiveness and efficiency in environments where extensive pretraining data is unavailable.

[AI-22] Radar Signal Recognition through Self-Supervised Learning and Domain Adaptation

Link: https://arxiv.org/abs/2501.03461
Authors: Zi Huang, Akila Pemasiri, Simon Denman, Clinton Fookes, Terrence Martin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 5 pages, 9 figures

Abstract:Automatic radar signal recognition (RSR) plays a pivotal role in electronic warfare (EW), as accurately classifying radar signals is critical for informing decision-making processes. Recent advances in deep learning have shown significant potential in improving RSR performance in domains with ample annotated data. However, these methods fall short in EW scenarios where annotated RF data are scarce or impractical to obtain. To address these challenges, we introduce a self-supervised learning (SSL) method which utilises masked signal modelling and RF domain adaption to enhance RSR performance in environments with limited RF samples and labels. Specifically, we investigate pre-training masked autoencoders (MAE) on baseband in-phase and quadrature (I/Q) signals from various RF domains and subsequently transfer the learned representation to the radar domain, where annotated data are limited. Empirical results show that our lightweight self-supervised ResNet model with domain adaptation achieves up to a 17.5% improvement in 1-shot classification accuracy when pre-trained on in-domain signals (i.e., radar signals) and up to a 16.31% improvement when pre-trained on out-of-domain signals (i.e., comm signals), compared to its baseline without SSL. We also provide reference results for several MAE designs and pre-training strategies, establishing a new benchmark for few-shot radar signal classification.

[AI-23] SALT: Sales Autocompletion Linked Business Tables Dataset NEURIPS2024

Link: https://arxiv.org/abs/2501.03413
Authors: Tassilo Klein, Clemens Biehl, Margarida Costa, Andre Sres, Jonas Kolk, Johannes Hoffart
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Table Representation Learning Workshop at NeurIPS 2024

Abstract: Foundation models, particularly those that incorporate Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data, like tables, however, introduces significant challenges. These difficulties are even more pronounced when addressing multi-table data linked via foreign keys, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its substantial impact, research on such linked business tables within enterprise settings remains an important yet underexplored domain. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research endeavors in table representation learning. By providing access to authentic enterprise data, our goal is to potentially enhance the effectiveness and applicability of models for real-world business contexts.

[AI-24] Enhanced Importance Sampling through Latent Space Exploration in Normalizing Flows AAAI2025

Link: https://arxiv.org/abs/2501.03394
Authors: Liam A. Kruse, Alexandros E. Tzikas, Harrison Delecki, Mansur M. Arief, Mykel J. Kochenderfer
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI 2025

Abstract:Importance sampling is a rare event simulation technique used in Monte Carlo simulations to bias the sampling distribution towards the rare event of interest. By assigning appropriate weights to sampled points, importance sampling allows for more efficient estimation of rare events or tails of distributions. However, importance sampling can fail when the proposal distribution does not effectively cover the target distribution. In this work, we propose a method for more efficient sampling by updating the proposal distribution in the latent space of a normalizing flow. Normalizing flows learn an invertible mapping from a target distribution to a simpler latent distribution. The latent space can be more easily explored during the search for a proposal distribution, and samples from the proposal distribution are recovered in the space of the target distribution via the invertible mapping. We empirically validate our methodology on simulated robotics applications such as autonomous racing and aircraft ground collision avoidance.
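
As a rough illustration of the weighting step described above, the sketch below estimates a Gaussian tail probability by plain importance sampling. It does not use a normalizing flow or the paper's latent-space update; the function names, the shifted Gaussian proposal, and the parameter values are assumptions chosen for the example.

```python
import math
import random


def estimate_tail_probability(threshold, n_samples, shift, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Samples are drawn from the proposal N(shift, 1), and each sample that
    lands in the rare region is weighted by the density ratio p(x) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            # log p(x) - log q(x) for two unit-variance Gaussians.
            log_weight = -0.5 * x * x + 0.5 * (x - shift) ** 2
            total += math.exp(log_weight)
    return total / n_samples


def exact_tail_probability(threshold):
    """Closed-form P(X > threshold) for a standard normal variable."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    print(estimate_tail_probability(4.0, 20000, shift=4.0))
    print(exact_tail_probability(4.0))
```

Centering the proposal on the threshold puts roughly half of the samples in the rare region, which is why the estimate stays close to the closed-form value with a modest sample count. A proposal that misses the region would show the failure mode the abstract mentions.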

[AI-25] Over-the-Air Fair Federated Learning via Multi-Objective Optimization

Link: https://arxiv.org/abs/2501.03392
Authors: Shayan Mohajer Hamidi, Ali Bereyhi, Saba Asaad, H. Vincent Poor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In federated learning (FL), heterogeneity among the local dataset distributions of clients can result in unsatisfactory performance for some, leading to an unfair model. To address this challenge, we propose an over-the-air fair federated learning algorithm (OTA-FFL), which leverages over-the-air computation to train fair FL models. By formulating FL as a multi-objective minimization problem, we introduce a modified Chebyshev approach to compute adaptive weighting coefficients for gradient aggregation in each communication round. To enable efficient aggregation over the multiple access channel, we derive analytical solutions for the optimal transmit scalars at the clients and the de-noising scalar at the parameter server. Extensive experiments demonstrate the superiority of OTA-FFL in achieving fairness and robust performance compared to existing methods.

[AI-26] Existential Crisis: A Social Robot’s Reason for Being

Link: https://arxiv.org/abs/2501.03376
Authors: Dora Medgyesy, Joella Galas, Julian van Pol, Rustam Eynaliyev, Thijs Vollebregt
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract: As robots become ever more important in our daily lives, there is a growing need to understand how they are perceived by people. This study investigates how the user perception of robots is influenced by displays of personality. Using LLMs and speech-to-text technology, we designed a within-subject study to compare two conditions: a personality-driven robot and a purely task-oriented, personality-neutral robot. Twelve participants, recruited from the Socially Intelligent Robotics course at Vrije Universiteit Amsterdam, interacted with a Nao robot tasked with asking them a set of medical questions under both conditions. After completing both interactions, the participants completed a user experience questionnaire measuring their emotional states and robot perception using standardized questionnaires from the SRI and Psychology literature.

[AI-27] Rethinking Byzantine Robustness in Federated Recommendation from Sparse Aggregation Perspective AAAI2025

Link: https://arxiv.org/abs/2501.03301
Authors: Zhongjian Zhang, Mengmei Zhang, Xiao Wang, Lingjuan Lyu, Bo Yan, Junping Du, Chuan Shi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025

Abstract: To preserve user privacy in recommender systems, federated recommendation (FR) based on federated learning (FL) has emerged, keeping the personal data on the local client and updating a model collaboratively. Unlike FL, FR has a unique sparse aggregation mechanism, where the embedding of each item is updated by only a subset of clients, rather than by all clients as in the dense aggregation of general FL. Recently, as an essential principle of FL, model security has received increasing attention, especially for Byzantine attacks, where malicious clients can send arbitrary updates. The problem of exploring the Byzantine robustness of FR is particularly critical since in the domains applying FR, e.g., e-commerce, malicious clients can be injected easily by registering new accounts. However, existing work on Byzantine robustness neglects the unique sparse aggregation of FR, making it unsuitable for our problem. Thus, we make the first effort to investigate Byzantine attacks on FR from the perspective of sparse aggregation, which is non-trivial: it is not clear how to define Byzantine robustness under sparse aggregation or how to design Byzantine attacks under limited knowledge/capability. In this paper, we reformulate Byzantine robustness under sparse aggregation by defining the aggregation for a single item as the smallest execution unit. Then we propose a family of effective attack strategies, named Spattack, which exploit the vulnerability in sparse aggregation and are categorized along the adversary’s knowledge and capability. Extensive experimental results demonstrate that Spattack can effectively prevent convergence and even break down defenses with only a few malicious clients, raising alarms for securing FR systems.
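
To make the sparse aggregation idea concrete, here is a minimal sketch in which each item embedding is averaged only over the clients that submitted an update for that item. Mean aggregation, the dictionary layout, and the toy values are assumptions for illustration; this is not the paper's protocol and not Spattack itself.

```python
from collections import defaultdict


def sparse_aggregate(client_updates):
    """Average item-embedding updates per item over the clients that sent one.

    client_updates: list of dicts mapping item_id -> list of floats.
    Items a client did not interact with are simply absent from its dict.
    """
    sums = {}
    counts = defaultdict(int)
    for update in client_updates:
        for item_id, vector in update.items():
            if item_id not in sums:
                sums[item_id] = [0.0] * len(vector)
            sums[item_id] = [s + v for s, v in zip(sums[item_id], vector)]
            counts[item_id] += 1
    return {
        item_id: [s / counts[item_id] for s in vector]
        for item_id, vector in sums.items()
    }


if __name__ == "__main__":
    benign = [
        {"a": [1.0, 1.0], "b": [1.0, 1.0]},
        {"a": [1.0, 1.0]},
        {"a": [1.0, 1.0]},
    ]
    malicious = [{"b": [-9.0, -9.0]}]  # hypothetical single attacker
    print(sparse_aggregate(benign + malicious))
```

In this toy run, item "a" keeps its benign average while item "b", which has a single benign contributor, is dragged to [-4.0, -4.0] by one attacker. That asymmetry is the reason the abstract treats the per-item aggregation as the unit of analysis.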

[AI-28] A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Link: https://arxiv.org/abs/2501.03295
Authors: Shuo Tong, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract: Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigm of parameter updating, which inherently faces challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated significant potential across various domains, notably through In-Context Learning (ICL), which enables high-performance task execution with minimal input-label demonstrations and no prior training. This paper aims to replace supervised learning with the emerging ICL paradigm for soft sensor modeling to address existing challenges and explore new avenues for advancement. To achieve this, we propose a novel framework called the Few-shot Uncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes the Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware Few-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial Knowledge Vector Storage to enhance LLMs’ domain-specific knowledge, enabling zero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based context demonstrations of structured data to prompt LLMs to execute ICL for prediction, and we propose a context sample retrieval augmentation strategy to improve performance. Additionally, we explore LLMs’ AIGC and probabilistic characteristics to propose self-explanation and uncertainty quantification methods for constructing a trustworthy soft sensor. Extensive experiments demonstrate that our method achieves state-of-the-art predictive performance, strong robustness, and flexibility, and effectively mitigates the training instability found in traditional methods. To the best of our knowledge, this is the first work to establish a soft sensor utilizing LLMs.

[AI-29] Multi-Modal One-Shot Federated Ensemble Learning for Medical Data with Vision Large Language Model

Link: https://arxiv.org/abs/2501.03292
Authors: Naibo Wang, Yuchen Deng, Shichen Fan, Jianwei Yin, See-Kiong Ng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Federated learning (FL) has attracted considerable interest in the medical domain due to its capacity to facilitate collaborative model training while maintaining data privacy. However, conventional FL methods typically necessitate multiple communication rounds, leading to significant communication overhead and delays, especially in environments with limited bandwidth. One-shot federated learning addresses these issues by conducting model training and aggregation in a single communication round, thereby reducing communication costs while preserving privacy. Among these, one-shot federated ensemble learning combines independently trained client models using ensemble techniques such as voting, further boosting performance in non-IID data scenarios. On the other hand, existing machine learning methods in healthcare predominantly use unimodal data (e.g., medical images or textual reports), which restricts their diagnostic accuracy and comprehensiveness. Therefore, the integration of multi-modal data is proposed to address these shortcomings. In this paper, we introduce FedMME, an innovative one-shot multi-modal federated ensemble learning framework that utilizes multi-modal data for medical image analysis. Specifically, FedMME capitalizes on vision large language models to produce textual reports from medical images, employs a BERT model to extract textual features from these reports, and amalgamates these features with visual features to improve diagnostic accuracy. Experimental results show that our method demonstrated superior performance compared to existing one-shot federated learning methods in healthcare scenarios across four datasets with various data distributions. For instance, it surpasses existing one-shot federated learning approaches by more than 17.5% in accuracy on the RSNA dataset when applying a Dirichlet distribution with α = 0.3.
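
The Dirichlet setting mentioned at the end of the abstract usually refers to splitting each class across clients in proportions drawn from a Dirichlet distribution, so that a smaller concentration value gives more skewed, non-IID clients. The sketch below shows one common way to do this with the standard library only. The function name and arguments are hypothetical, and the paper's exact partitioning procedure may differ.

```python
import random


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Normalized Gamma(alpha, 1) draws form a symmetric Dirichlet sample.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        start = 0
        cumulative = 0.0
        for client, gamma in enumerate(gammas):
            cumulative += gamma / total
            if client == n_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = int(cumulative * len(indices))
            clients[client].extend(indices[start:end])
            start = end
    return clients


if __name__ == "__main__":
    toy_labels = [0] * 50 + [1] * 50
    for client_id, part in enumerate(dirichlet_partition(toy_labels, 4, 0.3)):
        print(client_id, len(part))
```

With a concentration of 0.3, most of a class tends to land on one or two clients, which is the kind of label skew the reported RSNA experiment refers to.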

[AI-30] A Decision-Based Heterogenous Graph Attention Network for Multi-Class Fake News Detection

Link: https://arxiv.org/abs/2501.03290
Authors: Batool Lakzaei, Mostafa Haghir Chehreghani, Alireza Bagheri
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:A promising tool for addressing fake news detection is Graph Neural Networks (GNNs). However, most existing GNN-based methods rely on binary classification, categorizing news as either real or fake. Additionally, traditional GNN models use a static neighborhood for each node, making them susceptible to issues like over-squashing. In this paper, we introduce a novel model named Decision-based Heterogeneous Graph Attention Network (DHGAT) for fake news detection in a semi-supervised setting. DHGAT effectively addresses the limitations of traditional GNNs by dynamically optimizing and selecting the neighborhood type for each node in every layer. It represents news data as a heterogeneous graph where nodes (news items) are connected by various types of edges. The architecture of DHGAT consists of a decision network that determines the optimal neighborhood type and a representation network that updates node embeddings based on this selection. As a result, each node learns an optimal and task-specific computational graph, enhancing both the accuracy and efficiency of the fake news detection process. We evaluate DHGAT on the LIAR dataset, a large and challenging dataset for multi-class fake news detection, which includes news items categorized into six classes. Our results demonstrate that DHGAT outperforms existing methods, improving accuracy by approximately 4% and showing robustness with limited labeled data.

[AI-31] CodeVision: Detecting LLM-Generated Code Using 2D Token Probability Maps and Vision Models

Link: https://arxiv.org/abs/2501.03288
Authors: Zhenyu Xu, Victor S. Sheng
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The rise of large language models (LLMs) like ChatGPT has significantly improved automated code generation, enhancing software development efficiency. However, this introduces challenges in academia, particularly in distinguishing between human-written and LLM-generated code, which complicates issues of academic integrity. Existing detection methods, such as pre-trained models and watermarking, face limitations in adaptability and computational efficiency. In this paper, we propose a novel detection method using 2D token probability maps combined with vision models, preserving spatial code structures such as indentation and brackets. By transforming code into log probability matrices and applying vision models like Vision Transformers (ViT) and ResNet, we capture both content and structure for more accurate detection. Our method shows robustness across multiple programming languages and improves upon traditional detectors, offering a scalable and computationally efficient solution for identifying LLM-generated code.
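
One way to picture a 2D token probability map is to paint each token's log probability onto the character cells it occupies in the source layout, so indentation and line breaks survive as spatial structure. The sketch below does this for a hypothetical list of (text, logprob) pairs. The layout rule, the padding value, and the function name are assumptions, and the abstract does not specify the paper's actual construction.

```python
def build_logprob_map(tokens, width, height, pad_value=0.0):
    """Lay token log probabilities out on a height x width character grid.

    tokens: list of (text, logprob) pairs in source order. Every character
    of a token gets that token's logprob; a newline moves to the next row
    and its own logprob is not stored. Cells outside the grid are dropped.
    """
    grid = [[pad_value] * width for _ in range(height)]
    row, col = 0, 0
    for text, logprob in tokens:
        for ch in text:
            if ch == "\n":
                row, col = row + 1, 0
                continue
            if row < height and col < width:
                grid[row][col] = logprob
            col += 1
    return grid


if __name__ == "__main__":
    # Hypothetical scores; a real pipeline would take them from a language model.
    toy_tokens = [("def", -0.1), (" f():", -0.5), ("\n", -0.2), ("    return 1", -1.0)]
    for line in build_logprob_map(toy_tokens, width=12, height=2):
        print(line)
```

The resulting grid can be treated like a single-channel image, which is what makes vision backbones such as ViT or ResNet applicable.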

[AI-32] From Aleatoric to Epistemic: Exploring Uncertainty Quantification Techniques in Artificial Intelligence

Link: https://arxiv.org/abs/2501.03282
Authors: Tianyang Wang, Yunze Wang, Jun Zhou, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang, Cheng Fei, Caitlyn Heqi Yin, Lawrence KQ Yan
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:Uncertainty quantification (UQ) is a critical aspect of artificial intelligence (AI) systems, particularly in high-risk domains such as healthcare, autonomous systems, and financial technology, where decision-making processes must account for uncertainty. This review explores the evolution of uncertainty quantification techniques in AI, distinguishing between aleatoric and epistemic uncertainties, and discusses the mathematical foundations and methods used to quantify these uncertainties. We provide an overview of advanced techniques, including probabilistic methods, ensemble learning, sampling-based approaches, and generative models, while also highlighting hybrid approaches that integrate domain-specific knowledge. Furthermore, we examine the diverse applications of UQ across various fields, emphasizing its impact on decision-making, predictive accuracy, and system robustness. The review also addresses key challenges such as scalability, efficiency, and integration with explainable AI, and outlines future directions for research in this rapidly developing area. Through this comprehensive survey, we aim to provide a deeper understanding of UQ’s role in enhancing the reliability, safety, and trustworthiness of AI systems.

[AI-33] Revolutionizing Encrypted Traffic Classification with MH-Net: A Multi-View Heterogeneous Graph Model AAAI2025

Link: https://arxiv.org/abs/2501.03279
Authors: Haozhen Zhang, Haodong Yue, Xi Xiao, Le Yu, Qing Li, Zhen Ling, Ye Zhang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025. The code is available at this https URL. arXiv admin note: text overlap with arXiv:2402.07501

Abstract:With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional byte-based traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.

[AI-34] Heterogeneous Graph Pre-training Based Model for Secure and Efficient Prediction of Default Risk Propagation among Bond Issuers

Link: https://arxiv.org/abs/2501.03268
Authors: Xurui Li, Xin Shan, Wenhao Yin, Haijiao Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Efficient prediction of default risk for bond-issuing enterprises is pivotal for maintaining stability and fostering growth in the bond market. Conventional methods usually rely solely on an enterprise’s internal data for risk assessment. In contrast, graph-based techniques leverage interconnected corporate information to enhance default risk identification for targeted bond issuers. Traditional graph techniques such as the label propagation algorithm or DeepWalk fail to effectively integrate an enterprise’s inherent attribute information with its topological network data. Additionally, due to data scarcity and security and privacy concerns between enterprises, end-to-end graph neural network (GNN) algorithms may struggle to deliver satisfactory performance for target tasks. To address these challenges, we present a novel two-stage model. In the first stage, we employ an innovative Masked Autoencoder for Heterogeneous Graphs (HGMAE) to pre-train on a vast enterprise knowledge graph. Subsequently, in the second stage, a specialized classifier model is trained to predict default risk propagation probabilities. The classifier leverages concatenated feature vectors derived from the pre-trained encoder with the enterprise’s task-specific feature vectors. Through the two-stage training approach, our model not only boosts the importance of unique bond characteristics for specific default prediction tasks, but also securely and efficiently leverages the global information pre-trained from other enterprises. Experimental results demonstrate that our proposed model outperforms existing approaches in predicting default risk for bond issuers.

[AI-35] Optimizing Edge AI: A Comprehensive Survey on Data, Model, and System Strategies

Link: https://arxiv.org/abs/2501.03265
Authors: Xubin Wang, Weijia Jia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The emergence of 5G and edge computing hardware has brought about a significant shift in artificial intelligence, with edge AI becoming a crucial technology for enabling intelligent applications. With the growing amount of data generated and stored on edge devices, deploying AI models for local processing and inference has become increasingly necessary. However, deploying state-of-the-art AI models on resource-constrained edge devices faces significant challenges that must be addressed. This paper presents an optimization triad for efficient and reliable edge AI deployment, including data, model, and system optimization. First, we discuss optimizing data through data cleaning, compression, and augmentation to make it more suitable for edge deployment. Second, we explore model design and compression methods at the model level, such as pruning, quantization, and knowledge distillation. Finally, we introduce system optimization techniques like framework support and hardware acceleration to accelerate edge AI workflows. Based on an in-depth analysis of various application scenarios and deployment challenges of edge AI, this paper proposes an optimization paradigm based on the data-model-system triad to enable a whole set of solutions to effectively transfer ML models, which are initially trained in the cloud, to various edge devices for supporting multiple scenarios.
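
Of the model-level techniques listed above, quantization is the easiest to show in a few lines. The sketch below applies symmetric per-tensor quantization to a list of weights and maps them back. The int8 range, the single shared scale, and the function names are assumptions for illustration rather than a method taken from the survey.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.3, 1.0]  # hypothetical layer weights
    q, scale = quantize_int8(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost traded for storing one byte per weight instead of four.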

[AI-36] Bridge the Inference Gaps of Neural Processes via Expectation Maximization ICLR2023

Link: https://arxiv.org/abs/2501.03264
Authors: Qi Wang, Marco Federici, Herke van Hoof
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: ICLR2023

Abstract: The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, it suffers from under-fitting and shows suboptimal performance in practice. Researchers have primarily focused on incorporating diverse structural inductive biases, e.g., attention or convolution, in modeling. The topic of inference suboptimality and an analysis of the NP from the optimization objective perspective has hardly been studied in earlier work. To fix this issue, we propose a surrogate objective of the target log-likelihood of the meta dataset within the expectation maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NP objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance. Our code is available at this https URL.
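
The "self-normalized importance weighted" part of the model name points to a generic building block: importance weights that are rescaled to sum to one. The sketch below computes such weights from unnormalized log weights in a numerically stable way. It illustrates that general technique only, not the SI-NP objective, and the function name is hypothetical.

```python
import math


def self_normalized_weights(log_weights):
    """Turn unnormalized log importance weights into weights that sum to one.

    Subtracting the maximum before exponentiating avoids overflow when the
    log weights are large.
    """
    largest = max(log_weights)
    exps = [math.exp(lw - largest) for lw in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(self_normalized_weights([0.0, 0.0]))
    print(self_normalized_weights([1000.0, 999.0, 998.0]))
```

Because only ratios of weights matter after normalization, the unknown normalizing constant of the target distribution cancels out, which is the usual motivation for this estimator.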

[AI-37] Navigation Variable-based Multi-objective Particle Swarm Optimization for UAV Path Planning with Kinematic Constraints

Link: https://arxiv.org/abs/2501.03261
Authors: Thi Thuy Ngan Duong, Duy-Nam Bui, Manh Duong Phung
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Path planning is essential for unmanned aerial vehicles (UAVs) as it determines the path that the UAV needs to follow to complete a task. This work addresses this problem by introducing a new algorithm called navigation variable-based multi-objective particle swarm optimization (NMOPSO). It first models path planning as an optimization problem via the definition of a set of objective functions that include optimality and safety requirements for UAV operation. The NMOPSO is then used to minimize those functions through Pareto optimal solutions. The algorithm features a new path representation based on navigation variables to include kinematic constraints and exploit the maneuverable characteristics of the UAV. It also includes an adaptive mutation mechanism to enhance the diversity of the swarm for better solutions. Comparisons with various algorithms have been carried out to benchmark the proposed approach. The results indicate that the NMOPSO performs better than not only other particle swarm optimization variants but also other state-of-the-art multi-objective and metaheuristic optimization algorithms. Experiments have also been conducted with real UAVs to confirm the validity of the approach for practical flights. The source code of the algorithm is available at this https URL.

[AI-38] AI-ANNE: (A) (N)eural (N)et for (E)xploration: Transferring Deep Learning Models onto Microcontrollers and Embedded Systems

Link: https://arxiv.org/abs/2501.03256
Authors: Dennis Klinkhammer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 tables

Abstract: This working paper explores the integration of neural networks onto resource-constrained embedded systems like a Raspberry Pi Pico / Raspberry Pi Pico 2. A TinyML approach transfers neural networks directly onto these microcontrollers, enabling real-time, low-latency, and energy-efficient inference while maintaining data privacy. To this end, AI-ANNE: (A) (N)eural (N)et for (E)xploration is presented, which facilitates the transfer of pre-trained models from high-performance platforms like TensorFlow and Keras onto microcontrollers, using a lightweight programming language like MicroPython. This approach demonstrates how neural network architectures, such as neurons, layers, density and activation functions can be implemented in MicroPython in order to deal with the computational limitations of embedded systems. Based on the Raspberry Pi Pico / Raspberry Pi Pico 2, two different neural networks on microcontrollers are presented for an example of data classification. As a further application example, such a microcontroller can be used for condition monitoring, where immediate corrective measures are triggered on the basis of sensor data. Overall, this working paper presents a very easy-to-implement way of using neural networks on energy-efficient devices such as microcontrollers. This makes AI-ANNE: (A) (N)eural (N)et for (E)xploration not only suited for practical use, but also as an educational tool with clear insights into how neural networks operate.
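
For a sense of what a dependency-free forward pass looks like on such a device, the sketch below implements a dense layer and two activation functions with plain lists, in a style that should also run under MicroPython. It is not AI-ANNE's code. The function names and the toy weights are hypothetical; in practice the weights would be copied from a model trained elsewhere.

```python
import math


def relu(x):
    """Rectified linear unit for a single value."""
    return x if x > 0.0 else 0.0


def sigmoid(x):
    """Logistic activation for a single value."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer.

    weights[j] holds the input weights of neuron j, biases[j] its bias.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


if __name__ == "__main__":
    # Hypothetical weights for a 2-2-1 network.
    hidden = dense([1.0, 2.0], [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu)
    output = dense(hidden, [[1.0, 1.0]], [-1.5], sigmoid)
    print(hidden, output)
```

Keeping everything as nested lists avoids any array library, which matters on boards with a few hundred kilobytes of RAM.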

[AI-39] Advanced Displacement Magnitude Prediction in Multi-Material Architected Lattice Structure Beams Using Physics Informed Neural Network Architecture

Link: https://arxiv.org/abs/2501.03254
Authors: Akshansh Mishra
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 34 pages, 19 figures

Abstract: This paper proposes an innovative method for predicting deformation in architected lattice structures that combines Physics-Informed Neural Networks (PINNs) with finite element analysis. A thorough study was carried out on FCC-based lattice beams utilizing five different materials (Structural Steel, AA6061, AA7075, Ti6Al4V, and Inconel 718) under varied edge loads (1000-10000 N). The PINN model blends data-driven learning with physics-based limitations via a proprietary loss function. PINN outperforms linear regression, achieving a higher R-squared (0.7923 vs 0.5686) and lower error metrics (MSE: 0.00017417 vs 0.00036187). Among the materials examined, AA6061 had the highest displacement sensitivity (0.1014 mm at maximum load), while Inconel 718 had better structural stability.
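
For readers comparing the two reported metrics, the sketch below computes MSE and R-squared from a list of targets and predictions using their standard definitions. The function names and the toy numbers are illustrative only and are not the paper's data.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]
    print(mse(targets, [1.0, 2.0, 3.0]), r_squared(targets, [1.0, 2.0, 3.0]))
    print(mse(targets, [2.0, 2.0, 2.0]), r_squared(targets, [2.0, 2.0, 2.0]))
```

A model that always predicts the mean of the targets scores an R-squared of zero, so the reported 0.7923 versus 0.5686 can be read as the share of displacement variance each model explains.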

[AI-40] Machine Learning and Deep Learning Techniques used in Cybersecurity and Digital Forensics: a Review

Link: https://arxiv.org/abs/2501.03250
Authors: Jaouhar Fattahi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract: In the fast-paced realms of cybersecurity and digital forensics, machine learning (ML) and deep learning (DL) have emerged as game-changing technologies that introduce methods to identify, stop, and analyze cyber risks. This review presents an overview of the ML and DL approaches used in these fields, showcasing their advantages, drawbacks, and possibilities. It covers a range of AI techniques used in spotting intrusions in systems and classifying malware to prevent cybersecurity attacks, detect anomalies, and enhance resilience. This study concludes by highlighting areas where further research is needed and suggesting ways to create transparent and scalable ML and DL solutions that are suited to the evolving landscape of cybersecurity and digital forensics.

[AI-41] Accuracy Can Lie: On the Impact of Surrogate Model in Configuration Tuning

Link: https://arxiv.org/abs/2501.01876
Authors: Pengzhou Chen, Jingzhi Gong, Tao Chen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by TSE

Abstract: To ease the expensive measurements during configuration tuning, it is natural to build a surrogate model as the replacement of the system, and thereby the configuration performance can be cheaply evaluated. Yet, a stereotype therein is that the higher the model accuracy, the better the tuning result would be. This “accuracy is all” belief drives our research community to build more and more accurate models and criticize a tuner for the inaccuracy of the model used. However, this practice raises some previously unaddressed questions, e.g., do those somewhat small accuracy improvements reported in existing work really matter much to the tuners? What role does model accuracy play in tuning quality? To answer those related questions, we conduct one of the largest-scale empirical studies to date, running 24/7 over a period of 13 months, that covers 10 models, 17 tuners, and 29 systems from existing works under four different commonly used metrics, leading to 13,612 cases of investigation. Surprisingly, our key findings reveal that accuracy can lie: there are a considerable number of cases where higher accuracy actually leads to no improvement in the tuning outcomes (up to 58% of cases under certain settings), or even worse, it can degrade the tuning quality (up to 24% of cases under certain settings). We also discover that the chosen models in most proposed tuners are sub-optimal and that the percentage of accuracy change required to significantly improve tuning quality varies according to the range of model accuracy. Drawing on fitness landscape analysis, we provide in-depth discussions of the underlying rationale, offering several lessons learned as well as insights for future opportunities. Most importantly, this work poses a clear message to the community: we should take one step back from the natural “accuracy is all” belief for model-based configuration tuning.

[AI-42] SelectiveFinetuning: Enhancing Transfer Learning in Sleep Staging through Selective Domain Alignment ICASSP2025

Link: https://arxiv.org/abs/2501.03764
Authors: Siyuan Zhao, Chenyu Liu, Yi Ding, Xinliang Zhou
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2025

Abstract: In practical sleep stage classification, a key challenge is the variability of EEG data across different subjects and environments. Differences in physiology, age, health status, and recording conditions can lead to domain shifts between data. These domain shifts often result in decreased model accuracy and reliability, particularly when the model is applied to new data with characteristics different from those it was originally trained on, which is a typical manifestation of negative transfer. To address this, we propose SelectiveFinetuning in this paper. Our method utilizes a pretrained Multi-Resolution Convolutional Neural Network (MRCNN) to extract EEG features, capturing the distinctive characteristics of different sleep stages. To mitigate the effect of domain shifts, we introduce a domain aligning mechanism that employs Earth Mover Distance (EMD) to evaluate and select source domain data closely matching the target domain. By finetuning the model with the selected source data, SelectiveFinetuning enhances the model’s performance on a target domain that exhibits domain shifts compared to the data used for training. Experimental results show that our method outperforms existing baselines, offering greater robustness and adaptability in practical scenarios where data distributions are often unpredictable.
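
The selection step can be pictured with a one-dimensional toy version: compute the Earth Mover Distance between each source domain's feature values and the target's, then keep the closest sources. The sketch below uses the fact that, for two equal-size 1-D samples, the distance equals the mean gap between sorted values. The scalar features, the domain names, and the function names are assumptions, and the paper operates on MRCNN features rather than scalars.

```python
def emd_1d(a, b):
    """Earth Mover Distance between two equal-size 1-D samples.

    For equal-size samples with uniform weights, the optimal transport plan
    matches sorted values, so the distance is the mean absolute gap.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def select_sources(sources, target, k):
    """Return the names of the k source domains closest to the target."""
    ranked = sorted(sources, key=lambda name: emd_1d(sources[name], target))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical per-subject feature summaries.
    source_domains = {
        "subject_1": [0.0, 1.0, 2.0],
        "subject_2": [10.0, 11.0, 12.0],
        "subject_3": [1.0, 2.0, 3.0],
    }
    print(select_sources(source_domains, [0.0, 1.0, 2.3], k=2))
```

Fine-tuning would then use only the selected sources, which is how the approach tries to avoid the negative transfer described in the abstract.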

[AI-43] Optimization Learning

Link: https://arxiv.org/abs/2501.03443
Authors: Pascal Van Hentenryck
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)

Abstract:This article introduces the concept of optimization learning, a methodology to design optimization proxies that learn the input/output mapping of parametric optimization problems. These optimization proxies are trustworthy by design: they compute feasible solutions to the underlying optimization problems, provide quality guarantees on the returned solutions, and scale to large instances. Optimization proxies are differentiable programs that combine traditional deep learning technology with repair or completion layers to produce feasible solutions. The article shows that optimization proxies can be trained end-to-end in a self-supervised way. It presents methodologies to provide performance guarantees and to scale optimization proxies to large-scale optimization problems. The potential of optimization proxies is highlighted through applications in power systems and, in particular, real-time risk assessment and security-constrained optimal power flow.

[AI-44] Neural networks consisting of DNA

Link: https://arxiv.org/abs/2501.03235
Authors: Michael te Vrugt
Categories: Biological Physics (physics.bio-ph); Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Notes: Book chapter, to appear in: Artificial Intelligence and Intelligent Matter, Springer, Cham

Abstract:Neural networks based on soft and biological matter constitute an interesting potential alternative to traditional implementations based on electric circuits. DNA is a particularly promising system in this context due to its natural ability to store information. In recent years, researchers have started to construct neural networks that are based on DNA. In this chapter, I provide a very basic introduction to the concept of DNA neural networks, aiming at an audience that is not familiar with biochemistry.

Machine Learning

[LG-0] A Survey on Federated Learning in Human Sensing

Link: https://arxiv.org/abs/2501.04000
Authors: Mohan Li, Martin Gjoreski, Pietro Barbiero, Gašper Slapničar, Mitja Luštrek, Nicholas D. Lane, Marc Langheinrich
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)

Abstract:Human Sensing, a field that leverages technology to monitor human activities, psycho-physiological states, and interactions with the environment, enhances our understanding of human behavior and drives the development of advanced services that improve overall quality of life. However, its reliance on detailed and often privacy-sensitive data as the basis for its machine learning (ML) models raises significant legal and ethical concerns. The recently proposed ML approach of Federated Learning (FL) promises to alleviate many of these concerns, as it is able to create accurate ML models without sending raw user data to a central server. While FL has demonstrated its usefulness across a variety of areas, such as text prediction and cyber security, its benefits in Human Sensing are under-explored, given the particular challenges in this domain. This survey conducts a comprehensive analysis of the current state-of-the-art studies on FL in Human Sensing, and proposes a taxonomy and an eight-dimensional assessment for FL approaches. Through the eight-dimensional assessment, we then evaluate whether the surveyed studies consider a specific FL-in-Human-Sensing challenge or not. Finally, based on the overall analysis, we discuss open challenges and highlight five research aspects related to FL in Human Sensing that require urgent research attention. Our work provides a comprehensive corpus of FL studies and aims to assist FL practitioners in developing and evaluating solutions that effectively address the real-world complexities of Human Sensing.

[LG-1] WAPTS: A Weighted Allocation Probability Adjusted Thompson Sampling Algorithm for High-Dimensional and Sparse Experiment Settings

Link: https://arxiv.org/abs/2501.03999
Authors: Haochen Song, Ilya Musabirov, Ananya Bhattacharjee, Audrey Durand, Meredith Franklin, Anna Rafferty, Joseph Jay Williams
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Aiming for more effective experiment design, such as in video content advertising where different content options compete for user engagement, these scenarios can be modeled as multi-arm bandit problems. In cases where limited interactions are available due to external factors, such as the cost of conducting experiments, recommenders often face constraints due to the small number of user interactions. In addition, there is a trade-off between selecting the best treatment and the ability to personalize and contextualize based on individual factors. A popular solution to this dilemma is the Contextual Bandit framework. It aims to maximize outcomes while incorporating personalization (contextual) factors, customizing treatments such as a user’s profile to individual preferences. Despite their advantages, Contextual Bandit algorithms face challenges like measurement bias and the ‘curse of dimensionality.’ These issues complicate the management of numerous interventions and often lead to data sparsity through participant segmentation. To address these problems, we introduce the Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS) algorithm. WAPTS builds on the contextual Thompson Sampling method by using a dynamic weighting parameter. This improves the allocation process for interventions and enables rapid optimization in data-sparse environments. We demonstrate the performance of our approach on different numbers of arms and effect sizes.
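For readers unfamiliar with the baseline that WAPTS extends, the following is a minimal sketch of plain (non-contextual) Thompson Sampling on a Bernoulli bandit. It is not WAPTS: it omits the contextual features and the dynamic weighting parameter mentioned in the abstract, whose exact form is not given here. The function name, the arm success rates, and the uniform Beta(1, 1) prior are illustrative assumptions.

```python
import random
from typing import List


def thompson_sampling(true_rates: List[float], rounds: int, seed: int = 0) -> List[int]:
    """Run Beta-Bernoulli Thompson Sampling and return the pull count per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    alpha = [1] * n_arms  # 1 + observed successes (uniform Beta(1, 1) prior)
    beta = [1] * n_arms   # 1 + observed failures
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Play the arm whose sampled rate is highest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Simulated environment: reward is 1 with the arm's true success rate.
        reward = 1 if rng.random() < true_rates[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


# Three hypothetical content options with assumed engagement rates.
pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
```

Because allocation follows the posterior, most pulls concentrate on the best arm as evidence accumulates, which is the allocation behaviour that the weighting in WAPTS is said to adjust for data-sparse settings.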

[LG-2] A precise asymptotic analysis of learning diffusion models: theory and insights

Link: https://arxiv.org/abs/2501.03937
Authors: Hugo Cui, Cengiz Pehlevan, Yue M. Lu
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)

Abstract:In this manuscript, we consider the problem of learning a flow or diffusion-based generative model parametrized by a two-layer auto-encoder, trained with online stochastic gradient descent, on a high-dimensional target density with an underlying low-dimensional manifold structure. We derive a tight asymptotic characterization of low-dimensional projections of the distribution of samples generated by the learned model, ascertaining in particular its dependence on the number of training samples. Building on this analysis, we discuss how mode collapse can arise, and lead to model collapse when the generative model is re-trained on generated synthetic data.

[LG-3] mFabric: An Efficient and Scalable Fabric for Mixture-of-Experts Training

Link: https://arxiv.org/abs/2501.03905
Authors: Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, Kai Chen
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Notes: Corresponding authors: zhizhenz@mit.edu (Z. Zhong), kaichen@cse. this http URL (K. Chen)

Abstract:Mixture-of-Experts (MoE) models outperform conventional models by selectively activating different subnets, named experts, on a per-token basis. This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain static during the distributed training process. In this paper, we advocate for a first-of-its-kind system, called mFabric, that unlocks topology reconfiguration during distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has strong locality, alleviating the requirement of global reconfiguration. Based on this, we design and implement a regionally reconfigurable high-bandwidth domain on top of existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We have built a fully functional mFabric prototype with commodity hardware and a customized collective communication runtime that trains state-of-the-art MoE models with in-training topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that mFabric delivers performance comparable to the non-blocking fat-tree fabric while boosting the training cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2× to 1.5× and 1.9× to 2.3× at 100 Gbps and 400 Gbps link bandwidths, respectively.

[LG-4] Stochastically Constrained Best Arm Identification with Thompson Sampling

Link: https://arxiv.org/abs/2501.03877
Authors: Le Yang, Siyang Gao, Cheng Li, Yi Wang
Categories: Machine Learning (cs.LG)
Notes: 30 pages, 12 figures, 1 table

Abstract:We consider the problem of the best arm identification in the presence of stochastic constraints, where there is a finite number of arms associated with multiple performance measures. The goal is to identify the arm that optimizes the objective measure subject to constraints on the remaining measures. We will explore the popular idea of Thompson sampling (TS) as a means to solve it. To the best of our knowledge, it is the first attempt to extend TS to this problem. We will design a TS-based sampling algorithm, establish its asymptotic optimality in the rate of posterior convergence, and demonstrate its superior performance using numerical examples.

[LG-5] Truthful mechanisms for linear bandit games with private contexts AAMAS2025

Link: https://arxiv.org/abs/2501.03865
Authors: Yiting Hu, Lingjie Duan
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Notes: To appear at AAMAS 2025

Abstract:The contextual bandit problem, where agents arrive sequentially with personal contexts and the system adapts its arm allocation decisions accordingly, has recently garnered increasing attention for enabling more personalized outcomes. However, in many healthcare and recommendation applications, agents have private profiles and may misreport their contexts to gain from the system. For example, in adaptive clinical trials, where hospitals sequentially recruit volunteers to test multiple new treatments and adjust plans based on volunteers’ reported profiles such as symptoms and interim data, participants may misreport severe side effects like allergy and nausea to avoid perceived suboptimal treatments. We are the first to study this issue of private context misreporting in a stochastic contextual bandit game between the system and non-repeated agents. We show that traditional low-regret algorithms, such as UCB family algorithms and Thompson sampling, fail to ensure truthful reporting and can result in linear regret in the worst case, while traditional truthful algorithms like explore-then-commit (ETC) and the ε-greedy algorithm incur sublinear but high regret. We propose a mechanism that uses a linear program to ensure truthfulness while minimizing deviation from Thompson sampling, yielding an O(ln T) frequentist regret. Our numerical experiments further demonstrate strong performance in multiple contexts and across other distribution families.

[LG-6] Symmetry and Generalisation in Machine Learning

Link: https://arxiv.org/abs/2501.03858
Authors: Hayder Elesedy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: PhD Thesis

Abstract:This work is about understanding the impact of invariance and equivariance on generalisation in supervised learning. We use the perspective afforded by an averaging operator to show that for any predictor that is not equivariant, there is an equivariant predictor with strictly lower test risk on all regression problems where the equivariance is correctly specified. This constitutes a rigorous proof that symmetry, in the form of invariance or equivariance, is a useful inductive bias. We apply these ideas to equivariance and invariance in random design least squares and kernel ridge regression respectively. This allows us to specify the reduction in expected test risk in more concrete settings and express it in terms of properties of the group, the model and the data. Along the way, we give examples and additional results to demonstrate the utility of the averaging operator approach in analysing equivariant predictors. In addition, we adopt an alternative perspective and formalise the common intuition that learning with invariant models reduces to a problem in terms of orbit representatives. The formalism extends naturally to a similar intuition for equivariant models. We conclude by connecting the two perspectives and giving some ideas for future work.
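The averaging idea in this abstract can be illustrated for the invariant case with a small sketch: given any predictor f and a finite group acting on the inputs, the averaged predictor x -> mean over g of f(g x) is invariant by construction. The choice of group (cyclic shifts of a vector), the predictor `f`, and all names below are illustrative assumptions and are not taken from the thesis, which treats general groups and the equivariant case as well.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cyclic_shifts(x: Vector) -> List[List[float]]:
    """All cyclic shifts of x, i.e. the orbit of x under the cyclic group."""
    n = len(x)
    return [[x[(i + s) % n] for i in range(n)] for s in range(n)]


def average_over_group(f: Callable[[Vector], float]) -> Callable[[Vector], float]:
    """Return the group-averaged predictor x -> mean over g of f(g x)."""
    def f_bar(x: Vector) -> float:
        orbit = cyclic_shifts(x)
        return sum(f(gx) for gx in orbit) / len(orbit)
    return f_bar


def f(x: Vector) -> float:
    """A predictor that is not invariant to cyclic shifts."""
    return x[0] + 2.0 * x[1]


f_bar = average_over_group(f)
x = [1.0, 2.0, 4.0]
shifted = [2.0, 4.0, 1.0]  # x shifted by one position

# f distinguishes x from its shift, while the averaged predictor does not.
values = (f(x), f(shifted), f_bar(x), f_bar(shifted))
```

The sketch only demonstrates that averaging produces an invariant predictor; the thesis's claim about strictly lower test risk concerns the expected error under a correctly specified symmetry and is not verified here.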

[LG-7] Leveraging time and parameters for nonlinear model reduction methods

Link: https://arxiv.org/abs/2501.03853
Authors: Silke Glas, Benjamin Unger
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:In this paper, we consider model order reduction (MOR) methods for problems with slowly decaying Kolmogorov n-widths as, e.g., certain wave-like or transport-dominated problems. To overcome this Kolmogorov barrier within MOR, nonlinear projections are used, which are often realized numerically using autoencoders. These autoencoders generally consist of a nonlinear encoder and a nonlinear decoder and involve costly training of the hyperparameters to obtain a good approximation quality of the reduced system. To facilitate the training process, we show that extending the to-be-reduced system and its corresponding training data makes it possible to replace the nonlinear encoder with a linear encoder without sacrificing accuracy, thus roughly halving the number of hyperparameters to be trained.

[LG-8] Machine learning applications in archaeological practices: a review

Link: https://arxiv.org/abs/2501.03840
Authors: Mathias Bellat, Jordy D. Orellana Figueroa, Jonathan S. Reeves, Ruhollah Taghizadeh-Mehrjardi, Claudio Tennie, Thomas Scholten
Categories: Machine Learning (cs.LG)

Abstract:Artificial intelligence and machine learning applications in archaeology have increased significantly in recent years, and these now span all subfields, geographical regions, and time periods. The prevalence and success of these applications have remained largely unexamined, as recent reviews on the use of machine learning in archaeology have focused only on specific subfields of archaeology. Our review examined an exhaustive corpus of 135 articles published between 1997 and 2022. We observed a significant increase in the number of relevant publications from 2019 onwards. Automatic structure detection and artefact classification were the most represented tasks in the articles reviewed, followed by taphonomy and archaeological predictive modelling. From the review, clustering and unsupervised methods were underrepresented compared to supervised models. Artificial neural networks and ensemble learning account for two thirds of the total number of models used. However, while machine learning is gaining in popularity, it remains subject to misunderstanding. We observed, in some cases, poorly defined requirements and caveats of the machine learning methods used. Furthermore, the goals and the needs of machine learning applications for archaeological purposes are in some cases unclear or poorly expressed. To address this, we proposed a workflow guide for archaeologists to develop coherent and consistent methodologies adapted to their research questions, project scale and data. As in many other areas, machine learning is rapidly becoming an important tool in archaeological research and practice, useful for the analyses of large and multivariate data, although not without limitations. This review highlights the importance of well-defined and well-reported structured methodologies and collaborative practices to maximise the potential of applications of machine learning methods in archaeology.

[LG-9] Vision Transformer Neural Architecture Search for Out-of-Distribution Generalization: Benchmark and Insights NEURIPS2024

Link: https://arxiv.org/abs/2501.03782
Authors: Sy-Tuyen Ho, Tuan Van Vo, Somayeh Ebrahimkhani, Ngai-Man Cheung
Categories: Machine Learning (cs.LG)
Notes: Accepted in NeurIPS 2024

Abstract:While ViTs have achieved success across machine learning tasks, deploying them in real-world scenarios faces a critical challenge: generalizing under OoD shifts. A crucial research gap exists in understanding how to design ViT architectures, both manually and automatically, for better OoD generalization. To this end, we introduce OoD-ViT-NAS, the first systematic benchmark for ViTs NAS focused on OoD generalization. This benchmark includes 3000 ViT architectures of varying computational budgets evaluated on 8 common OoD datasets. Using this benchmark, we analyze factors contributing to OoD generalization. Our findings reveal key insights. First, ViT architecture designs significantly affect OoD generalization. Second, ID accuracy is often a poor indicator of OoD accuracy, highlighting the risk of optimizing ViT architectures solely for ID performance. Third, we perform the first study of NAS for ViTs OoD robustness, analyzing 9 Training-free NAS methods. We find that existing Training-free NAS methods are largely ineffective in predicting OoD accuracy despite excelling at ID accuracy. Simple proxies like Param or Flop surprisingly outperform complex Training-free NAS methods in predicting OoD accuracy. Finally, we study how ViT architectural attributes impact OoD generalization and discover that increasing embedding dimensions generally enhances performance. Our benchmark shows that ViT architectures exhibit a wide range of OoD accuracy, with up to 11.85% improvement for some OoD shifts. This underscores the importance of studying ViT architecture design for OoD. We believe OoD-ViT-NAS can catalyze further research into how ViT designs influence OoD generalization.

[LG-10] Multi-label Cross-lingual automatic music genre classification from lyrics with Sentence BERT

Link: https://arxiv.org/abs/2501.03769
Authors: Tiago Fernandes Tavares, Fabio José Ayres
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 5 pages

Abstract:Music genres are shaped by both the stylistic features of songs and the cultural preferences of artists’ audiences. Automatic classification of music genres using lyrics can be useful in several applications such as recommendation systems, playlist creation, and library organization. We present a multi-label, cross-lingual genre classification system based on multilingual sentence embeddings generated by sBERT. Using a bilingual Portuguese-English dataset with eight overlapping genres, we demonstrate the system’s ability to train on lyrics in one language and predict genres in another. Our approach outperforms the baseline approach of translating lyrics and using a bag-of-words representation, improving the genrewise average F1-Score from 0.35 to 0.69. The classifier uses a one-vs-all architecture, enabling it to assign multiple genre labels to a single lyric. Experimental results reveal that dataset centralization notably improves cross-lingual performance. This approach offers a scalable solution for genre classification across underrepresented languages and cultural domains, advancing the capabilities of music information retrieval systems.

[LG-11] A Multimodal Lightweight Approach to Fault Diagnosis of Induction Motors in High-Dimensional Dataset

Link: https://arxiv.org/abs/2501.03746
Authors: Usman Ali
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)

Abstract:An accurate AI-based diagnostic system for induction motors (IMs) holds the potential to enhance proactive maintenance, mitigating unplanned downtime and curbing overall maintenance costs within an industrial environment. Notably, among the prevalent faults in IMs, a Broken Rotor Bar (BRB) fault is frequently encountered. Researchers have proposed various fault diagnosis approaches using signal processing (SP), machine learning (ML), deep learning (DL), and hybrid architectures for BRB faults. One limitation in the existing literature is the training of these architectures on relatively small datasets, risking overfitting when implementing such systems in industrial environments. This paper addresses this limitation by training a transfer-learning-based lightweight DL model, ShuffleNetV2, on large-scale BRB fault data to diagnose one, two, three, and four BRB faults using current and vibration signal data. Spectral images for training and testing are generated using a Short-Time Fourier Transform (STFT). The dataset comprises 57,500 images, with 47,500 used for training and 10,000 for testing. Remarkably, the ShuffleNetV2 model exhibited superior performance, incurring lower computational cost while accurately classifying 98.856% of spectral images. To further enhance the visualization of harmonic sidebands resulting from broken bars, Fast Fourier Transform (FFT) is applied to current and vibration data. The paper also provides insights into the training and testing times for each model, contributing to a comprehensive understanding of the proposed fault diagnosis methodology. The findings of our research provide valuable insights into the performance and efficiency of different ML and DL models, offering a foundation for the development of robust fault diagnosis systems for induction motors in industrial settings.

[LG-12] Deep Networks are Reproducing Kernel Chains

Link: https://arxiv.org/abs/2501.03697
Authors: Tjeerd Jan Heeringa, Len Spek, Christoph Brune
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)
Notes: 25 pages, 3 figures

Abstract:Identifying an appropriate function space for deep neural networks remains a key open question. While shallow neural networks are naturally associated with Reproducing Kernel Banach Spaces (RKBS), deep networks present unique challenges. In this work, we extend RKBS to chain RKBS (cRKBS), a new framework that composes kernels rather than functions, preserving the desirable properties of RKBS. We prove that any deep neural network function is a neural cRKBS function, and conversely, any neural cRKBS function defined on a finite dataset corresponds to a deep neural network. This approach provides a sparse solution to the empirical risk minimization problem, requiring no more than N neurons per layer, where N is the number of data points.

[LG-13] Imitation Learning of MPC with Neural Networks: Error Guarantees and Sparsification

Link: https://arxiv.org/abs/2501.03671
Authors: Hendrik Alsmeier, Lukas Theiner, Anton Savchenko, Ali Mesbah, Rolf Findeisen
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:This paper presents a framework for bounding the approximation error in imitation model predictive controllers utilizing neural networks. Leveraging the Lipschitz properties of these neural networks, we derive a bound that guides dataset design to ensure the approximation error remains at chosen limits. We discuss how this method can be used to design a stable neural network controller with performance guarantees employing existing robust model predictive control approaches for data generation. Additionally, we introduce a training adjustment, which is based on the sensitivities of the optimization problem and reduces dataset density requirements based on the derived bounds. We verify that the proposed augmentation results in improvements to the network’s predictive capabilities and a reduction of the Lipschitz constant. Moreover, on a simulated inverted pendulum problem, we show that the approach results in a closer match of the closed-loop behavior between the imitation and the original model predictive controller.

[LG-14] Hybrid Machine Learning Model with a Constrained Action Space for Trajectory Prediction

Link: https://arxiv.org/abs/2501.03666
Authors: Alexander Fertig, Lakshman Balasubramanian, Michael Botsch
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Notes: Submitted to 2025 IEEE Intelligent Vehicles Symposium (IV)

Abstract:Trajectory prediction is crucial for advancing autonomous driving and improving its safety and efficiency. Although end-to-end models based on deep learning have great potential, they often do not consider vehicle dynamic limitations, leading to unrealistic predictions. To address this problem, this work introduces a novel hybrid model that combines deep learning with a kinematic motion model. It is able to predict object attributes such as acceleration and yaw rate and generate trajectories based on them. A key contribution is the incorporation of expert knowledge into the learning objective of the deep learning model. This results in the constraint of the available action space, thus enabling the prediction of physically feasible object attributes and trajectories, thereby increasing safety and robustness. The proposed hybrid model facilitates enhanced interpretability, thereby reinforcing the trustworthiness of deep learning methods and promoting the development of safe planning solutions. Experiments conducted on the publicly available real-world Argoverse dataset demonstrate realistic driving behaviour, with benchmark comparisons and ablation studies showing promising results.

[LG-15] Data Augmentation for Deep Learning Regression Tasks by Machine Learning Models

Link: https://arxiv.org/abs/2501.03654
Authors: Assaf Shmuel, Oren Glickman, Teddy Lazebnik
Categories: Machine Learning (cs.LG)

Abstract:Deep learning (DL) models have gained prominence in domains such as computer vision and natural language processing but remain underutilized for regression tasks involving tabular data. In these cases, traditional machine learning (ML) models often outperform DL models. In this study, we propose and evaluate various data augmentation (DA) techniques to improve the performance of DL models for tabular data regression tasks. We compare the performance gain of Neural Networks by different DA strategies ranging from a naive method of duplicating existing observations and adding noise to a more sophisticated DA strategy that preserves the underlying statistical relationship in the data. Our analysis demonstrates that the advanced DA method significantly improves DL model performance across multiple datasets and regression tasks, resulting in an average performance increase of over 10% compared to baseline models without augmentation. The efficacy of these DA strategies was rigorously validated across 30 distinct datasets, with multiple iterations and evaluations using three different automated deep learning (AutoDL) frameworks: AutoKeras, H2O, and AutoGluon. This study demonstrates that by leveraging advanced DA techniques, DL models can realize their full potential in regression tasks, thereby contributing to broader adoption and enhanced performance in practical applications.
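The naive baseline mentioned in this abstract, duplicating existing observations and adding noise, might look roughly like the sketch below. It is an illustration only: the noise model (zero-mean Gaussian scaled by each column's standard deviation), the `noise_scale` value, and the function name are assumptions, and the paper's more sophisticated strategy that preserves statistical relationships is not shown.

```python
import random
import statistics
from typing import List, Tuple

Row = List[float]


def augment_with_noise(
    X: List[Row], y: List[float], copies: int, noise_scale: float = 0.05, seed: int = 0
) -> Tuple[List[Row], List[float]]:
    """Append `copies` noisy duplicates of every row; targets are copied unchanged."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Per-column standard deviation, so noise is proportional to each feature's spread.
    stds = [statistics.pstdev([row[j] for row in X]) for j in range(n_features)]
    X_aug = [list(row) for row in X]
    y_aug = list(y)
    for _ in range(copies):
        for row, target in zip(X, y):
            noisy = [v + rng.gauss(0.0, noise_scale * s) for v, s in zip(row, stds)]
            X_aug.append(noisy)
            y_aug.append(target)
    return X_aug, y_aug


# Tiny hypothetical tabular regression set (assumed data).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y = [1.5, 2.5, 3.5]
X_aug, y_aug = augment_with_noise(X, y, copies=2)
```

Scaling the noise per column keeps features with very different ranges from being perturbed by the same absolute amount; whether the paper does this is not stated in the abstract.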

[LG-16] Coupled Hierarchical Structure Learning using Tree-Wasserstein Distance

Link: https://arxiv.org/abs/2501.03627
Authors: Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In many applications, both data samples and features have underlying hierarchical structures. However, existing methods for learning these latent structures typically focus on either samples or features, ignoring possible coupling between them. In this paper, we introduce a coupled hierarchical structure learning method using tree-Wasserstein distance (TWD). Our method jointly computes TWDs for samples and features, representing their latent hierarchies as trees. We propose an iterative, unsupervised procedure to build these sample and feature trees based on diffusion geometry, hyperbolic geometry, and wavelet filters. We show that this iterative procedure converges and empirically improves the quality of the constructed trees. The method is also computationally efficient and scales well in high-dimensional settings. Our method can be seamlessly integrated with hyperbolic graph convolutional networks (HGCN). We demonstrate that our method outperforms competing approaches in sparse approximation and unsupervised Wasserstein distance learning on several word-document and single-cell RNA-sequencing datasets. In addition, integrating our method into HGCN enhances performance in link prediction and node classification tasks.

[LG-17] AADNet: Exploring EEG Spatiotemporal Information for Fast and Accurate Orientation and Timbre Detection of Auditory Attention Based on A Cue-Masked Paradigm

Link: https://arxiv.org/abs/2501.03571
Authors: Keren Shi, Xu Liu, Xue Yuan, Haijie Shang, Ruiting Dai, Hanbin Wang, Yunfa Fu, Ning Jiang, Jiayuan He
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)

Abstract:Auditory attention decoding from electroencephalogram (EEG) could infer to which source the user is attending in noisy environments. Decoding algorithms and experimental paradigm designs are crucial for the development of technology in practical applications. To simulate real-world scenarios, this study proposed a cue-masked auditory attention paradigm to avoid information leakage before the experiment. To obtain high decoding accuracy with low latency, an end-to-end deep learning model, AADNet, was proposed to exploit the spatiotemporal information from the short time window of EEG signals. The results showed that with a 0.5-second EEG window, AADNet achieved an average accuracy of 93.46% and 91.09% in decoding auditory orientation attention (OA) and timbre attention (TA), respectively. It significantly outperformed five previous methods and did not need the knowledge of the original audio source. This work demonstrated that it was possible to detect the orientation and timbre of auditory attention from EEG signals fast and accurately. The results are promising for the real-time multi-property auditory attention decoding, facilitating the application of the neuro-steered hearing aids and other assistive listening devices.

[LG-18] Advanced Tutorial: Label-Efficient Two-Sample Tests

Link: https://arxiv.org/abs/2501.03568
Authors: Weizhi Li, Visar Berisha, Gautam Dasarathy
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Hypothesis testing is a statistical inference approach used to determine whether data supports a specific hypothesis. An important type is the two-sample test, which evaluates whether two sets of data points are from identical distributions. This test is widely used, such as by clinical researchers comparing treatment effectiveness. This tutorial explores two-sample testing in a context where an analyst has many features from two samples, but determining the sample membership (or labels) of these features is costly. In machine learning, a similar scenario is studied in active learning. This tutorial extends active learning concepts to two-sample testing within this label-costly setting while maintaining statistical validity and high testing power. Additionally, the tutorial discusses practical applications of these label-efficient two-sample tests.

[LG-19] Multi-Source Urban Traffic Flow Forecasting with Drone and Loop Detector Data

Link: https://arxiv.org/abs/2501.03492
Authors: Weijiang Xiong, Robert Fonod, Alexandre Alahi, Nikolas Geroliminis
Categories: Machine Learning (cs.LG)

Abstract:Traffic forecasting is a fundamental task in transportation research; however, current research has mainly focused on a single data modality, loop detectors. Recently, the advances in Artificial Intelligence and drone technologies have made possible novel solutions for efficient, accurate and flexible aerial observations of urban traffic. As a promising traffic monitoring approach, drone-captured data can create an accurate multi-sensor mobility observatory for large-scale urban networks, when combined with existing infrastructure. Therefore, this paper investigates the problem of multi-source traffic speed prediction, simultaneously using drone and loop detector data. A simple yet effective graph-based model HiMSNet is proposed to integrate multiple data modalities and learn spatio-temporal correlations. Detailed analysis shows that predicting accurate segment-level speed is more challenging than the regional speed, especially under high-demand scenarios with heavier congestion and varying traffic dynamics. Utilizing both drone and loop detector data, the prediction accuracy can be improved compared to single-modality cases, when the sensors have lower coverage and are subject to noise. Our simulation study based on vehicle trajectories in a real urban road network has highlighted the added value of integrating drones in traffic forecasting and monitoring.

[LG-20] Entropy-Guided Attention for Private LLMs AAAI

Link: https://arxiv.org/abs/2501.03489
Authors: Nandan Kumar Jha, Brandon Reagen
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: The 6th AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI), 2025. arXiv admin note: substantial text overlap with arXiv:2410.13060

Abstract:The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users’ sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon’s entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: entropy collapse in deeper layers that destabilizes training, and entropic overload in earlier layers that leads to under-utilization of Multi-Head Attention’s (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at this https URL (entropy-guided-llm).

[LG-21] A study on performance limitations in Federated Learning

Link: https://arxiv.org/abs/2501.03477
Authors: Karthik Mohan
Subjects: Machine Learning (cs.LG)
Comments: archive 2021 work

Abstract:Increasing privacy concerns and unrestricted access to data have led to the development of a novel machine learning paradigm called Federated Learning (FL). FL borrows many ideas from distributed machine learning; however, the challenges associated with federated learning make it an interesting engineering problem, since the models are trained on edge devices. It was introduced in 2016 by Google, and since then active research has been carried out in different areas within FL, such as federated optimization algorithms, model and update compression, differential privacy, robustness and attacks, federated GANs, and privacy-preserving personalization. There are many open challenges in the development of such federated machine learning systems, and this project focuses on the communication bottleneck and data non-IID-ness and their effect on model performance. These issues are characterized on a baseline model, model performance is evaluated, and ways to overcome these issues are discussed.

[LG-22] Optimizing Value of Learning in Task-Oriented Federated Meta-Learning Systems

Link: https://arxiv.org/abs/2501.03448
Authors: Bibo Wu, Fang Fang, Xianbin Wang
Subjects: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) has gained significant attention in recent years due to its distributed nature and privacy preserving benefits. However, a key limitation of conventional FL is that it learns and distributes a common global model to all participants, which fails to provide customized solutions for diverse task requirements. Federated meta-learning (FML) offers a promising solution to this issue by enabling devices to finetune local models after receiving a shared meta-model from the server. In this paper, we propose a task-oriented FML framework over non-orthogonal multiple access (NOMA) networks. A novel metric, termed value of learning (VoL), is introduced to assess the individual training needs across devices. Moreover, a task-level weight (TLW) metric is defined based on task requirements and fairness considerations, guiding the prioritization of edge devices during FML training. The formulated problem, to maximize the sum of TLW-based VoL across devices, forms a non-convex mixed-integer non-linear programming (MINLP) challenge, addressed here using a parameterized deep Q-network (PDQN) algorithm to handle both discrete and continuous variables. Simulation results demonstrate that our approach significantly outperforms baseline schemes, underscoring the advantages of the proposed framework.

[LG-23] Physics-Constrained Generative Artificial Intelligence for Rapid Takeoff Trajectory Design

Link: https://arxiv.org/abs/2501.03445
Authors: Samuel Sisk, Xiaosong Du
Subjects: Machine Learning (cs.LG)
Comments: Conference version with 10 pages and 7 figures

Abstract:To aid urban air mobility (UAM), electric vertical takeoff and landing (eVTOL) aircraft are being targeted. Conventional multidisciplinary analysis and optimization (MDAO) can be expensive, while surrogate-based optimization can struggle with challenging physical constraints. This work proposes physics-constrained generative adversarial networks (physicsGAN), to intelligently parameterize the takeoff control profiles of an eVTOL aircraft and to transform the original design space to a feasible space. Specifically, the transformed feasible space refers to a space where all designs directly satisfy all design constraints. The physicsGAN-enabled surrogate-based takeoff trajectory design framework was demonstrated on the Airbus A3 Vahana. The physicsGAN generated only feasible control profiles of power and wing angle in the feasible space with around 98.9% of designs satisfying all constraints. The proposed design framework obtained 99.6% accuracy compared with simulation-based optimal design and took only 2.2 seconds, which reduced the computational time by around 200 times. Meanwhile, data-driven GAN-enabled surrogate-based optimization took 21.9 seconds using a derivative-free optimizer, which was around an order of magnitude slower than the proposed framework. Moreover, the data-driven GAN-based optimization using gradient-based optimizers could not consistently find the optimal design during random trials and got stuck in an infeasible region, which is problematic in real practice. Therefore, the proposed physicsGAN-based design framework outperformed data-driven GAN-based design to the extent of efficiency (2.2 seconds), optimality (99.6% accurate), and feasibility (100% feasible). According to the literature review, this is the first physics-constrained generative artificial intelligence enabled by surrogate models.

[LG-24] Mixture-of-Experts Graph Transformers for Interpretable Particle Collision Detection

Link: https://arxiv.org/abs/2501.03432
Authors: Donatella Genovese, Alessandro Sgroi, Alessio Devoto, Samuel Valentine, Lennox Wood, Cristiano Sebastiani, Stefano Giagu, Monica D’Onofrio, Simone Scardapane
Subjects: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)

Abstract:The Large Hadron Collider at CERN produces immense volumes of complex data from high-energy particle collisions, demanding sophisticated analytical techniques for effective interpretation. Neural Networks, including Graph Neural Networks, have shown promise in tasks such as event classification and object identification by representing collisions as graphs. However, while Graph Neural Networks excel in predictive accuracy, their “black box” nature often limits their interpretability, making it difficult to trust their decision-making processes. In this paper, we propose a novel approach that combines a Graph Transformer model with Mixture-of-Expert layers to achieve high predictive performance while embedding interpretability into the architecture. By leveraging attention maps and expert specialization, the model offers insights into its internal decision-making, linking predictions to physics-informed features. We evaluate the model on simulated events from the ATLAS experiment, focusing on distinguishing rare Supersymmetric signal events from Standard Model background. Our results highlight that the model achieves competitive classification accuracy while providing interpretable outputs that align with known physics, demonstrating its potential as a robust and transparent tool for high-energy physics data analysis. This approach underscores the importance of explainability in machine learning methods applied to high energy physics, offering a path toward greater trust in AI-driven discoveries.

[LG-25] Low-Order Flow Reconstruction and Uncertainty Quantification in Disturbed Aerodynamics Using Sparse Pressure Measurements

Link: https://arxiv.org/abs/2501.03406
Authors: Hanieh Mousavi, Jeff D. Eldredge
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:This paper presents a novel machine-learning framework for reconstructing low-order gust-encounter flow field and lift coefficients from sparse, noisy surface pressure measurements. Our study thoroughly investigates the time-varying response of sensors to gust-airfoil interactions, uncovering valuable insights into optimal sensor placement. To address uncertainties in deep learning predictions, we implement probabilistic regression strategies to model both epistemic and aleatoric uncertainties. Epistemic uncertainty, reflecting the model’s confidence in its predictions, is modeled using Monte Carlo dropout, as an approximation to the variational inference in the Bayesian framework, treating the neural network as a stochastic entity. On the other hand, aleatoric uncertainty, arising from noisy input measurements, is captured via learned statistical parameters, which propagates measurement noise through the network into the final predictions. Our results showcase the efficacy of this dual uncertainty quantification strategy in accurately predicting aerodynamic behavior under extreme conditions while maintaining computational efficiency, underscoring its potential to improve online sensor-based flow estimation in real-world applications.

[LG-26] Detecting Defective Wafers Via Modular Networks

Link: https://arxiv.org/abs/2501.03368
Authors: Yifeng Zhang, Bryan Baker, Shi Chen, Chao Zhang, Yu Huang, Qi Zhao, Sthitie Bom
Subjects: Machine Learning (cs.LG)

Abstract:The growing availability of sensors within semiconductor manufacturing processes makes it feasible to detect defective wafers with data-driven models. Without directly measuring the quality of semiconductor devices, they capture the modalities between diverse sensor readings and can be used to predict key quality indicators (KQI, e.g., roughness, resistance) to detect faulty products, significantly reducing the capital and human cost in maintaining physical metrology steps. Nevertheless, existing models pay little attention to the correlations among different processes for diverse wafer products and commonly struggle with generalizability issues. To enable generic fault detection, in this work, we propose a modular network (MN) trained using time series stage-wise datasets that embodies the structure of the manufacturing process. It decomposes KQI prediction as a combination of stage modules to simulate compositional semiconductor manufacturing, universally enhancing faulty wafer detection among different wafer types and manufacturing processes. Extensive experiments demonstrate the usefulness of our approach, and shed light on how the compositional design provides an interpretable interface for more practical applications.

[LG-27] Data integrity vs. inference accuracy in large AIS datasets

Link: https://arxiv.org/abs/2501.03358
Authors: Adam Kiersztyn, Dariusz Czerwiński, Aneta Oniszczuk-Jastrzabek, Ernest Czermański, Agnieszka Rzepka
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, conference: The International Conference on Business and Digital Technology (ICBDT2024)

Abstract:Automatic Ship Identification Systems (AIS) play a key role in monitoring maritime traffic, providing the data necessary for analysis and decision-making. The integrity of this data is fundamental to the correctness of inference and decision-making in the context of maritime safety, traffic management and environmental protection. This paper analyzes the impact of data integrity in large AIS datasets on classification accuracy. It also presents error detection and correction methods and data verification techniques that can improve the reliability of AIS systems. The results show that improving the integrity of AIS data significantly improves the quality of inference, which has a direct impact on operational efficiency and safety at sea.

[LG-28] The Robustness of Spiking Neural Networks in Federated Learning with Compression Against Non-omniscient Byzantine Attacks

Link: https://arxiv.org/abs/2501.03306
Authors: Manh V. Nguyen, Liang Zhao, Bobin Deng, Shaoen Wu
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Spiking Neural Networks (SNNs), which offer exceptional energy efficiency for inference, combined with Federated Learning (FL), which offers privacy-preserving distributed training, form a rising area of interest that is highly beneficial for Internet of Things (IoT) devices. Despite this, research that tackles Byzantine attacks and bandwidth limitations in FL-SNNs, both of which pose significant threats to model convergence and training times, remains largely unexplored. Going beyond proposing a solution for both of these problems, in this work we highlight the dual benefits of FL-SNNs over FL-ANNs: robustness against non-omniscient Byzantine adversaries (adversaries whose access to local clients' datasets is restricted) and greater communication efficiency. Specifically, we discovered that a simple integration of Top-$\kappa$ sparsification into the FL apparatus can help leverage the advantages of SNN models, both greatly reducing bandwidth usage and significantly boosting the robustness of FL training against non-omniscient Byzantine adversaries. Most notably, we saw a massive improvement of roughly 40% accuracy gain in FL-SNN training under the lethal MinMax attack.
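
To make the compression step mentioned in the abstract concrete, here is a minimal sketch of generic top-k sparsification of a client update: keep the k largest-magnitude entries and zero the rest. The function name `top_k_sparsify`, the flat-list representation, and the toy update are illustrative assumptions; the paper's exact variant and how it integrates with SNN training are not specified in the abstract.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries of a flat update; zero out the rest."""
    if k <= 0:
        return [0.0] * len(update)
    # Indices of the k entries with the largest absolute value.
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(update)]


# Hypothetical client update, for illustration only.
update = [0.1, -2.0, 0.05, 1.5, -0.3]
sparse = top_k_sparsify(update, k=2)

# A client would then transmit only the (index, value) pairs of the kept entries.
payload = [(i, v) for i, v in enumerate(sparse) if v != 0.0]
```

Sending `payload` instead of the dense `update` is where the bandwidth saving comes from: the message size scales with k rather than with the model size.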

[LG-29] LiLMaps: Learnable Implicit Language Maps

Link: https://arxiv.org/abs/2501.03304
Authors: Evgenii Kruzhkov, Sven Behnke
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:One of the current trends in robotics is to employ large language models (LLMs) to provide non-predefined command execution and natural human-robot interaction. It is useful to have an environment map together with its language representation, which can be further utilized by LLMs. Such a comprehensive scene representation enables numerous ways of interaction with the map for autonomously operating robots. In this work, we present an approach that enhances incremental implicit mapping through the integration of vision-language features. Specifically, we (i) propose a decoder optimization technique for implicit language maps which can be used when new objects appear on the scene, and (ii) address the problem of inconsistent vision-language predictions between different viewing positions. Our experiments demonstrate the effectiveness of LiLMaps and solid improvements in performance.

[LG-30] Method of data forward generation with partial differential equations for machine learning modeling in fluid mechanics

Link: https://arxiv.org/abs/2501.03300
Authors: Ruilin Chen, Xiaowei Jin, Nikolaus A. Adams, Hui Li
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Artificial intelligence (AI) for fluid mechanics has become an attractive topic. High-fidelity data is one of the most critical requirements for the successful application of AI in fluid mechanics; however, it is expensive to obtain or even inaccessible. This study proposes a highly efficient forward data generation method based on partial differential equations (PDEs). Specifically, the solutions of the PDEs are first generated following either a random field (e.g., a Gaussian random field, GRF, with computational complexity $O(N \log N)$, where $N$ is the number of spatial points) or physical laws (e.g., a kind of spectra, with computational complexity $O(NM)$, where $M$ is the number of modes); then the source terms, boundary conditions and initial conditions are computed to satisfy the PDEs. Thus, data pairs of source terms, boundary conditions and initial conditions with the corresponding PDE solutions can be constructed. A Poisson neural network (Poisson-NN) embedded in the projection method and a wavelet transform convolutional neural network (WTCNN) embedded in multigrid numerical simulation are proposed for solving the incompressible Navier-Stokes equations. The feasibility of using the generated data to train Poisson-NN and WTCNN is validated. The results indicate that even without any DNS data, the generated data can train these two models with excellent generalization and accuracy. Data following physical laws yield a significantly better convergence rate, generalization and accuracy than data generated following a GRF.
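
The forward-generation idea in the abstract (pick the solution first, then derive the inputs that make it satisfy the PDE) can be sketched on a toy problem. The example below assumes a 1D Poisson equation -u'' = f on [0, 1] with zero boundary values and builds u from a few sine modes. The equation, the decaying amplitude spectrum, and the name `generate_pair` are illustrative assumptions, not the paper's setup.

```python
import math
import random


def generate_pair(n_points=64, n_modes=4, seed=0):
    """Sample a solution u of -u'' = f on [0, 1] first, then derive f from it."""
    rng = random.Random(seed)
    # Random mode amplitudes with a decaying spectrum (an illustrative choice).
    amps = [rng.gauss(0.0, 1.0) / k**2 for k in range(1, n_modes + 1)]
    xs = [i / (n_points - 1) for i in range(n_points)]
    # Solution: a sum of sine modes, which vanishes at both boundaries.
    u = [sum(a * math.sin(k * math.pi * x) for k, a in enumerate(amps, start=1))
         for x in xs]
    # Source term: apply -d^2/dx^2 to each sine mode analytically.
    f = [sum(a * (k * math.pi) ** 2 * math.sin(k * math.pi * x)
             for k, a in enumerate(amps, start=1))
         for x in xs]
    return xs, u, f


xs, u, f = generate_pair()
```

Each call yields one (f, u) training pair without running a solver, at a cost proportional to the number of points times the number of modes, which mirrors the $O(NM)$ cost quoted for the spectrum-based option.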

[LG-31] Adaptive Pruning of Pretrained Transformer via Differential Inclusions

Link: https://arxiv.org/abs/2501.03289
Authors: Yizhuo Ding, Ke Fan, Yikai Wang, Xinwei Sun, Yanwei Fu
Subjects: Machine Learning (cs.LG)

Abstract:Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure. Therefore, the solution path identifies a Transformer weight family with various sparsity levels, offering greater flexibility and customization. In this paper, we introduce such an effective pruning method, termed SPP (Solution Path Pruning). To achieve effective pruning, we segment the transformers into paired modules, including query-key pairs, value-projection pairs, and sequential linear layers, and apply low-rank compression to these pairs, maintaining the output structure while enabling structural compression within the inner states. Extensive experiments conducted on various well-known transformer backbones have demonstrated the efficacy of SPP.

[LG-32] Inverse Design of Optimal Stern Shape with Convolutional Neural Network-based Pressure Distribution

Link: https://arxiv.org/abs/2501.03286
Authors: Sang-jin Oh, Ju Young Kang, Kyungryeong Pak, Heejung Kim, Sung-chul Shin
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Hull form design is an iterative process wherein the performance of the hull form needs to be checked via computational fluid dynamics calculations or model experiments. The stern shape has to undergo a process wherein hull form variations based on the pressure distribution analysis results are repeated until the resistance and propulsion efficiency meet the design requirements. In this study, the designer specifies a pressure distribution that meets the design requirements, and this paper proposes an inverse design algorithm that estimates the stern shape from it using deep learning. A convolutional neural network was used to extract the features of the pressure distribution expressed as a contour, whereas a multi-task learning model was used to estimate various sections of the stern shape. We estimated the stern shape indirectly by estimating the control points of the B-spline and comparing the actual and converted offsets for each section; the performance was verified, and an inverse design is proposed herein.

[LG-33] Sensorformer: Cross-patch attention with global-patch compression is effective for high-dimensional multivariate time series forecasting

Link: https://arxiv.org/abs/2501.03284
Authors: Liyang Qin, Xiaoli Wang, Chunhua Yang, Huaiwen Zou, Haochuan Zhang
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 15 figures

Abstract:Among the existing Transformer-based multivariate time series forecasting methods, iTransformer, which treats each variable sequence as a token and only explicitly extracts cross-variable dependencies, and PatchTST, which adopts a channel-independent strategy and only explicitly extracts cross-time dependencies, both significantly outperform most Channel-Dependent Transformers that simultaneously extract cross-time and cross-variable dependencies. This indicates that existing Transformer-based multivariate time series forecasting methods still struggle to effectively fuse these two types of information. We attribute this issue to the dynamic time lags in the causal relationships between different variables. Therefore, we propose a new multivariate time series forecasting Transformer, Sensorformer, which first compresses the global patch information and then simultaneously extracts cross-variable and cross-time dependencies from the compressed representations. Sensorformer can effectively capture the correct inter-variable correlations and causal relationships, even in the presence of dynamic causal lags between variables, while also reducing the computational complexity of pure cross-patch self-attention from $O(D^2 \cdot \text{Patch\_num}^2 \cdot d_{\text{model}})$ to $O(D^2 \cdot \text{Patch\_num} \cdot d_{\text{model}})$. Extensive comparative and ablation experiments on 9 mainstream real-world multivariate time series forecasting datasets demonstrate the superiority of Sensorformer. The implementation of Sensorformer, following the style of the Time-series-library and scripts for reproducing the main results, is publicly available at this https URL

[LG-34] Efficacy of Full-Packet Encryption in Mitigating Protocol Detection for Evasive Virtual Private Networks

Link: https://arxiv.org/abs/2412.17352
Authors: Amy Iris Parker
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 3 figures, target conference undecided

Abstract:Full-packet encryption is a technique used by modern evasive Virtual Private Networks (VPNs) to avoid protocol-based flagging from censorship models by disguising their traffic as random noise on the network. Traditional methods for censoring full-packet-encryption-based VPN protocols require accepting a substantial amount of collateral damage, as other non-VPN network traffic that appears random will be blocked. I tested several machine learning-based classification models against the Aggressive Circumvention of Censorship (ACC) protocol, a fully-encrypted evasive VPN protocol which merges strategies from a wide variety of currently in-use evasive VPN protocols. My testing found that while ACC was able to survive the models when compared to random noise, it was easily detectable with minimal collateral damage using several different machine learning models when within a stream of regular network traffic. While resistant to the current techniques deployed by nation-state censors, the ACC protocol and other evasive protocols are potentially subject to packet-based protocol identification utilizing similar classification models.

[LG-35] Class-Balance Bias in Regularized Regression

Link: https://arxiv.org/abs/2501.03821
Authors: Johan Larsson, Jonas Wallin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 27 pages, 21 figures

Abstract:Regularized models are often sensitive to the scales of the features in the data, and it has therefore become standard practice to normalize (center and scale) the features before fitting the model. But there are many different ways to normalize the features, and the choice may have dramatic effects on the resulting model. In spite of this, there has so far been no research on this topic. In this paper, we begin to bridge this knowledge gap by studying normalization in the context of lasso, ridge, and elastic net regression. We focus on normal and binary features and show that the class balance of binary features directly influences the regression coefficients and that this effect depends on the combination of normalization and regularization methods used. We demonstrate that this effect can be mitigated by scaling binary features with their variance in the case of the lasso and with their standard deviation in the case of ridge regression, but that this comes at the cost of increased variance. For the elastic net, we show that scaling the penalty weights, rather than the features, can achieve the same effect. Finally, we also tackle mixes of binary and normal features as well as interactions and provide some initial results on how to normalize features in these cases.
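
The two scalings named in the abstract can be illustrated with a short sketch. For a 0/1 feature whose share of ones is q, the population variance is q(1 - q) and the standard deviation is its square root. The helper `scale_binary` and its `power` parameter are illustrative assumptions: `power=0.5` divides by the standard deviation (the ridge case in the abstract) and `power=1` divides by the variance (the lasso case).

```python
def scale_binary(x, power):
    """Center a 0/1 feature and divide by (q * (1 - q)) ** power, q = share of ones."""
    q = sum(x) / len(x)
    s = (q * (1 - q)) ** power
    return [(v - q) / s for v in x]


# Hypothetical features, for illustration only.
balanced = [0, 1, 0, 1]          # q = 0.5
imbalanced = [0] * 9 + [1]       # q = 0.1

std_scaled = scale_binary(balanced, power=0.5)   # maps the balanced feature to -1 / +1
var_scaled = scale_binary(balanced, power=1)     # maps the balanced feature to -2 / +2

# The gap between the two levels of the scaled feature grows as the classes
# become less balanced, and grows faster under variance scaling.
gap_std = max(scale_binary(imbalanced, 0.5)) - min(scale_binary(imbalanced, 0.5))
gap_var = max(scale_binary(imbalanced, 1)) - min(scale_binary(imbalanced, 1))
```

Because a penalized coefficient shrinks according to the scale of its feature, how strongly the scaling reacts to class imbalance determines how much the class balance leaks into the fitted coefficients.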

[LG-36] Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives

Link: https://arxiv.org/abs/2501.03727
Authors: Jinchao Li, Yuejiao Wang, Junan Li, Jiawen Kang, Bo Zheng, Simon Wong, Brian Mak, Helene Fung, Jean Woo, Man-Wai Mak, Timothy Kwok, Vincent Mok, Xianmin Gong, Xixin Wu, Xunying Liu, Patrick Wong, Helen Meng
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures

Abstract:Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Speech analysis offers a non-intrusive and scalable screening method, particularly through narrative tasks in neuropsychological assessment tools. Traditional narrative analysis often focuses on local indicators in microstructure, such as word usage and syntax. While these features provide insights into language production abilities, they often fail to capture global narrative patterns, or macrostructures. Macrostructures include coherence, thematic organization, and logical progressions, reflecting essential cognitive skills potentially critical for recognizing NCDs. Addressing this gap, we propose to investigate specific cognitive and linguistic challenges by analyzing topical shifts, temporal dynamics, and the coherence of narratives over time, aiming to reveal cognitive deficits by identifying narrative impairments, and exploring their impact on communication and cognition. The investigation is based on the CU-MARVEL Rabbit Story corpus, which comprises recordings of a story-telling task from 758 older adults. We developed two approaches: the Dynamic Topic Models (DTM)-based temporal analysis to examine the evolution of topics over time, and the Text-Image Temporal Alignment Network (TITAN) to evaluate the coherence between spoken narratives and visual stimuli. The DTM-based approach validated the effectiveness of dynamic topic consistency as a macrostructural metric (F1=0.61, AUC=0.78). The TITAN approach achieved the highest performance (F1=0.72, AUC=0.81), surpassing established microstructural and macrostructural feature sets. Cross-comparison and regression tasks further demonstrated the effectiveness of the proposed dynamic macrostructural modeling approaches for NCD detection.

[LG-37] Run-and-tumble chemotaxis using reinforcement learning

Link: https://arxiv.org/abs/2501.03687
Authors: Ramesh Pramanik, Shradha Mishra, Sakuntala Chatterjee
Subjects: Cell Behavior (q-bio.CB); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)

Abstract:Bacterial cells use run-and-tumble motion to climb up the attractant concentration gradient in their environment. By extending the uphill runs and shortening the downhill runs, the cells migrate towards the higher attractant zones. Motivated by this, we formulate a reinforcement learning (RL) algorithm where an agent moves in one dimension in the presence of an attractant gradient. The agent can perform two actions: either persistent motion in the same direction or reversal of direction. We assign costs for these actions based on the recent history of the agent’s trajectory. We ask which RL strategy works best in different types of attractant environment. We quantify the efficiency of the RL strategy by the ability of the agent (a) to localize in the favorable zones after large times, and (b) to learn about its complete environment. Depending on the attractant profile and the initial condition, we find that an optimum balance is needed between exploration and exploitation to ensure the most efficient performance.

[LG-38] Transfer Learning for Deep-Unfolded Combinatorial Optimization Solver with Quantum Annealer

Link: https://arxiv.org/abs/2501.03518
Authors: Ryo Hagiwara, Shunta Arai, Satoshi Takabe
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures

Abstract:Quantum annealing (QA) has attracted research interest as a sampler and combinatorial optimization problem (COP) solver. A recently proposed sampling-based solver for QA significantly reduces the required number of qubits, making it capable of handling large COPs. In relation to this, a trainable sampling-based COP solver has been proposed that optimizes its internal parameters from a dataset by using a deep learning technique called deep unfolding. Although learning the internal parameters accelerates convergence, the sampler in the trainable solver is restricted to a classical sampler owing to the training cost. In this study, to utilize QA in the trainable solver, we propose classical-quantum transfer learning, where parameters are trained classically, and the trained parameters are used in the solver with QA. The results of numerical experiments demonstrate that the trainable quantum COP solver using classical-quantum transfer learning improves convergence speed and execution time over the original solver.

[LG-39] Structure-Preference Enabled Graph Embedding Generation under Differential Privacy ICDE25

Link: https://arxiv.org/abs/2501.03451
Authors: Sen Zhang, Qingqing Ye, Haibo Hu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted by ICDE 25

Abstract:Graph embedding generation techniques aim to learn low-dimensional vectors for each node in a graph and have recently gained increasing research attention. Publishing low-dimensional node vectors enables various graph analysis tasks, such as structural equivalence and link prediction. Yet, improper publication opens a backdoor to malicious attackers, who can infer sensitive information of individuals from the low-dimensional node vectors. Existing methods tackle this issue by developing deep graph learning models with differential privacy (DP). However, they often suffer from large noise injections and cannot provide structural preferences consistent with mining objectives. Recently, skip-gram based graph embedding generation techniques are widely used due to their ability to extract customizable structures. Based on skip-gram, we present SE-PrivGEmb, a structure-preference enabled graph embedding generation under DP. For arbitrary structure preferences, we design a unified noise tolerance mechanism via perturbing non-zero vectors. This mechanism mitigates utility degradation caused by high sensitivity. By carefully designing negative sampling probabilities in skip-gram, we theoretically demonstrate that skip-gram can preserve arbitrary proximities, which quantify structural features in graphs. Extensive experiments show that our method outperforms existing state-of-the-art methods under structural equivalence and link prediction tasks.

[LG-40] On the Adversarial Robustness of Benjamini Hochberg NEURIPS2024

Link: https://arxiv.org/abs/2501.03402
Authors: Louis L Chen, Roberto Szechtman, Matan Seri
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG)
Comments: 22 pages, 5 figures, NeurIPS 2024

Abstract:The Benjamini-Hochberg (BH) procedure is widely used to control the false detection rate (FDR) in multiple testing. Applications of this control abound in drug discovery, forensics, anomaly detection, and, in particular, machine learning, ranging from nonparametric outlier detection to out-of-distribution detection and one-class classification methods. Considering this control could be relied upon in critical safety/security contexts, we investigate its adversarial robustness. More precisely, we study under what conditions BH does and does not exhibit adversarial robustness, we present a class of simple and easily implementable adversarial test-perturbation algorithms, and we perform computational experiments. With our algorithms, we demonstrate that there are conditions under which BH’s control can be significantly broken with relatively few (even just one) test score perturbation(s), and provide non-asymptotic guarantees on the expected adversarial-adjustment to FDR. Our technical analysis involves a combinatorial reframing of the BH procedure as a “balls into bins” process, and drawing a connection to generalized ballot problems to facilitate an information-theoretic approach for deriving non-asymptotic lower bounds.
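
For readers unfamiliar with the procedure the abstract studies, here is a minimal sketch of the standard BH step-up rule in plain Python. The function name `benjamini_hochberg` and the toy p-values are assumptions made for illustration; the perturbation shown is a hand-picked example, not one of the paper's adversarial algorithms.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy data (hypothetical): every p-value sits just above its BH threshold.
clean = [0.011, 0.021, 0.031, 0.041, 0.9]
# Same data with only the last p-value changed.
perturbed = [0.011, 0.021, 0.031, 0.041, 0.001]

print(sum(benjamini_hochberg(clean)))      # 0 rejections
print(sum(benjamini_hochberg(perturbed)))  # 5 rejections
```

In this toy case a single changed score moves the outcome from zero rejections to five, because the step-up rule rejects everything below the largest rank that passes its threshold. It only illustrates the kind of sensitivity the abstract refers to.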

[LG-41] The Artificial Scientist – in-transit Machine Learning of Plasma Simulations

Link: https://arxiv.org/abs/2501.03383
Authors: Jeffrey Kelling, Vicente Bolea, Michael Bussmann, Ankush Checkervarty, Alexander Debus, Jan Ebert, Greg Eisenhauer, Vineeth Gutta, Stefan Kesselheim, Scott Klasky, Richard Pausch, Norbert Podhorszki, Franz Poschel, David Rogers, Jeyhun Rustamov, Steve Schmerler, Ulrich Schramm, Klaus Steiniger, Rene Widera, Anna Willmann, Sunita Chandrasekaran
Categories: Computational Physics (physics.comp-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures

Abstract:Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof of concept, we consider a GPU-accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to the Frontier exascale system.
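
The abstract states only that experience replay is used when learning from the streamed, non-steady simulation. As a rough sketch of the general idea, the buffer below keeps a bounded, uniformly sampled memory of past stream items and mixes them into each new batch. The class `ReplayBuffer`, its reservoir-sampling policy, and the batch sizes are all assumptions for illustration, not the authors' implementation.

```python
import random


class ReplayBuffer:
    """Fixed-size memory filled by reservoir sampling over a data stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Reservoir sampling: each sample seen so far stays in the
            # buffer with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def mixed_batch(self, fresh, n_replay):
        """Return the fresh samples followed by up to n_replay stored ones."""
        k = min(n_replay, len(self.items))
        return list(fresh) + self.rng.sample(self.items, k)


buffer = ReplayBuffer(capacity=4)
for step in range(100):
    fresh = [step]                        # stand-in for data from one time step
    batch = buffer.mixed_batch(fresh, 3)  # new data plus replayed old data
    # ... a training step on `batch` would go here ...
    buffer.add(step)

print(len(buffer.items))  # 4
```

Replaying older samples alongside new ones keeps earlier stages of the simulation represented in every update, which is the usual motivation for experience replay in continual learning.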

[LG-42] DenseGNN: universal and scalable deeper graph neural networks for high-performance property prediction in crystals and molecules

Link: https://arxiv.org/abs/2501.03278
Authors: Hongwei Du, Jiamin Wang, Jian Hui, Lanting Zhang, Hong Wang
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: DenseGNN optimizes computational efficiency and accuracy in predicting material properties using DCN, HRN, and LOPE. It enhances transferability and overcomes over-smoothing, enabling deep architectures. Performance improvements on JARVIS-DFT, Materials Project, and QM9 datasets advance materials discovery and design

Abstract:Generative models generate vast numbers of hypothetical materials, necessitating fast, accurate models for property prediction. Graph Neural Networks (GNNs) excel in this domain but face challenges like high training costs, domain adaptation issues, and over-smoothing. We introduce DenseGNN, which employs Dense Connectivity Network (DCN), Hierarchical Node-Edge-Graph Residual Networks (HRN), and Local Structure Order Parameters Embedding (LOPE) to address these challenges. DenseGNN achieves state-of-the-art performance on datasets such as JARVIS-DFT, Materials Project, and QM9, improving the performance of models like GIN, Schnet, and Hamnet on materials datasets. By optimizing atomic embeddings and reducing computational costs, DenseGNN enables deeper architectures and surpasses other GNNs in crystal structure distinction, approaching X-ray diffraction method accuracy. This advances materials discovery and design.

Information Retrieval

[IR-0] (De)-Indexing and the Right to be Forgotten

Link: https://arxiv.org/abs/2501.03989
Authors: Salvatore Vilella, Giancarlo Ruffo
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)

Abstract:In the digital age, the challenge of forgetfulness has emerged as a significant concern, particularly regarding the management of personal data and its accessibility online. The right to be forgotten (RTBF) allows individuals to request the removal of outdated or harmful information from public access, yet implementing this right poses substantial technical difficulties for search engines. This paper aims to introduce non-experts to the foundational concepts of information retrieval (IR) and de-indexing, which are critical for understanding how search engines can effectively “forget” certain content. We will explore various IR models, including boolean, probabilistic, vector space, and embedding-based approaches, as well as the role of Large Language Models (LLMs) in enhancing data processing capabilities. By providing this overview, we seek to highlight the complexities involved in balancing individual privacy rights with the operational challenges faced by search engines in managing information visibility.
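
To make the boolean model and the notion of de-indexing concrete, here is a minimal sketch of an inverted index with an AND query and a removal operation. The class `InvertedIndex`, its method names, and the sample documents are hypothetical and chosen only for illustration; real search engines use far more elaborate structures.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> indexed text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search_and(self, query):
        """Boolean AND: ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def deindex(self, doc_id):
        """Drop a document from all posting lists; the source page is untouched."""
        text = self.docs.pop(doc_id, "")
        for term in set(text.lower().split()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]


index = InvertedIndex()
index.add(1, "Alice Smith bankruptcy filing 2009")
index.add(2, "Alice Smith wins chess tournament")

print(sorted(index.search_and("alice smith")))  # [1, 2]
index.deindex(1)
print(sorted(index.search_and("alice smith")))  # [2]
```

After `deindex(1)` the first document can no longer be found through the index even though nothing was deleted at its origin, which is the distinction between de-indexing and erasure that the overview introduces.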

[IR-1] Towards Reliable Testing for Multiple Information Retrieval System Comparisons

Link: https://arxiv.org/abs/2501.03930
Authors: David Otero, Javier Parapar, Álvaro Barreiro
Categories: Information Retrieval (cs.IR)

Abstract:Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most of the IR real-world experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates according to the significance level for typical sample sizes while being the best test in terms of statistical power.
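
The error inflation mentioned in the abstract can be illustrated with a small calculation. The sketch below assumes that all pairwise tests are independent and that every null hypothesis is true; this simplification does not hold exactly for systems evaluated on the same topics. The function names are hypothetical.

```python
def pairwise_comparisons(n_systems):
    """Number of distinct system pairs among n_systems."""
    return n_systems * (n_systems - 1) // 2


def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) over n_tests independent tests,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests = pairwise_comparisons(5)
print(n_tests)                                   # 10
print(round(familywise_error_rate(n_tests), 3))  # 0.401
```

Under these assumptions, comparing five systems pairwise at a significance level of 0.05 already gives roughly a 40% chance of at least one spurious difference, which is why corrections such as Benjamini-Hochberg are applied.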

[IR-2] Extending ChatGPT with a Browserless System for Web Product Price Extraction

Link: https://arxiv.org/abs/2501.03811
Authors: Jorge Lloret-Gazo
Categories: Information Retrieval (cs.IR)
Comments: 14 pages, 4 figures

Abstract:With the advent of ChatGPT, we can find very clean, precise answers to a wide variety of questions. However, for questions such as ‘find the price of the lemon cake at zingerman’s’, the answer looks like ‘I can’t browse the web right now’. In this paper, we propose a system, called Wextractor, which extends ChatGPT to answer questions such as the one mentioned above. Obviously, our system cannot be labeled as ‘artificial intelligence’. It simply covers a kind of transactional search that is not included in the current version of ChatGPT. Moreover, Wextractor includes two improvements with respect to the initial version: social extraction and pointing pattern extraction to improve the answer speed.

Attachment Download

Click to download the full list of today's papers
