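This post lists the latest papers retrieved from arXiv.org on 2025-01-03. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.
Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.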
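Contents
Overview (2025-01-03)
730 papers were updated today, including:
- Natural Language Processing: 146 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 194 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 166 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 273 papers (Machine Learning (cs.LG))
Natural Language Processing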
[NLP-0] Unifying Specialized Visual Encoders for Video Language Models
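Quick read: This paper addresses a limitation in how current Video Large Language Models (VideoLLMs) process visual input. Existing VideoLLMs typically rely on a single vision encoder for all visual information, which limits the type and amount of visual information passed to the LLM. The paper proposes MERV (Multi-Encoder Representation of Videos), whose key idea is to use multiple frozen visual encoders to build a unified representation of a video. By spatio-temporally aligning the features from each encoder, MERV supplies more comprehensive visual knowledge and performs well on open-ended and multiple-choice video understanding tasks. In experiments, MERV is up to 3.7% more accurate than Video-LLaVA on standard video understanding benchmarks and improves on SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV also adds minimal extra parameters, trains faster than equivalent single-encoder methods, and processes visual information in parallel. These results suggest that using multiple vision encoders is a promising direction for video understanding.
Link: https://arxiv.org/abs/2501.01426
Authors: Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, Olga Russakovsky
Affiliations: Princeton University; Salesforce Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project page: this https URL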
Abstract:The recent advent of Large Language Models (LLMs) has ushered sophisticated reasoning capabilities into the realm of video through Video Large Language Models (VideoLLMs). However, VideoLLMs currently rely on a single vision encoder for all of their visual processing, which limits the amount and type of visual information that can be conveyed to the LLM. Our method, MERV, Multi-Encoder Representation of Videos, instead leverages multiple frozen visual encoders to create a unified representation of a video, providing the VideoLLM with a comprehensive set of specialized visual knowledge. Spatio-temporally aligning the features from each encoder allows us to tackle a wider range of open-ended and multiple-choice video understanding questions and outperform prior state-of-the-art works. MERV is up to 3.7% better in accuracy than Video-LLaVA across the standard suite video understanding benchmarks, while also having a better Video-ChatGPT score. We also improve upon SeViLA, the previous best on zero-shot Perception Test accuracy, by 2.2%. MERV introduces minimal extra parameters and trains faster than equivalent single-encoder methods while parallelizing the visual processing. Finally, we provide qualitative evidence that MERV successfully captures domain knowledge from each of its encoders. Our results offer promising directions in utilizing multiple vision encoders for comprehensive video understanding.
[NLP-1] OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
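Quick read: This paper addresses the limitations of current spoken dialogue systems in handling complex real-world conversations that involve audio events, musical contexts, and emotional expressions. Existing dialogue datasets are limited in both scale and scenario diversity, so systems struggle with varied conversational situations. The key to the solution is using synthetic data to improve dialogue models across diverse scenarios. The authors introduce ShareChatX, which they describe as the first large-scale spoken dialogue dataset spanning diverse scenarios, and build the OmniChat system on top of it. OmniChat uses a heterogeneous feature fusion module to optimize feature selection in different dialogue contexts. The paper also examines key questions in training dialogue systems with synthetic data, determines the ideal balance between synthetic and real data, and achieves state-of-the-art results on the real-world dialogue dataset DailyTalk. Synthetic data proves especially important for complex dialogue scenarios involving audio and music.
Link: https://arxiv.org/abs/2501.01384
Authors: Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao
Affiliations: Zhejiang University; Meituan
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: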
Abstract: With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at this https URL.
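[NLP-2] Training Medical Large Vision-Language Models with Abnormal-Aware Feedback
Quick read: This paper targets visual localization in medical images, a weak point of existing Medical Large Vision-Language Models (Med-LVLMs) that is crucial for abnormality detection and interpretation. The authors propose UMed-LVLM, a model designed around Unveiling Medical abnormalities. The solution has two key parts: 1) a Medical Abnormalities Unveiling (MAU) dataset, collected by prompting GPT-4V to generate diagnoses based on identified abnormal areas in medical images; 2) a two-stage training method consisting of Abnormal-Aware Instruction Tuning followed by Abnormal-Aware Rewarding, where the latter comprises Abnormal Localization Rewarding and Vision Relevance Rewarding. Experiments show that UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormalities, and that strengthening abnormality detection significantly improves the model's understanding of medical images and its generalization.
Link: https://arxiv.org/abs/2501.01377
Authors: Yucheng Zhou, Lingran Song, Jianbing Shen
Affiliations: SKL-IOTSC, CIS, University of Macau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages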
Abstract:Existing Medical Large Vision-Language Models (Med-LVLMs), which encapsulate extensive medical knowledge, demonstrate excellent capabilities in understanding medical images and responding to human queries based on these images. However, there remain challenges in visual localization in medical images, which is crucial for abnormality detection and interpretation. To address these issues, we propose a novel UMed-LVLM designed with Unveiling Medical abnormalities. Specifically, we collect a Medical Abnormalities Unveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM training. To collect MAU dataset, we propose a prompt method utilizing the GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Moreover, the two-stage training method includes Abnormal-Aware Instruction Tuning and Abnormal-Aware Rewarding, comprising Abnormal Localization Rewarding and Vision Relevance Rewarding. Experimental results demonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormality. In addition, this work shows that enhancing the abnormality detection capabilities of Med-LVLMs significantly improves their understanding of medical images and generalization capability.
[NLP-3] Embedding-based Approaches to Hyperpartisan News Detection
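Quick read: This paper addresses how to determine whether a news article is hyperpartisan. Hyperpartisan news takes an extremely polarized political standpoint with the intention of creating political divide among the public. The authors tried several approaches, including n-grams, sentiment analysis, and sentence and document representations from pre-trained ELMo (Embeddings from Language Models). The best system combines pre-trained ELMo with a bidirectional LSTM (Long Short-Term Memory) and reaches 83% accuracy under 10-fold cross-validation without much hyperparameter tuning. The key to the solution is representing text with pre-trained ELMo and using the bidirectional LSTM to capture context, which helps separate hyperpartisan articles from other news.
Link: https://arxiv.org/abs/2501.01370
Authors: Karthik Mohan, Pengyu Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 5 pages, 1 figure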
Abstract: In this paper, we describe our systems in which the objective is to determine whether a given news article could be considered as hyperpartisan. Hyperpartisan news is news that takes an extremely polarized political standpoint with an intention of creating political divide among the public. We attempted several approaches, including n-grams, sentiment analysis, as well as sentence and document representation using pre-trained ELMo. Our best system using pre-trained ELMo with Bidirectional LSTM achieved an accuracy of 83% through 10-fold cross-validation without much hyperparameter tuning.
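The 10-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the helper names `k_fold_indices`, `cross_validate`, and `majority_baseline` are hypothetical, and the toy majority-label classifier merely stands in for the ELMo + BiLSTM model, which is not reproduced here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, disjoint, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_score, k=10, seed=0):
    """Return the mean held-out accuracy over k folds.

    train_and_score is a hypothetical callback that receives
    (train_x, train_y, test_x, test_y) and returns accuracy in [0, 1].
    """
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Toy stand-in for a real model: always predict the majority training label."""
    majority = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)
```

Each sample is held out exactly once, so the reported figure is an average over ten train/test splits rather than a single split.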
[NLP-4] ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
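Quick read: This paper addresses the fact that existing datasets for 3D visual grounding (3DVG) do not adequately cover the diversity of natural language descriptions. Although recent 3DVG datasets have been scaled up with LLM-based methods, they still do not capture the full range of prompts that could be specified in English, which limits how well models generalize in practice. The paper proposes a framework for linguistically analyzing 3DVG prompts and introduces ViGiL3D (Visual Grounding with Diverse Language in 3D), a diagnostic dataset for evaluating visual grounding methods against diverse language patterns. Evaluating existing open-vocabulary 3DVG methods shows that they perform poorly on more challenging, out-of-distribution prompts, which points to their limitations in real-world applications. The key to the solution is combining linguistic analysis with a linguistically diverse dataset in order to improve how models understand and ground complex language.
Link: https://arxiv.org/abs/2501.01366
Authors: Austin T. Wang, ZeMing Gong, Angel X. Chang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages with 5 figures and 11 tables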
Abstract:3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
[NLP-5] AdaptVC: High Quality Voice Conversion with Adaptive Learning KR
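Quick read: This paper addresses a key challenge in voice conversion: extracting disentangled linguistic content from the source speaker's speech and voice style from the reference speaker's speech, while remaining robust in zero-shot scenarios. Existing approaches use various techniques to separate the two, but generalization still needs improvement. The core of the proposed solution is tuning self-supervised speech features with adapters, which are trained to dynamically encode nuanced features from rich self-supervised representations; a decoder then fuses them to produce speech that closely resembles the reference with minimal loss of content. The paper also uses a conditional flow matching decoder with cross-attention speaker conditioning to further improve synthesis quality and efficiency. Experiments show that the method outperforms existing models in speech quality and similarity to the reference in the zero-shot setting.
Link: https://arxiv.org/abs/2501.01347
Authors: Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 3 figures. Audio samples are available in the demo page: this https URL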
Abstract:The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, a generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.
[NLP-6] Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
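Quick read: This paper addresses the alignment between visual and linguistic representations in Large Vision-Language Models (LVLMs), in particular the root causes and effects of alignment and misalignment. Through an explainability lens, the survey systematically covers the representational and behavioral aspects of alignment, training methodologies, and theoretical foundations, and analyzes misalignment at three semantic levels: object, attribute, and relation. It finds that misalignment arises from challenges at the data, model, and inference levels. The paper also reviews existing mitigation strategies, dividing them into parameter-frozen and parameter-tuning approaches. Looking ahead, it emphasizes standardized evaluation protocols and in-depth explainability studies as the way to better understand and improve alignment.
Link: https://arxiv.org/abs/2501.01346
Authors: Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Lu Cheng, Mengnan Du
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 16 pages, 3 figures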
Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and linguistic representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.
[NLP-7] Aligning Large Language Models for Faithful Integrity Against Opposing Argument
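Quick read: This paper addresses the problem that Large Language Models (LLMs) are easily misled by unfaithful arguments in complex reasoning tasks, even when their original statements are correct. Specifically, the goal is for LLMs to hold to their faithful statements when faced with opposing arguments and to correct their incorrect statements when presented with faithful arguments. The paper proposes a framework named Alignment for Faithful Integrity with Confidence Estimation (AFICE). Its key component is Bilateral Confidence Estimation (BCE), which estimates the uncertainty of each response the LLM generates in a given context. BCE estimates the model's confidence in the question from internal states during decoding and its confidence in the answer from cumulative probability ratios. Using BCE, the authors construct a conversational preference dataset composed of context, original statement, and argument, and align the LLM for faithful integrity with Direct Preference Optimization (DPO). Experiments show that AFICE significantly improves the LLM's ability to maintain faithful responses under opposing arguments, supporting both practical utility and trustworthiness in complex interactive settings.
Link: https://arxiv.org/abs/2501.01336
Authors: Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 5 figures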
Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimate the model’s confidence to the question based on the internal states during decoding as well as to the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM’s ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via this https URL
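The abstract says the preference dataset is used for alignment with Direct Preference Optimization (DPO). As a rough illustration, the standard DPO objective for a single preference pair can be sketched as below. The function name `dpo_loss`, the default `beta=0.1`, and the scalar log-probability inputs are assumptions made for the example; the paper's BCE-based dataset construction is not reproduced.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model; beta scales the implicit reward (assumed value).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy favors the chosen response more strongly than the reference model does, and rises above it in the opposite case.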
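[NLP-8] Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension
Quick read: This paper addresses the open question of how to understand and evaluate how large language models (LLMs) acquire, retain, and apply knowledge. Existing evaluations usually rely on binary accuracy, which does not fully reflect how well a model understands and applies knowledge. The paper proposes a framework, K-(CSA)^2, that categorizes LLM knowledge along two dimensions, correctness and confidence, and defines six categories ranging from highly confident correct knowledge to confidently held misconceptions, enabling a nuanced evaluation of model comprehension. The framework reveals how the structure of internal (pre-trained) and external (context-dependent) knowledge changes, especially when techniques such as chain-of-thought prompting and reinforcement learning with human feedback are applied. A layer-wise analysis further finds that higher layers of LLMs tend to encode high-confidence knowledge, while low-confidence knowledge tends to emerge in middle-to-lower layers. The framework offers a new way to examine the knowledge structure of LLMs.
Link: https://arxiv.org/abs/2501.01332
Authors: Yanbo Fang, Ruixiang Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: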
Abstract:Understanding how large language models (LLMs) acquire, retain, and apply knowledge remains an open challenge. This paper introduces a novel framework, K-(CSA)^2, which categorizes LLM knowledge along two dimensions: correctness and confidence. The framework defines six categories of knowledge, ranging from highly confident correctness to confidently held misconceptions, enabling a nuanced evaluation of model comprehension beyond binary accuracy. Using this framework, we demonstrate how techniques like chain-of-thought prompting and reinforcement learning with human feedback fundamentally alter the knowledge structures of internal (pre-trained) and external (context-dependent) knowledge in LLMs. CoT particularly enhances base model performance and shows synergistic benefits when applied to aligned LLMs. Moreover, our layer-wise analysis reveals that higher layers in LLMs encode more high-confidence knowledge, while low-confidence knowledge tends to emerge in middle-to-lower layers.
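[NLP-9] The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation
Quick read: This paper addresses how to automatically generate the best prompt for each large language model (LLM) to improve test case generation. Existing work mainly relies on human-written plain prompts, which leads to suboptimal results, and uses the same prompt for all LLMs even though different LLMs may suit different prompts. The key is overcoming two main limitations of existing automated prompt optimization methods. First, they iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, so the resulting prompts lack diversity and tend to repeat the same errors. Second, the prompts generally lack domain contextual knowledge, which limits LLM performance on test case generation. The paper therefore argues for a method that automatically discovers and optimizes prompts tailored to each LLM and enriched with domain knowledge.
Link: https://arxiv.org/abs/2501.01329
Authors: Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, Michael Lyu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: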
Abstract:Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, the existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although there are methods on automated prompt optimization in the natural language processing field, they are hard to produce effective prompts for the test case generation task. First, the methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts are generally lack of domain contextual knowledge, limiting LLMs’ performance in the task.
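[NLP-10] Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking
Quick read: This paper addresses hallucination in large language models (LLMs): auto-regressive generation without deliberate reasoning often produces untrustworthy and factually inaccurate text. The key to the solution is HaluSearch, a framework that incorporates tree search-based algorithms (e.g. Monte Carlo Tree Search, MCTS) to provide an explicit slow-thinking generation process that mitigates hallucinations during inference. HaluSearch frames text generation as a step-by-step reasoning process, uses a self-evaluation reward model to score each generation step, and guides the tree search toward the most reliable generation path, making full use of the LLM's internal knowledge. To balance efficiency and quality, the framework adds a hierarchical thinking system switch mechanism inspired by dual process theory in cognitive science; it dynamically alternates between fast and slow thinking at both the instance and step levels, adapting to question complexity and reasoning state. Experiments show that the approach significantly outperforms baselines on English and Chinese datasets.
Link: https://arxiv.org/abs/2501.01306
Authors: Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: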
Abstract:Large language models (LLMs) demonstrate exceptional capabilities, yet still face the hallucination issue. Typical text generation approaches adopt an auto-regressive generation without deliberate reasoning, which often results in untrustworthy and factually inaccurate responses. In this paper, we propose HaluSearch, a novel framework that incorporates tree search-based algorithms (e.g. MCTS) to enable an explicit slow thinking generation process for mitigating hallucinations of LLMs during inference. Specifically, HaluSearch frames text generation as a step-by-step reasoning process, using a self-evaluation reward model to score each generation step and guide the tree search towards the most reliable generation pathway for fully exploiting the internal knowledge of LLMs. To balance efficiency and quality, we introduce a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science, which dynamically alternates between fast and slow thinking modes at both the instance and step levels, adapting to the complexity of questions and reasoning states. We conduct extensive experiments on both English and Chinese datasets and the results show that our approach significantly outperforms baseline approaches.
[NLP-11] Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments – The Depression and Anxiety Case
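Quick read: This paper addresses the use of large language models (LLMs) in diagnostic assessments, specifically how they can assist with assessing major depressive disorder (MDD) and generalized anxiety disorder (GAD). The core approach uses prompting and fine-tuning so that LLMs closely follow the standard diagnostic procedures used by clinicians, namely the Patient Health Questionnaire-9 (PHQ-9) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire. The paper studies proprietary models (such as GPT-3.5 and GPT-4o) and open-source models (such as llama-3.1-8b and mixtral-8x7b) under these techniques, and evaluates the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. The key point is using careful fine-tuning and prompting strategies to make LLMs accurate and reliable in diagnostic assessment.
Link: https://arxiv.org/abs/2501.01305
Authors: Kaushik Roy, Harshul Surana, Darssan Eswaramoorthi, Yuxin Zi, Vedant Palit, Ritvik Garimella, Amit Sheth
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: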
Abstract:Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the Mentalllama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.
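For readers unfamiliar with the two questionnaires, the sketch below shows how PHQ-9 and GAD-7 totals are conventionally computed: each item is rated 0 to 3, the ratings are summed, and the total is mapped to a severity band using the commonly published cut-points. The names `PHQ9_BANDS`, `GAD7_BANDS`, and `score_questionnaire` are hypothetical, and the abstract does not state how the paper maps LLM outputs onto these scores, so this is an illustration of the instruments only, not of the authors' pipeline. The bands are screening categories, not a diagnosis.

```python
# Lower bound of each severity band, using the commonly published cut-points.
PHQ9_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"),
              (15, "moderately severe"), (20, "severe")]
GAD7_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"), (15, "severe")]


def score_questionnaire(item_scores, n_items, bands):
    """Sum per-item ratings (each 0-3) and return (total, severity band)."""
    if len(item_scores) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(item_scores)}")
    if any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    total = sum(item_scores)
    # Pick the last band whose lower bound does not exceed the total.
    label = [name for lower, name in bands if total >= lower][-1]
    return total, label
```

For example, `score_questionnaire(ratings, 9, PHQ9_BANDS)` scores a PHQ-9 response and `score_questionnaire(ratings, 7, GAD7_BANDS)` scores a GAD-7 response.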
[NLP-12] Citations and Trust in LLM Generated Responses AAAI2025
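Quick read: This paper examines user trust in question answering systems, in particular how their opaque nature may affect it. Using an anti-monitoring framework, the study hypothesizes that trust is positively correlated with the presence of citations and inversely related to users checking those citations. To test this, the authors ran a live question-answering experiment that showed text responses generated by a commercial chatbot together with varying numbers of citations (zero, one, or five), either relevant or random. The experiment recorded whether participants checked the citations and their self-reported trust in the responses. Trust increased significantly when citations were present, even when they were random, and decreased significantly when participants checked them. The findings highlight the importance of citations in enhancing trust in AI-generated content.
Link: https://arxiv.org/abs/2501.01303
Authors: Yifan Ding, Matthew Facciani, Amrit Poudel, Ellen Joyce, Salvador Aguinaga, Balaji Veeramani, Sanmitra Bhattacharya, Tim Weninger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2025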
Abstract:Question answering systems are rapidly advancing, but their opaque nature may impact user trust. We explored trust through an anti-monitoring framework, where trust is predicted to be correlated with presence of citations and inversely related to checking citations. We tested this hypothesis with a live question-answering experiment that presented text responses generated using a commercial Chatbot along with varying citations (zero, one, or five), both relevant and random, and recorded if participants checked the citations and their self-reported trust in the generated responses. We found a significant increase in trust when citations were present, a result that held true even when the citations were random; we also found a significant decrease in trust when participants checked the citations. These results highlight the importance of citations in enhancing trust in AI-generated content.
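[NLP-13] ToolComp: A Multi-Tool Reasoning Process Supervision Benchmark
Quick read: This paper addresses the challenge AI systems face in executing complex multi-step reasoning tasks that involve multiple tools. Existing benchmarks fall short of capturing the real-world complexity of tool-use reasoning, where both the final answer and the intermediate steps need to be verified. To bridge the gap, the authors introduce ToolComp, a comprehensive benchmark for evaluating multi-step tool-use reasoning. ToolComp is developed through collaboration between models and human annotators and includes human-edited/verified prompts, final answers, and process supervision labels, so it can evaluate both final outcomes and intermediate reasoning. Experiments show that process-supervised reward models (PRMs) generalize significantly better than outcome-supervised reward models (ORMs) on complex tool-use reasoning, improving rank@1 accuracy by 19% and 11% when ranking base and fine-tuned model trajectories, respectively. The findings underline the role of process supervision in both evaluating and training AI models for complex multi-step tool use.
Link: https://arxiv.org/abs/2501.01290
Authors: Vaskar Nath, Pranav Raja, Claire Yoon, Sean Hendryx
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: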
Abstract:Despite recent advances in AI, the development of systems capable of executing complex, multi-step reasoning tasks involving multiple tools remains a significant challenge. Current benchmarks fall short in capturing the real-world complexity of tool-use reasoning, where verifying the correctness of not only the final answer but also the intermediate steps is important for evaluation, development, and identifying failures during inference time. To bridge this gap, we introduce ToolComp, a comprehensive benchmark designed to evaluate multi-step tool-use reasoning. ToolComp is developed through a collaboration between models and human annotators, featuring human-edited/verified prompts, final answers, and process supervision labels, allowing for the evaluation of both final outcomes and intermediate reasoning. Evaluation across six different model families demonstrates the challenging nature of our dataset, with the majority of models achieving less than 50% accuracy. Additionally, we generate synthetic training data to compare the performance of outcome-supervised reward models (ORMs) with process-supervised reward models (PRMs) to assess their ability to improve complex tool-use reasoning as evaluated by ToolComp. Our results show that PRMs generalize significantly better than ORMs, achieving a 19% and 11% improvement in rank@1 accuracy for ranking base and fine-tuned model trajectories, respectively. These findings highlight the critical role of process supervision in both the evaluation and training of AI models, paving the way for more robust and capable systems in complex, multi-step tool-use tasks.
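The rank@1 accuracy reported above can be read as follows, under an assumed definition that the abstract does not spell out: for each prompt, a reward model scores several candidate trajectories, and the prompt counts as a hit when the top-scored trajectory is a correct one. The function name `rank_at_1_accuracy` and the `(reward_score, is_correct)` tuple layout are hypothetical.

```python
def rank_at_1_accuracy(groups):
    """Fraction of prompts whose highest-scored candidate is correct.

    groups: one list per prompt; each candidate is a
    (reward_score, is_correct) tuple (assumed layout).
    """
    hits = 0
    for candidates in groups:
        # The reward model's top pick is the candidate with the largest score.
        best = max(candidates, key=lambda c: c[0])
        hits += 1 if best[1] else 0
    return hits / len(groups)
```

Under this reading, a better reward model is one that more often places a correct trajectory at the top of its ranking.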
[NLP-14] NeutraSum: A Language Model can help a Balanced Media Diet by Neutralizing News Summaries
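Quick read: This paper addresses media bias in news articles. The bias stems from the political polarisation of media outlets and can reinforce societal stereotypes and beliefs. Previous studies tried to generate bias-free summaries from multiperspective news articles but did not effectively mitigate inherent media bias. The paper proposes NeutraSum, a framework whose key innovation is two neutrality losses that adjust the semantic space of generated summaries to minimise media bias. The losses balance the semantic distances across polarised inputs and keep generated summaries aligned with expert-written ones, guiding generation toward neutral and factually rich summaries. Media bias is evaluated with the political compass test. Experiments show that NeutraSum improves summarisation performance and significantly reduces media bias, making it a promising approach to neutral news summarisation.
Link: https://arxiv.org/abs/2501.01284
Authors: Xi Luo, Junjie Liu, Sirong Wu, Yuhui Deng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: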
Abstract: Media bias in news articles arises from the political polarisation of media outlets, which can reinforce societal stereotypes and beliefs. Reporting on the same event often varies significantly between outlets, reflecting their political leanings through polarised language and focus. Although previous studies have attempted to generate bias-free summaries from multiperspective news articles, they have not effectively addressed the challenge of mitigating inherent media bias. To address this gap, we propose NeutraSum, a novel framework that integrates two neutrality losses to adjust the semantic space of generated summaries, thus minimising media bias. These losses, designed to balance the semantic distances across polarised inputs and ensure alignment with expert-written summaries, guide the generation of neutral and factually rich summaries. To evaluate media bias, we employ the political compass test, which maps political leanings based on economic and social dimensions. Experimental results on the Allsides dataset demonstrate that NeutraSum not only improves summarisation performance but also achieves significant reductions in media bias, offering a promising approach for neutral news summarisation.
[NLP-15] CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
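Quick read: This paper addresses the weak cultural understanding of Vision-Language Models (VLMs), which often misinterpret symbols, gestures, and artifacts because their training data is predominantly Western-centric. It offers two key contributions. First, it constructs CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, for evaluating and improving VLMs' multicultural understanding. Second, it proposes CultureVLM, a series of VLMs fine-tuned on this dataset that significantly improve performance on cultural understanding tasks. An evaluation of 16 models reveals significant disparities, with stronger performance on Western concepts; fine-tuning on CultureVerse generalizes across cultures, continents, and datasets without sacrificing performance on general VLM benchmarks. The work lays a foundation for more equitable and culturally aware multimodal AI systems.
Link: https://arxiv.org/abs/2501.01282
Authors: Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, Jindong Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report; 26 pages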
Abstract: Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs’ multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models’ general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
[NLP-16] Does a Large Language Model Really Speak in Human-Like Language?
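Quick read: This paper addresses two key questions. First, is the difference in latent community structure between human-written text ($\mathcal{O}$) and its LLM-paraphrased version ($\mathcal{G}$) similar to the difference between $\mathcal{G}$ and its twice-paraphrased version ($\mathcal{S}$)? Second, does $\mathcal{G}$ move closer to $\mathcal{O}$ when the LLM parameter that controls text variability is adjusted? The underlying assumption is that if LLM-generated text truly resembles human language, the gap between $\mathcal{O}$ and $\mathcal{G}$ should be similar to the gap between $\mathcal{G}$ and $\mathcal{S}$. To answer these questions, the authors propose a statistical hypothesis testing framework that exploits the paraphrasing relationship between texts to map the relative positions of different datasets into a common space, which allows them to be compared directly. The results indicate that GPT-generated text remains distinct from human-authored text.
Link: https://arxiv.org/abs/2501.01273
Authors: Mose Park, Yunjin Choi, Jong-June Jeon
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Applications (stat.AP)
Comments: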
Abstract: Large Language Models (LLMs) have recently emerged, attracting considerable attention due to their ability to generate highly natural, human-like text. This study compares the latent community structures of LLM-generated text and human-written text within a hypothesis testing procedure. Specifically, we analyze three text sets: original human-written texts ($\mathcal{O}$), their LLM-paraphrased versions ($\mathcal{G}$), and a twice-paraphrased set ($\mathcal{S}$) derived from $\mathcal{G}$. Our analysis addresses two key questions: (1) Is the difference in latent community structures between $\mathcal{O}$ and $\mathcal{G}$ the same as that between $\mathcal{G}$ and $\mathcal{S}$? (2) Does $\mathcal{G}$ become more similar to $\mathcal{O}$ as the LLM parameter controlling text variability is adjusted? The first question is based on the assumption that if LLM-generated text truly resembles human language, then the gap between the pair ($\mathcal{O}$, $\mathcal{G}$) should be similar to that between the pair ($\mathcal{G}$, $\mathcal{S}$), as both pairs consist of an original text and its paraphrase. The second question examines whether the degree of similarity between LLM-generated and human text varies with changes in the breadth of text generation. To address these questions, we propose a statistical hypothesis testing framework that leverages the fact that each text has corresponding parts across all datasets due to their paraphrasing relationship. This relationship enables the mapping of one dataset’s relative position to another, allowing two datasets to be mapped to a third dataset. As a result, both mapped datasets can be quantified with respect to the space characterized by the third dataset, facilitating a direct comparison between them. Our results indicate that GPT-generated text remains distinct from human-authored text.
[NLP-17] ProgCo: Program Helps Self-Correction of Large Language Models
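【Quick read】: This paper addresses the problem that large language models (LLMs) often cannot effectively self-verify or generate correct feedback during self-correction on complex reasoning tasks, which causes the correction to fail. The key to the solution is Program-driven Self-Correction (ProgCo). Program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Program-driven refinement (ProgRe) then takes the feedback from ProgVe and performs dual reflection and refinement on both the response and the verification program, reducing the misleading effect of incorrect feedback in complex reasoning tasks. Experiments show that ProgCo achieves effective self-correction and improves further when combined with real program tools.
Link: https://arxiv.org/abs/2501.01264
Authors: Xiaoshuai Song,Yanan Wu,Weixun Wang,Jiaheng Liu,Wenbo Su,Bo Zheng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress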
Abstract:Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe, conducts dual reflection and refinement on both responses and verification programs to mitigate misleading of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and can further enhance performance when combined with real program tools.
[NLP-18] CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
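【Quick read】: This paper addresses challenges in evaluating the code reasoning abilities of large language models (LLMs), in particular the shortcomings of existing benchmarks such as LiveCodeBench and USACO: private test cases are unavailable, special judges are not supported, and execution environments are misaligned. To address these issues, the paper proposes CodeElo, a standardized competition-level code generation benchmark. Its key components are: 1) it builds on the CodeForces platform, compiling the contest problems of the most recent six months together with contest divisions, problem difficulty ratings, and algorithm tags; 2) it introduces a judging method that submits solutions directly to the CodeForces platform; 3) it develops a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. With these measures, CodeElo provides Elo ratings for 30 popular open-source and 3 proprietary LLMs for the first time, revealing performance differences between models on competition-level code generation.
Link: https://arxiv.org/abs/2501.01257
Authors: Shanghaoran Quan,Jiaxi Yang,Bowen Yu,Bo Zheng,Dayiheng Liu,An Yang,Xuancheng Ren,Bofei Gao,Yibo Miao,Yunlong Feng,Zekun Wang,Jian Yang,Zeyu Cui,Yang Fan,Yichang Zhang,Binyuan Hui,Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: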
Abstract:With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.
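The abstract does not give the details of CodeElo's rating calculation. As a rough sketch only, the standard Elo expected-score formula that rating systems of this kind build on can be written as follows. The helper `performance_rating` and its bisection search are illustrative assumptions, not the paper's method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo probability that player A outperforms player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def performance_rating(opponent_ratings, wins, lo=0.0, hi=4000.0, iters=60):
    """Hypothetical helper (an assumption, not from the paper): find the
    rating whose expected number of wins against the given opponents
    matches the observed number of wins, using bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, r) for r in opponent_ratings)
        if expected < wins:
            lo = mid  # rating too low to explain the observed wins
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(expected_score(1900, 1500))
    print(performance_rating([1500, 1500], wins=1))
```

Because the expected score grows monotonically with the rating, bisection is enough to invert it; a real system would also need to handle ties and rating updates across contests.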
[NLP-19] Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
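【Quick read】: This paper targets the widespread toxic content and hate speech on the internet, especially in reader comments of online newspapers and forums. Partly because of legal requirements, moderators must manually review and delete such content, which is time-consuming and labor-intensive. The paper evaluates automated solutions by comparing the hate speech detection performance of GPT-4o, Jigsaw's Perspective API, and OpenAI's Moderation API. The comparison uses the HOCON34k test dataset, which was created specifically for developing tools that detect hate speech in reader comments of online newspapers. The experiments show that GPT-4o, prompted with zero-shot, one-shot, and few-shot strategies, outperforms the other systems and exceeds the HOCON34k baseline by about 5 percentage points on a combined metric of MCC and F2-score.
Link: https://arxiv.org/abs/2501.01256
Authors: Manuel Weber,Moritz Huber,Maximilian Auch,Alexander Döschl,Max-Emanuel Keller,Peter Mandl
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: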
Abstract:In recent years, toxic content and hate speech have become widespread phenomena on the internet. Moderators of online newspapers and forums are now required, partly due to legal regulations, to carefully review and, if necessary, delete reader comments. This is a labor-intensive process. Some providers of large language models already offer solutions for automated hate speech detection or the identification of toxic content. These include GPT-4o from OpenAI, Jigsaw’s (Google) Perspective API, and OpenAI’s Moderation API. Based on the selected German test dataset HOCON34k, which was specifically created for developing tools to detect hate speech in reader comments of online newspapers, these solutions are compared with each other and against the HOCON34k baseline. The test dataset contains 1,592 annotated text samples. For GPT-4o, three different promptings are used, employing a Zero-Shot, One-Shot, and Few-Shot approach. The results of the experiments demonstrate that GPT-4o outperforms both the Perspective API and the Moderation API, and exceeds the HOCON34k baseline by approximately 5 percentage points, as measured by a combined metric of MCC and F2-score.
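MCC and F2-score, the two metrics named in the abstract, are standard quantities computed from confusion-matrix counts. The sketch below shows their usual definitions. How the paper combines them into a single metric is not stated in the abstract, so no combination is attempted here; the function names are illustrative.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denom == 0 else (1 + b2) * tp / denom


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(mcc(tp=40, fp=10, tn=45, fn=5))
    print(f_beta(tp=40, fp=10, fn=5))
```

F2 suits moderation settings in which a missed hateful comment is considered more costly than a false alarm, since it rewards recall over precision.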
[NLP-20] Large Language Model-Enhanced Symbolic Reasoning for Knowledge Base Completion
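【Quick read】: This paper addresses two problems in Knowledge Base Completion (KBC): traditional rule-based reasoning lacks flexibility, and large language models (LLMs) hallucinate. Rule-based methods are verifiable but inflexible, while LLMs have strong semantic understanding but may generate unreliable, hallucinated rules. The paper proposes a framework that combines the semantic understanding of LLMs with the logical rigor of rule-based reasoning. It has three key components: a Subgraph Extractor, an LLM Proposer, and a Rule Reasoner. The Subgraph Extractor first samples subgraphs from the knowledge base, the LLM proposes diverse and meaningful rules based on these subgraphs, and the Rule Reasoner refines the rules to ensure their reliability and significance. The key idea is to use LLMs to enrich and diversify the rules while using rule-based reasoning to make inference more reliable, which yields strong performance and generalization across several knowledge base datasets.
Link: https://arxiv.org/abs/2501.01246
Authors: Qiyuan He,Jianfei Yu,Wenya Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: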
Abstract:Integrating large language models (LLMs) with rule-based reasoning offers a powerful solution for improving the flexibility and reliability of Knowledge Base Completion (KBC). Traditional rule-based KBC methods offer verifiable reasoning yet lack flexibility, while LLMs provide strong semantic understanding yet suffer from hallucinations. With the aim of combining LLMs’ understanding capability with the logic and rigor of rule-based approaches, we propose a novel framework consisting of a Subgraph Extractor, an LLM Proposer, and a Rule Reasoner. The Subgraph Extractor first samples subgraphs from the KB. Then, the LLM uses these subgraphs to propose diverse and meaningful rules that are helpful for inferring missing facts. To effectively avoid hallucination in LLMs’ generations, these proposed rules are further refined by a Rule Reasoner to pinpoint the most significant rules in the KB for Knowledge Base Completion. Our approach offers several key benefits: the utilization of LLMs to enhance the richness and diversity of the proposed rules and the integration with rule-based reasoning to improve reliability. Our method also demonstrates strong performance across diverse KB datasets, highlighting the robustness and generalizability of the proposed framework.
[NLP-21] Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants ICLR2025
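【Quick read】: This paper addresses the lack of a comprehensive and scientific evaluation of the face and human understanding abilities of multi-modal assistants. The authors first propose a hierarchical ability taxonomy with three levels of abilities. Based on this taxonomy, they collect images and annotations from publicly available face and human datasets and build a semi-automatic data pipeline to produce problems for the new benchmark. The resulting Face-Human-Bench contains a development set of 900 problems and a test set of 1800 problems, and supports both English and Chinese. Using this benchmark, the authors evaluate 25 mainstream multi-modal large language models (MLLMs), focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Inspired by multi-modal agents, they also explore which MLLM abilities need to be supplemented by specialist models.
Link: https://arxiv.org/abs/2501.01243
Authors: Lixiong Qin,Shilong Ou,Miaoxuan Zhang,Jiangning Wei,Yuhang Zhang,Xiaoshuai Song,Yuchen Liu,Mei Wang,Weiran Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 50 pages, 14 figures, 41 tables. Submitted to ICLR 2025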
Abstract:Faces and humans are crucial elements in social interaction and are widely included in everyday photos and videos. Therefore, a deep understanding of faces and humans will enable multi-modal assistants to achieve improved response quality and broadened application scope. Currently, the multi-modal assistant community lacks a comprehensive and scientific evaluation of face and human understanding abilities. In this paper, we first propose a hierarchical ability taxonomy that includes three levels of abilities. Then, based on this taxonomy, we collect images and annotations from publicly available datasets in the face and human community and build a semi-automatic data pipeline to produce problems for the new benchmark. Finally, the obtained Face-Human-Bench comprises a development set with 900 problems and a test set with 1800 problems, supporting both English and Chinese. We conduct evaluations over 25 mainstream multi-modal large language models (MLLMs) with our Face-Human-Bench, focusing on the correlation between abilities, the impact of the relative position of targets on performance, and the impact of Chain of Thought (CoT) prompting on performance. Moreover, inspired by multi-modal agents, we also explore which abilities of MLLMs need to be supplemented by specialist models.
[NLP-22] Automated Self-Refinement and Self-Correction for LLM-based Product Attribute Value Extraction
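【Quick read】: This paper addresses product attribute value extraction on e-commerce platforms, in particular how to ensure data consistency and usability when vendors provide unstructured product descriptions. It investigates two self-refinement techniques, error-based prompt rewriting and self-correction, and applies them to the product attribute value extraction task. Experiments with GPT-4o across zero-shot, few-shot in-context learning, and fine-tuning scenarios show that these techniques improve model performance only marginally while significantly increasing processing costs. By contrast, when training data is available, fine-tuning yields the highest performance, and its ramp-up costs are balanced out as the number of product descriptions grows.
Link: https://arxiv.org/abs/2501.01237
Authors: Alexander Brinkmann,Christian Bizer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: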
Abstract:Structured product data, in the form of attribute-value pairs, is essential for e-commerce platforms to support features such as faceted product search and attribute-based product comparison. However, vendors often provide unstructured product descriptions, making attribute value extraction necessary to ensure data consistency and usability. Large language models (LLMs) have demonstrated their potential for product attribute value extraction in few-shot scenarios. Recent research has shown that self-refinement techniques can improve the performance of LLMs on tasks such as code generation and text-to-SQL translation. For other tasks, the application of these techniques has resulted in increased costs due to processing additional tokens, without achieving any improvement in performance. This paper investigates applying two self-refinement techniques, error-based prompt rewriting and self-correction, to the product attribute value extraction task. The self-refinement techniques are evaluated across zero-shot, few-shot in-context learning, and fine-tuning scenarios using GPT-4o. The experiments show that both self-refinement techniques have only a marginal impact on the model’s performance across the different scenarios, while significantly increasing processing costs. For scenarios with training data, fine-tuning yields the highest performance, while the ramp-up costs of fine-tuning are balanced out as the amount of product descriptions increases.
[NLP-23] Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A Framework for Senior Design Projects
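【Quick read】: This paper asks how Multi-Agent Large Language Models (LLMs) can support complex problem-solving in senior design projects in engineering education. These projects often involve multidisciplinary considerations and conflicting objectives, such as optimizing technical performance while addressing ethical, social, and environmental concerns. The key to the proposed solution is a framework in which distinct LLM agents represent different expert perspectives, such as problem formulation agents, system complexity agents, societal and ethical agents, or project managers. The agents interact through standard multi-agent system (MAS) concepts such as coordination, cooperation, and negotiation, and prompt engineering is used to develop a distinct persona for each agent. In this way the agents simulate the collaborative dialogue of a human engineering team and, guided by principles from swarm AI, balance individual contributions towards a unified solution. The framework aims to encourage interdisciplinary reasoning and negotiation and thereby better support the complex demands of senior design projects.
Link: https://arxiv.org/abs/2501.01205
Authors: Abdullah Mushtaq,Muhammad Rafay Naeem,Ibrahim Ghaznavi,Muhammad Imran Taj,Imran Hashmi,Junaid Qadir
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: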
Abstract:Multi-Agent Large Language Models (LLMs) are gaining significant attention for their ability to harness collective intelligence in complex problem-solving, decision-making, and planning tasks. This aligns with the concept of the wisdom of crowds, where diverse agents contribute collectively to generating effective solutions, making it particularly suitable for educational settings. Senior design projects, also known as capstone or final year projects, are pivotal in engineering education as they integrate theoretical knowledge with practical application, fostering critical thinking, teamwork, and real-world problem-solving skills. In this paper, we explore the use of Multi-Agent LLMs in supporting these senior design projects undertaken by engineering students, which often involve multidisciplinary considerations and conflicting objectives, such as optimizing technical performance while addressing ethical, social, and environmental concerns. We propose a framework where distinct LLM agents represent different expert perspectives, such as problem formulation agents, system complexity agents, societal and ethical agents, or project managers, thus facilitating a holistic problem-solving approach. This implementation leverages standard multi-agent system (MAS) concepts such as coordination, cooperation, and negotiation, incorporating prompt engineering to develop diverse personas for each agent. These agents engage in rich, collaborative dialogues to simulate human engineering teams, guided by principles from swarm AI to efficiently balance individual contributions towards a unified solution. We adapt these techniques to create a collaboration structure for LLM agents, encouraging interdisciplinary reasoning and negotiation similar to real-world senior design projects. To assess the efficacy of this framework, we collected six proposals of engineering and computer science of…
[NLP-24] Data Augmentation Techniques for Chinese Disease Name Normalization
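【Quick read】: This paper addresses the severe shortage of training data for disease name normalization in the medical domain. Disease name normalization classifies disease names written in various formats into standardized names and is a fundamental component of disease-related functions in smart healthcare systems. To mitigate the data shortage, the paper proposes a novel data augmentation approach consisting of a series of data augmentation techniques and some supporting modules. Extensive experiments show that the approach significantly improves performance across various baseline models and training objectives, particularly when training data is limited.
Link: https://arxiv.org/abs/2501.01195
Authors: Wenqian Cui,Xiangling Fu,Shaohui Liu,Mingjun Gu,Xien Liu,Ji Wu,Irwin King
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The Version of Record of this contribution is published in 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2024)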
Abstract:Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, serving as a fundamental component in smart healthcare systems for various disease-related functions. Nevertheless, the most significant obstacle to existing disease name normalization systems is the severe shortage of training data. Consequently, we present a novel data augmentation approach that includes a series of data augmentation techniques and some supporting modules to help mitigate the problem. Through extensive experimentation, we illustrate that our proposed approach exhibits significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.
[NLP-25] Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets
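【Quick read】: This paper addresses the complex problem of accurately measuring and mitigating gender stereotypical bias in language models. The authors note the lack of correlation between existing intrinsic and extrinsic measurement approaches and argue that more fine-grained intrinsic measurement strategies are needed to capture different facets of gender stereotypes. The key to the solution is to analyze the data distributions across datasets, integrate gender stereotype components informed by social psychology, and adjust the distributions of two datasets to achieve better alignment of outcomes. The approach highlights the complexity of gender stereotyping in language models and points to new directions for more refined bias detection and mitigation techniques.
Link: https://arxiv.org/abs/2501.01168
Authors: Mahdi Zakizadeh,Mohammad Taher Pilehvar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: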
Abstract:The multifaceted challenge of accurately measuring gender stereotypical bias in language models is akin to discerning different segments of a broader, unseen entity. This short paper primarily focuses on intrinsic bias mitigation and measurement strategies for language models, building on prior research that demonstrates a lack of correlation between intrinsic and extrinsic approaches. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotype. Our methodology involves analyzing data distributions across datasets and integrating gender stereotype components informed by social psychology. By adjusting the distribution of two datasets, we achieve a better alignment of outcomes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
[NLP-26] Leveraging Full Dependency Parsing Graph Information For Biomedical Event Extraction
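【Quick read】: This paper addresses a weakness of the Shortest Dependency Path (SDP) representation used in Biomedical Event Extraction (BEE): missing even one word from the dependency parsing graph can change the final prediction substantially. To address this, the paper represents dependency relations with the full adjacency matrix of the dependency graph and embeds individual tokens with a Graph Convolutional Network (GCN). The experiments show that using dependency graph information significantly improves performance, and that the model slightly outperforms existing state-of-the-art models on different datasets.
Link: https://arxiv.org/abs/2501.01158
Authors: Farshad Noravesh,Reza Haffari,Ong Huey Fang,Layki Soon,Sailaja Rajalana,Arghya Pal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 figures, 4 tables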
Abstract:Many models are proposed in the literature on biomedical event extraction (BEE). Some of them use the shortest dependency path (SDP) information to represent the argument classification task. There is an issue with this representation since even missing one word from the dependency parsing graph may totally change the final prediction. To this end, the full adjacency matrix of the dependency graph is used to embed individual tokens using a graph convolutional network (GCN). An ablation study is also done to show the effect of the dependency graph on the overall performance. The results show a significant improvement when dependency graph information is used. The proposed model slightly outperforms state-of-the-art models on BEE over different datasets.
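As a minimal sketch of embedding tokens with one graph convolution over a full dependency adjacency matrix: the symmetric normalization with self-loops (the common Kipf and Welling form), the undirected adjacency, and all names below are assumptions for illustration. The abstract does not specify the paper's exact layer.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj    : n x n adjacency matrix of the dependency graph (assumed symmetric)
    feats  : n x d token feature matrix X
    weight : d x h weight matrix W (learned in practice, fixed here)
    """
    n = len(adj)
    # Add self-loops so each token keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


if __name__ == "__main__":
    # Two tokens joined by one dependency edge, 1-dimensional features.
    print(gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]]))
```

Unlike a shortest-path representation, every edge of the parse contributes to the aggregation here, so a single missing word changes some neighborhoods rather than removing the whole input path.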
[NLP-27] BlockDialect: Block-wise Fine-grained Mixed Format for Energy-Efficient LLM Inference
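【Quick read】: This paper addresses the memory usage and computational cost challenges that come with the growing size of large language models (LLMs). Existing quantization methods struggle to capture fine-grained block data distributions. The paper proposes BlockDialect, a block-wise fine-grained mixed format technique that assigns each block an optimal number format from a formatbook to represent the data better. It also introduces DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions and ensure hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. Compared with the MXFP4 format, BlockDialect achieves accuracy gains of 11.40% and 6.90% on LLaMA3-8B and LLaMA2-7B, respectively, while staying only 5.89% and 3.31% below full precision when quantizing full-path matrix multiplication. By focusing on how to represent data rather than how to scale it, the approach offers a promising path towards energy-efficient LLM inference.
Link: https://arxiv.org/abs/2501.01144
Authors: Wonsuk Jang,Thierry Tambe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: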
Abstract:Large Language Models (LLMs) have achieved remarkable success, but their increasing size poses significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with fine-grained block-wise quantization emerging as a promising hardware-supported solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. To address this, we propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. Importantly, DialectFP4 ensures hardware efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. Furthermore, we propose a two-stage approach for online DialectFP4 activation quantization. BlockDialect achieves 11.40% (6.90%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with a comparable bit usage per data, while being only 5.89% (3.31%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
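The idea of choosing a number format per block can be sketched as follows. This is a simplified illustration under stated assumptions: the two-entry `FORMATBOOK` is made up (one entry lists the FP4 E2M1 magnitudes, the other is a hypothetical variant), the selection criterion is plain squared error, and the arithmetic is floating point. It is not the paper's DialectFP4 formatbook, its selection rule, or its scaled-integer arithmetic.

```python
import math

# Illustrative formatbook (an assumption, NOT the paper's DialectFP4):
# each entry lists the non-negative magnitudes a 4-bit format can represent.
FORMATBOOK = {
    "fp4_e2m1": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
    "hypothetical_variant": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0],
}


def quantize_block(block, levels):
    """Scale the block so its largest magnitude maps to the largest level,
    then round every value to the nearest representable magnitude."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / max(levels)
    quantized = []
    for x in block:
        mag = min(levels, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(math.copysign(mag * scale, x))
    return quantized


def pick_format(block, formatbook=FORMATBOOK):
    """Return (format name, quantized block) with the lowest squared error."""
    best = None
    for name, levels in formatbook.items():
        q = quantize_block(block, levels)
        err = sum((a - b) ** 2 for a, b in zip(block, q))
        if best is None or err < best[2]:
            best = (name, q, err)
    return best[0], best[1]


if __name__ == "__main__":
    print(pick_format([5.0, 4.0, 1.0, 0.0]))
    print(pick_format([6.0, 3.0]))
```

Two blocks with the same maximum can still prefer different formats, depending on where their remaining values fall between representable levels, which is the motivation for a per-block choice.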
[NLP-28] TED: Turn Emphasis with Dialogue Feature Attention for Emotion Recognition in Conversation
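【Quick read】: This paper addresses how to model multi-turn dialogue context effectively in Emotion Recognition in Conversation (ERC). When handling multi-turn input, existing pretrained models usually distinguish the current turn from other turns only implicitly, by inserting special tokens into the input sequence, which may not fully capture the complex context of a dialogue. The paper proposes a priority-based attention method called Turn Emphasis with Dialogue (TED), which distinguishes each turn explicitly by adding dialogue features, such as turn position and speaker information, into the attention mechanism. TED computes multi-head self-attention between turn-based vectors of the multi-turn input and adjusts the attention scores with the dialogue features, so that emotional information across turns is captured better. Experiments show that TED performs well on several benchmark datasets and achieves state-of-the-art performance on IEMOCAP, which contains numerous turns.
Link: https://arxiv.org/abs/2501.01123
Authors: Junya Ono,Hiromi Wakaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: past activity in 2021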
Abstract:Emotion recognition in conversation (ERC) has been attracting attention by methods for modeling multi-turn contexts. The multi-turn input to a pretraining model implicitly assumes that the current turn and other turns are distinguished during the training process by inserting special tokens into the input sequence. This paper proposes a priority-based attention method to distinguish each turn explicitly by adding dialogue features into the attention mechanism, called Turn Emphasis with Dialogue (TED). It has a priority for each turn according to turn position and speaker information as dialogue features. It takes multi-head self-attention between turn-based vectors for multi-turn input and adjusts attention scores with the dialogue features. We evaluate TED on four typical benchmarks. The experimental results demonstrate that TED has high overall performance in all datasets and achieves state-of-the-art performance on IEMOCAP with numerous turns.
[NLP-29] MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
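【Quick read】: This paper aims to improve self-supervised learning (SSL) models for music informatics understanding tasks. It proposes MuQ, a self-supervised music representation learning model for music understanding tasks such as music tagging, instrument classification, and key detection. Unlike previous studies that adopt random projection or existing neural codecs, MuQ is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Mel-RVQ quantizes the Mel spectrum with a residual linear projection structure, which improves the stability and efficiency of target extraction and leads to better performance. Experiments show that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data, and that scaling the data to over 160K hours and adopting iterative training improve performance further. The paper also presents MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in zero-shot music tagging on the MagnaTagATune dataset.
Link: https://arxiv.org/abs/2501.01108
Authors: Haina Zhu,Yizhi Zhou,Hangting Chen,Jianwei Yu,Ziyang Ma,Rongzhi Gu,Wei Tan,Xie Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: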
Abstract:Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in this https URL.
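Residual vector quantization in its generic form can be sketched as below: each stage quantizes whatever the previous stages left unexplained, and the chosen codeword indices serve as tokens. This is generic RVQ with tiny hand-written codebooks, an assumption for illustration; it does not reproduce Mel-RVQ's residual linear projection structure or its training.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    # Hypothetical 2-stage codebooks over 2-dimensional vectors.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    tokens, residual = rvq_encode([1.25, 0.75], codebooks)
    print(tokens, residual, rvq_decode(tokens, codebooks))
```

In a masked-prediction setup of the kind the abstract describes, the per-stage indices would be the targets the model learns to predict.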
[NLP-30] BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
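【Quick read】: This paper addresses a key limitation of headline generation for Bengali religious news: existing approaches typically rely solely on the article content and overlook important contextual features such as sentiment, category, and aspect, which limits generation quality. The paper proposes MultiGen, a contextual multi-input feature fusion approach that combines the news content with additional contextual features (category, aspect, and sentiment) and uses transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0 to generate more context-aware headlines. Experiments show that MultiGen outperforms the content-only baseline on both BLEU and ROUGE-L, demonstrating the importance of contextual features for headline generation in low-resource languages.
Link: https://arxiv.org/abs/2501.01069
Authors: Md Osama,Ashim Dey,Kawsar Ahmed,Muhammad Ashad Kabir
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 28 pages, 4 figures, 11 tables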
Abstract:Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at this https URL.
[NLP-31] Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models
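【Quick read】: This paper addresses context faithfulness hallucinations in large language models (LLMs), where model outputs deviate from the retrieved information. These hallucinations typically arise from insufficient context utilization and high output uncertainty. Uncertainty evaluation experiments reveal a strong correlation between high uncertainty and hallucinations, and the authors hypothesize that attention mechanisms encode signals indicative of context utilization. Based on these findings, the paper proposes Dynamic Attention-Guided Context Decoding (DAGCD), a framework that integrates attention distributions and uncertainty signals in a single-pass decoding process. Experiments on several QA datasets show that DAGCD significantly improves the faithfulness and robustness of outputs while maintaining computational efficiency.
Link: https://arxiv.org/abs/2501.01059
Authors: Yanwen Huang,Yong Zhang,Ning Cheng,Zhitao Li,Shaojun Wang,Jing Xiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: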
Abstract:Large language models (LLMs) often suffer from context faithfulness hallucinations, where outputs deviate from retrieved information due to insufficient context utilization and high output uncertainty. Our uncertainty evaluation experiments reveal a strong correlation between high uncertainty and hallucinations. We hypothesize that attention mechanisms encode signals indicative of contextual utilization, validated through probing analysis. Based on these insights, we propose Dynamic Attention-Guided Context Decoding (DAGCD), a lightweight framework that integrates attention distributions and uncertainty signals in a single-pass decoding process. Experiments across QA datasets demonstrate DAGCD’s effectiveness, achieving significant improvements in faithfulness and robustness while maintaining computational efficiency.
[NLP-32] Risks of Cultural Erasure in Large Language Models
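【Quick read】: This paper examines the use of Large Language Models (LLMs) in the production and discovery of knowledge about global cultures and their potential impact, in particular how these models shape the way people perceive and interact with global cultures. It notes that current research mainly evaluates gaps in the distribution of global cultural representation in language model outputs, but lacks deeper evaluation of cross-cultural impacts, especially a nuanced, sociologically informed analysis of cultural impact or harm. The paper therefore argues for metricizable evaluations that examine how language technologies affect historical power inequities and differences in cultural representation, with particular attention to cultures that are under-represented in digital corpora. It focuses on two concepts of erasure: omission, where a culture is not represented at all, and simplification, where cultural complexity is erased by presenting a one-dimensional view. The study analyzes two task contexts: the cultural representations a language model produces when describing different places around the world, and the cultures represented in travel recommendations generated by language model applications. Its core contribution is to show how the NLP community and application developers can operationalize complex socio-cultural considerations in standard evaluations and benchmarks.
Link: https://arxiv.org/abs/2501.01056
Authors: Rida Qadri,Aida M. Davani,Kevin Robinson,Vinodkumar Prabhakaran
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: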
Abstract:Large language models are increasingly being integrated into applications that shape the production and discovery of societal knowledge such as search, online education, and travel planning. As a result, language models will shape how people learn about, perceive and interact with global cultures making it important to consider whose knowledge systems and perspectives are represented in models. Recognizing this importance, increasingly work in Machine Learning and NLP has focused on evaluating gaps in global cultural representational distribution within outputs. However, more work is needed on developing benchmarks for cross-cultural impacts of language models that stem from a nuanced sociologically-aware conceptualization of cultural impact or harm. We join this line of work arguing for the need of metricizable evaluations of language technologies that interrogate and account for historical power inequities and differential impacts of representation on global cultures, particularly for cultures already under-represented in the digital corpora. We look at two concepts of erasure: omission: where cultures are not represented at all and simplification i.e. when cultural complexity is erased by presenting one-dimensional views of a rich culture. The former focuses on whether something is represented, and the latter on how it is represented. We focus our analysis on two task contexts with the potential to influence global cultural production. First, we probe representations that a language model produces about different places around the world when asked to describe these contexts. Second, we analyze the cultures represented in the travel recommendations produced by a set of language model applications. Our study shows ways in which the NLP community and application developers can begin to operationalize complex socio-cultural considerations into standard evaluations and benchmarks.
[NLP-33] Dynamic Scaling of Unit Tests for Code Reward Modeling
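【Quick read】: This paper addresses the difficulty that current large language models (LLMs) have in producing accurate responses on the first attempt for complex reasoning tasks such as code generation. Prior work tackles this by generating multiple candidate solutions and validating them with LLM-generated unit tests, but these unit tests are unreliable, which lowers the quality of the reward signal. The key idea of the paper is to improve reward signal quality by scaling up the number of unit tests. It proposes CodeRM-8B, a lightweight yet effective unit test generator that enables efficient, high-quality unit test scaling, and introduces a dynamic scaling mechanism that adapts the number of unit tests to problem difficulty for further efficiency. Experiments show that the approach significantly improves the performance of various models on several benchmarks.
Link: https://arxiv.org/abs/2501.01054
Authors: Zeyao Ma,Xiaokang Zhang,Jing Zhang,Jifan Yu,Sijia Luo,Jie Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Homepage: this https URL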
Abstract:Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. As LLMs always confidently make mistakes, these unit tests are not reliable, thereby diminishing the quality of reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pioneer experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
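The idea of using unit-test execution results as a reward signal can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy task (squaring an integer), the candidate functions, and the helper names `reward` and `select_best` are hypothetical and do not come from the paper. The sketch only shows why a single unreliable test can mislead selection when few tests are available, and why adding more tests can correct it.

```python
from typing import Callable, List, Tuple

Solution = Callable[[int], int]
UnitTest = Tuple[int, int]  # (input, expected output); a generated test may itself be wrong


def reward(solution: Solution, tests: List[UnitTest]) -> int:
    """Count how many unit tests a candidate solution passes."""
    passed = 0
    for arg, expected in tests:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed


def select_best(solutions: List[Solution], tests: List[UnitTest]) -> int:
    """Return the index of the candidate with the highest reward (first one on ties)."""
    rewards = [reward(s, tests) for s in solutions]
    return max(range(len(solutions)), key=rewards.__getitem__)


# Hypothetical toy task: square an integer.
candidates: List[Solution] = [
    lambda x: x + x,   # wrong: doubles the input
    lambda x: x * x,   # correct
    lambda x: abs(x),  # wrong
]

# (1, 2) is a deliberately incorrect test, standing in for an unreliable generated test.
few_tests: List[UnitTest] = [(2, 4), (1, 2)]
many_tests: List[UnitTest] = few_tests + [(3, 9), (4, 16), (-2, 4)]

best_with_few = select_best(candidates, few_tests)    # misled by the bad test
best_with_many = select_best(candidates, many_tests)  # correct tests outvote the bad one
print(best_with_few, best_with_many)
```

In this toy setting the wrong "doubling" candidate wins with two tests, while the correct candidate wins once the test set grows. The dynamic scaling mechanism described in the abstract, which adapts the number of tests to problem difficulty, is not modeled here.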
[NLP-34] FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
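[Quick read]: This paper addresses the limited processing efficiency of existing methods for deduplicating large-scale datasets. Although NVIDIA has introduced a GPU-based MinHash LSH deduplication method, it leaves room for improvement. The proposed solution is FED, a deduplication framework that optimizes MinHash LSH for GPU clusters and uses computationally efficient, partially reusable non-cryptographic hash functions. When processing 1 million documents on a node with four GPUs, FED is up to 58.3 times faster than the CPU-based deduplication tool in SlimPajama and up to 8.6 times faster than the GPU-based tool in NVIDIA NeMo Curator. In a four-node, 16-GPU environment, it deduplicates 1.2 trillion tokens in only 5.1 hours.
Link: https://arxiv.org/abs/2501.01046
Authors: Youngjun Son,Chaewon Kim,Jaejin Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures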
Abstract:Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving training performance and efficiency of LLMs. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework FED that optimizes MinHash LSH for GPU clusters and leverages computationally efficient and partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool included in SlimPajama by up to 58.3 times and the GPU-based deduplication tool included in NVIDIA NeMo Curator by up to 8.6 times when processing 1 million documents with a node of four GPUs. Deduplication of 1.2 trillion tokens is completed in just 5.1 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (this https URL).
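For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the generic algorithm, not of the paper's GPU implementation. The shingle size, the number of hash functions, the band count, and the choice of `zlib.crc32` followed by a universal hash `(a * x + b) mod p` are all assumptions chosen for illustration; the specific non-cryptographic hash functions used by the framework are not described in the abstract.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the universal hash family (assumed choice)


def shingles(text, k=5):
    """Set of character k-grams of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def make_hash_params(num_hashes, seed=0):
    """Random (a, b) coefficients defining num_hashes hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(text, params):
    """MinHash signature: for each hash function, the minimum value over all shingles."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingles(text)]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def lsh_candidate_pairs(signatures, bands):
    """Documents sharing an identical band of their signature become candidate duplicates."""
    rows = len(signatures[0]) // bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return pairs


docs = [
    "Dataset deduplication improves the quality of LLM training data.",
    "An unrelated sentence about sliding window attention in Transformers.",
    "Dataset deduplication improves the quality of LLM training data.",
]
params = make_hash_params(num_hashes=32)
signatures = [minhash_signature(d, params) for d in docs]
pairs = lsh_candidate_pairs(signatures, bands=8)
print(sorted(pairs))
```

Exact duplicates always share every band, so documents 0 and 2 are guaranteed to appear as a candidate pair. Near-duplicates are caught with a probability that depends on their Jaccard similarity and on the band and row settings.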
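[NLP-35] MSWA: Refining Local Attention with Multi-Scale Window Attention
[Quick read]: This paper addresses two problems of the standard self-attention mechanism in Transformer-based large language models (LLMs): quadratic time complexity and a cache size that grows linearly. Sliding window attention (SWA) mitigates both by restricting attention to a fixed-size local context window, but it uses the same window size for every head in every layer, which makes it inefficient at capturing context at different scales. The paper proposes Multi-Scale Window Attention (MSWA), which applies different window sizes across heads within the same layer and progressively increases the window size allocation from shallow to deep layers, so the model can capture contextual information of different lengths and distances. Experiments on language modeling and common-sense reasoning tasks show that MSWA outperforms traditional local attention in both effectiveness and efficiency.
Link: https://arxiv.org/abs/2501.01039
Authors: Yixing Xu,Shivank Nag,Dong Li,Lu Tian,Emad Barsoum
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: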
Abstract:Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
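The multi-scale idea can be sketched by building one causal sliding-window mask per head, with window sizes that vary across heads and grow with depth. The allocation rule below (splitting heads or layers evenly over the scale factors 1/4, 1/2, 1 and 2 of a base window) is an assumption made for illustration; the abstract does not state the exact allocation, and the function names `window_mask` and `allocate` are hypothetical.

```python
def window_mask(seq_len, window):
    """Causal sliding-window mask: query i may attend to key j iff 0 <= i - j < window."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def allocate(base, n, scales=(0.25, 0.5, 1.0, 2.0)):
    """Spread n heads (or layers) evenly over the scale factors (assumed allocation rule)."""
    per_group = max(1, n // len(scales))
    return [
        max(1, int(base * scales[min(i // per_group, len(scales) - 1)]))
        for i in range(n)
    ]


# Hypothetical configuration: 4 layers, 4 heads per layer, base window of 8 tokens.
layer_bases = allocate(8, 4)                      # base window grows from shallow to deep layers
windows = [allocate(b, 4) for b in layer_bases]   # per-head windows inside each layer

# One boolean mask per head of the first layer, for a sequence of 6 tokens.
masks_layer0 = [window_mask(6, w) for w in windows[0]]
for layer, head_windows in enumerate(windows):
    print(layer, head_windows)
```

Under these assumptions the shallowest layer uses windows of 1 to 4 tokens and the deepest layer uses windows of 4 to 32 tokens, so different heads see context at different scales while each head still attends only locally.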
[NLP-36] Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
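[Quick read]: This paper addresses the lack of research on the spoken form of Singlish (Singapore English), which limits understanding of its linguistic structure and applications in multilingual and multicultural contexts. To close this gap, the authors standardize and annotate the largest spoken Singlish corpus and introduce the Multitask National Speech Corpus (MNSC). The corpus supports several tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). The authors also propose SingAudioLLM, a multi-task multimodal model that uses multimodal large language models to handle these tasks concurrently. Experiments show that the model adapts well to the Singlish context and outperforms other AudioLLMs and cascaded solutions by 10-30%.
Link: https://arxiv.org/abs/2501.01034
Authors: Bin Wang,Xunlong Zou,Shuo Sun,Wenyu Zhang,Yingxu He,Zhuohan Liu,Chengwei Wei,Nancy F. Chen,AiTi Aw
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Open-Source: this https URL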
Abstract:Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our models adaptability to Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.
[NLP-37] ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning
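[Quick read]: This paper addresses cultural values alignment of large language models (LLMs) in cross-cultural contexts, in particular the misrepresentation and fairness issues caused by Western-centric biases in training data. Existing approaches such as role assignment and few-shot learning rely heavily on pre-trained knowledge, lack scalability, and fail to capture nuanced cultural values. The paper proposes ValuesRAG, a framework that combines Retrieval-Augmented Generation (RAG) with in-context learning to integrate cultural and demographic knowledge dynamically during text generation. ValuesRAG uses the World Values Survey (WVS) dataset to generate a values summary for each individual, then selects the most relevant summaries through retrieval and reranking steps. ValuesRAG outperforms baseline methods in both the main experiment and the ablation study, showing its potential to foster culturally aligned AI systems and make AI applications more inclusive.
Link: https://arxiv.org/abs/2501.01031
Authors: Wonduk Seo,Zonghao Yuan,Yi Bu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: preprint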
Abstract:Cultural values alignment in Large Language Models (LLMs) is a critical challenge due to their tendency to embed Western-centric biases from training data, leading to misrepresentations and fairness issues in cross-cultural contexts. Recent approaches, such as role-assignment and few-shot learning, often struggle with reliable cultural alignment as they heavily rely on pre-trained knowledge, lack scalability, and fail to capture nuanced cultural values effectively. To address these issues, we propose ValuesRAG, a novel and effective framework that applies Retrieval-Augmented Generation (RAG) with in-context learning to integrate cultural and demographic knowledge dynamically during text generation. Leveraging the World Values Survey (WVS) dataset, ValuesRAG first generates summaries of values for each individual. Subsequently, we curated several representative regional datasets to serve as test datasets and retrieve relevant summaries of values based on demographic features, followed by a reranking step to select the top-k relevant summaries. ValuesRAG consistently outperforms baseline methods, both in the main experiment and in the ablation study where only the values summary was provided, highlighting ValuesRAG’s potential to foster culturally aligned AI systems and enhance the inclusivity of AI-driven applications.
[NLP-38] Reasoning based on symbolic and parametric knowledge bases: a survey
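[Quick read]: This paper addresses the absence of a systematic survey of reasoning methods from the perspective of the knowledge base they depend on. Knowledge bases differ considerably in both their application scenarios and their storage formats, and existing surveys do not account for these differences, which limits understanding of the challenges and future directions of reasoning methods. The paper first classifies knowledge bases into symbolic knowledge bases, which store information explicitly in human-readable symbols, and parametric knowledge bases, which encode knowledge implicitly within parameters. It then gives a comprehensive overview of reasoning methods that use symbolic knowledge bases, parametric knowledge bases, and both combined, and identifies future directions for enhancing reasoning capabilities to narrow the gap between human and machine intelligence. The key contribution is the systematic classification and analysis of reasoning methods from the knowledge base perspective.
Link: https://arxiv.org/abs/2501.01030
Authors: Mayi Xu,Yunfeng Ning,Yongqi Li,Jianhao Chen,Jintao Wen,Yao Xiao,Shen Zhou,Birong Pan,Zepeng Bao,Xin Miao,Hankun Kang,Ke Sun,Tieyun Qian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: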
Abstract:Reasoning is fundamental to human intelligence, and critical for problem-solving, decision-making, and critical thinking. Reasoning refers to drawing new conclusions based on existing knowledge, which can support various applications like clinical diagnosis, basic education, and financial analysis. Though a good number of surveys have been proposed for reviewing reasoning-related methods, none of them has systematically investigated these methods from the viewpoint of their dependent knowledge base. Both the scenarios to which the knowledge bases are applied and their storage formats are significantly different. Hence, investigating reasoning methods from the knowledge base perspective helps us better understand the challenges and future directions. To fill this gap, this paper first classifies the knowledge base into symbolic and parametric ones. The former explicitly stores information in human-readable symbols, and the latter implicitly encodes knowledge within parameters. Then, we provide a comprehensive overview of reasoning methods using symbolic knowledge bases, parametric knowledge bases, and both of them. Finally, we identify the future direction toward enhancing reasoning capabilities to bridge the gap between human and machine intelligence.
[NLP-39] KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
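[Quick read]: This paper addresses shortcomings in the training data quality of current general-purpose embedding models. As retrieval-augmented generation becomes widespread in large language models, embedding models grow in importance, yet prior work often overlooks the critical effect of training data quality on model performance. The paper introduces KaLM-Embedding, a general multilingual embedding model built on three key techniques: (1) persona-based synthetic data to create diversified training examples distilled from LLMs; (2) ranking consistency filtering to remove less informative samples; and (3) semi-homogeneous task batch sampling to improve training efficacy. Instead of a traditional BERT-like architecture, the authors adopt Qwen2-0.5B as the pre-trained model, adapting an auto-regressive language model to general embedding tasks. Evaluations on the multilingual MTEB benchmark show that KaLM-Embedding outperforms models of comparable size, setting a new standard for multilingual embedding models at roughly the 1B-parameter scale.
Link: https://arxiv.org/abs/2501.01028
Authors: Xinshuo Hu,Zifei Shan,Xinping Zhao,Zetian Sun,Zhenyu Liu,Dongfang Li,Shaolin Ye,Xinyuan Wei,Qian Chen,Baotian Hu,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Technical Report. 23 pages, 6 figures, 10 tables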
Abstract:As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations of the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with 1B parameters.
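The abstract names ranking consistency filtering without specifying the procedure. One plausible reading, sketched below purely as an assumption, is to keep a (query, positive document) training pair only when the labeled positive ranks within the top-k documents retrieved for that query by some scoring model. The toy embeddings, the cosine scorer, and the function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_of_positive(query_vec, positive_id, corpus):
    """1-based rank of the labeled positive when the corpus is sorted by similarity to the query."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True)
    return ranked.index(positive_id) + 1


def ranking_consistency_filter(pairs, queries, corpus, top_k):
    """Keep only pairs whose positive document ranks within the top_k results (assumed criterion)."""
    return [(q, d) for q, d in pairs if rank_of_positive(queries[q], d, corpus) <= top_k]


# Hypothetical toy embeddings standing in for the output of a scoring model.
corpus = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
queries = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
pairs = [("q1", "d1"), ("q2", "d1")]  # the second pair is a poorly matched training sample

kept = ranking_consistency_filter(pairs, queries, corpus, top_k=1)
print(kept)
```

In this toy example the pair ("q2", "d1") is dropped because "d1" ranks last for "q2", which is the kind of inconsistent sample such a filter is meant to remove.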
[NLP-40] MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model
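[Quick read]: This paper addresses the challenges that automated data analysis systems face in using large language models (LLMs) for data insight discovery, augmented analysis, and data storytelling in the big data setting. Specific problems include how to generate actionable insights efficiently, how to improve the contextual understanding of narratives, and how to reduce manual intervention. The paper proposes the LLM-based Multidimensional Data Storytelling Framework (MDSF), which combines advanced preprocessing techniques, augmented analysis algorithms, and a dedicated scoring mechanism to identify and prioritize actionable insights. Fine-tuned LLMs improve contextual understanding and generate narratives with minimal manual intervention, and an agent-based mechanism controls storytelling continuation in real time. Experiments show that MDSF outperforms existing methods in insight ranking accuracy, descriptive quality, and narrative coherence, while reducing interpretive biases and improving user satisfaction.
Link: https://arxiv.org/abs/2501.01014
Authors: Chengze Zhang,Changshan Li,Shiyang Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: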
Abstract:The exponential growth of data and advancements in big data technologies have created a demand for more efficient and automated approaches to data analysis and storytelling. However, automated data analysis systems still face challenges in leveraging large language models (LLMs) for data insight discovery, augmented analysis, and data storytelling. This paper introduces the Multidimensional Data Storytelling Framework (MDSF) based on large language models for automated insight generation and context-aware storytelling. The framework incorporates advanced preprocessing techniques, augmented analysis algorithms, and a unique scoring mechanism to identify and prioritize actionable insights. The use of fine-tuned LLMs enhances contextual understanding and generates narratives with minimal manual intervention. The architecture also includes an agent-based mechanism for real-time storytelling continuation control. Key findings reveal that MDSF outperforms existing methods across various datasets in terms of insight ranking accuracy, descriptive quality, and narrative coherence. The experimental evaluation demonstrates MDSF’s ability to automate complex analytical tasks, reduce interpretive biases, and improve user satisfaction. User studies further underscore its practical utility in enhancing content structure, conclusion extraction, and richness of detail.
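[NLP-41] Exploring Information Processing in Large Language Models: Insights from Information Bottleneck Theory
[Quick read]: This paper investigates the internal mechanisms by which large language models (LLMs) comprehend input and produce effective predictions, which remain poorly understood despite strong performance across many tasks. Working from the perspective of Information Bottleneck Theory, the paper proposes a non-training construction strategy to define a task space and reports two key findings: (1) LLMs compress input information into specific task spaces (for example, a sentiment space or a topic space) to facilitate task understanding; (2) at critical moments they extract and use relevant information from the task space to generate accurate predictions. Based on these findings, the paper introduces two methods: Information Compression-based Context Learning (IC-ICL) and Task-Space-guided Fine-Tuning (TS-FT). IC-ICL improves reasoning performance and inference efficiency by compressing retrieved example information into the task space, while TS-FT fine-tunes LLMs with a space-guided loss that encourages more effective compression and selection mechanisms. Experiments across multiple datasets validate the task space construction; IC-ICL improves performance while speeding up inference by more than 40%, and TS-FT achieves better results with only a minimal strategy adjustment.
Link: https://arxiv.org/abs/2501.00999
Authors: Zhou Yang,Zhengyu Qi,Zhaochun Ren,Zhikai Jia,Haizhou Sun,Xiaofei Zhu,Xiangwen Liao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures, 3 tables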
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks by understanding input information and predicting corresponding outputs. However, the internal mechanisms by which LLMs comprehend input and make effective predictions remain poorly understood. In this paper, we explore the working mechanism of LLMs in information processing from the perspective of Information Bottleneck Theory. We propose a non-training construction strategy to define a task space and identify the following key findings: (1) LLMs compress input information into specific task spaces (e.g., sentiment space, topic space) to facilitate task understanding; (2) they then extract and utilize relevant information from the task space at critical moments to generate accurate predictions. Based on these insights, we introduce two novel approaches: an Information Compression-based Context Learning (IC-ICL) and a Task-Space-guided Fine-Tuning (TS-FT). IC-ICL enhances reasoning performance and inference efficiency by compressing retrieved example information into the task space. TS-FT employs a space-guided loss to fine-tune LLMs, encouraging the learning of more effective compression and selection mechanisms. Experiments across multiple datasets validate the effectiveness of task space construction. Additionally, IC-ICL not only improves performance but also accelerates inference speed by over 40%, while TS-FT achieves superior results with a minimal strategy adjustment.
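[NLP-42] Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice
[Quick read]: This paper addresses how social media data can be used for rapid mental health screening. Traditional assessment relies on standardized questionnaires, which are time-consuming and depend on the user's active participation. The personal experiences and emotions that users share on social media platforms offer a new data source for assessment. The paper proposes a novel adaptive Retrieval-Augmented Generation (RAG) approach that completes psychological questionnaires by analyzing social media posts: for each question in a questionnaire it retrieves the most relevant user posts, then uses Large Language Models (LLMs) to predict the questionnaire scores in a zero-shot setting. The approach effectively predicts users' responses to questionnaires such as the Beck Depression Inventory II (BDI-II), matching or surpassing state-of-the-art models on Reddit-based benchmark datasets without relying on training data. It can also be generalized into a scalable screening tool, because the final assessment is derived by completing standardized questionnaires and tracking how individual item responses contribute to the diagnosis, in line with established psychometric practice.
Link: https://arxiv.org/abs/2501.00982
Authors: Federico Ravenda,Seyed Ali Bahrainian,Andrea Raballo,Antonietta Mira,Noriko Kando
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: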
Abstract:In psychological practice, standardized questionnaires serve as essential tools for assessing mental constructs (e.g., attitudes, traits, and emotions) through structured questions (aka items). With the increasing prevalence of social media platforms where users share personal experiences and emotions, researchers are exploring computational methods to leverage this data for rapid mental health screening. In this study, we propose a novel adaptive Retrieval-Augmented Generation (RAG) approach that completes psychological questionnaires by analyzing social media posts. Our method retrieves the most relevant user posts for each question in a psychological survey and uses Large Language Models (LLMs) to predict questionnaire scores in a zero-shot setting. Our findings are twofold. First we demonstrate that this approach can effectively predict users’ responses to psychological questionnaires, such as the Beck Depression Inventory II (BDI-II), achieving performance comparable to or surpassing state-of-the-art models on Reddit-based benchmark datasets without relying on training data. Second, we show how this methodology can be generalized as a scalable screening tool, as the final assessment is systematically derived by completing standardized questionnaires and tracking how individual item responses contribute to the diagnosis, aligning with established psychometric practices.
[NLP-43] 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
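[Quick read]: This paper addresses weaknesses in the interleaved image-text corpora used to train Vision-Language Models (VLMs): existing datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. The paper introduces a high-quality multimodal textbook corpus built from instructional videos for VLM pretraining. The corpus collects over 2.5 years of instructional videos (22,000 class hours in total), gathered systematically using a taxonomy proposed by a large language model (LLM). Visual information (keyframes), audio information (automatic speech recognition, ASR), and textual information (optical character recognition, OCR) are then progressively extracted and refined from the videos and organized in temporal order into an image-text interleaved corpus. Compared with existing datasets, this video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments show that VLMs pretrained on the corpus perform strongly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista and exhibit outstanding interleaved context awareness.
Link: https://arxiv.org/abs/2501.00958
Authors: Wenqi Zhang,Hang Zhang,Xin Li,Jiashuo Sun,Yongliang Shen,Weiming Lu,Deli Zhao,Yueting Zhuang,Lidong Bing
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review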
Abstract:Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at this https URL.
[NLP-44] Incremental Dialogue Management: Survey Discussion and Implications for HRI
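[Quick read]: This paper addresses a limitation of current natural language processing (NLP) systems in robot interaction: existing models operate on sentence-level or multi-sentence input rather than the word-by-word input that humans use in conversation, which reduces the responsiveness and naturalness of robots in spoken interaction with people. The paper reviews and motivates incremental interactive systems, which process input at the word level or below. Its main focus is incremental dialogue management, that is, the decision-making component known as the dialogue manager. The authors survey incremental modeling of key aspects of dialogue such as speech recognition and language generation, find that there is very little research on incremental dialogue management, propose requirements for practical incremental dialogue management, and discuss the implications for embodied robotic platforms.
Link: https://arxiv.org/abs/2501.00953
Authors: Casey Kennington,Pierre Lison,David Schlangen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages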
Abstract:Efforts towards endowing robots with the ability to speak have benefited from recent advancements in NLP, in particular large language models. However, as powerful as current models have become, they still operate on sentence or multi-sentence level input, not on the word-by-word input that humans operate on, affecting the degree of responsiveness that they offer, which is critical in situations where humans interact with robots using speech. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems, survey incremental modeling of important aspects of dialogue like speech recognition and language generation. Primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and the implications of incremental dialogue for embodied, robotic platforms.
[NLP-45] Aligning Netlist to Source Code using SynAlign
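[Quick read]: This paper addresses the loss of source code correlation that occurs in current chip design flows when multiple tools are used to produce a gate-level netlist. As a result, designers struggle to trace netlist cells back to the original source code during iterative design, which hurts design efficiency and effectiveness. SynAlign provides an automated alignment solution that maintains the correlation between source code and netlist across tools without modifying compilers or synthesis processes. Its strategy relies on the design structure remaining consistent throughout the chip design cycle, so the correlation between modified designs and the original source code is preserved even when the compiler flow changes. SynAlign can tolerate up to 61% design net changes without affecting alignment accuracy.
Link: https://arxiv.org/abs/2501.00921
Authors: Sakshi Garg,Jose Renau
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: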
Abstract:In current chip design processes, using multiple tools to obtain a gate-level netlist often results in the loss of source code correlation. SynAlign addresses this challenge by automating the alignment process, simplifying iterative design, reducing overhead, and maintaining correlation across various tools. This enhances the efficiency and effectiveness of chip design workflows. Improving characteristics such as frequency through iterative design is essential for enhancing accelerators and chip designs. While synthesis tools produce netlists with critical path information, designers often lack the tools to trace these netlist cells back to their original source code. Mapping netlist components to source code provides early feedback on timing and power for frontend designers. SynAlign automatically aligns post-optimized netlists with the original source code without altering compilers or synthesis processes. Its alignment strategy relies on the consistent design structure throughout the chip design cycle, even with changes in compiler flow. This consistency allows engineers to maintain a correlation between modified designs and the original source code across various tools. Remarkably, SynAlign can tolerate up to 61% design net changes without impacting alignment accuracy.
[NLP-46] AutoPresent: Designing Structured Visuals from Scratch
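[Quick read]: This paper addresses automated slide generation, in which models produce presentation slides from natural language (NL) instructions. The authors introduce SlidesBench, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports reference-based evaluation, which measures similarity between a generated slide and a target slide, and reference-free evaluation, which measures the design quality of the generated slide alone. The paper compares end-to-end image generation with program generation methods and finds that programmatic methods produce higher-quality slides in formats that users can interact with. Building on this, the authors create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, which achieves results comparable to the closed-source model GPT-4o. The paper also explores iterative design refinement, in which the model self-refines its own output, and finds that this process improves slide quality.
Link: https://arxiv.org/abs/2501.00912
Authors: Jiaxin Ge,Zora Zhiruo Wang,Xuhui Zhou,Yi-Hao Peng,Sanjay Subramanian,Qinyue Tan,Maarten Sap,Alane Suhr,Daniel Fried,Graham Neubig,Trevor Darrell
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: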
Abstract:Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i)reference-based to measure similarity to a target slide, and (ii)reference-free to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Built on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k pairs of instructions paired with code for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement where the model is tasked to self-refine its own output, and we found that this process improves the slide’s quality. We hope that our work will provide a basis for future work on generating structured visuals.
[NLP-47] U-GIFT: Uncertainty-Guided Firewall for Toxic Speech in Few-Shot Scenario
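[Quick read]: This paper addresses the detection of toxic speech in user-generated content on social media platforms, particularly in few-shot scenarios where labeled data is limited. Manual content moderation is costly and places psychological strain on moderators, so automated detection is needed, yet existing methods usually depend on large annotated datasets that are expensive and difficult to obtain. The paper proposes U-GIFT (Uncertainty-Guided Firewall for Toxic Speech), which combines active learning with Bayesian Neural Networks (BNNs) and uses self-training to identify high-quality samples in unlabeled data automatically, prioritizing pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Experiments show that U-GIFT significantly outperforms competitive baselines in few-shot detection, with a 14.92% performance improvement over the basic model in the 5-shot setting. U-GIFT is also user-friendly, adaptable to various pre-trained language models, and robust in sample-imbalance and cross-domain settings, offering an efficient solution for automated content moderation in cyberspace.
Link: https://arxiv.org/abs/2501.00907
Authors: Jiaxin Song,Xinyu Wang,Yihao Wang,Yifan Tang,Ru Zhang,Jianyi Liu,Gongshen Liu
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 6 figures and 10 tables. Comments are welcome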
Abstract:With the widespread use of social media, user-generated content has surged on online platforms. When such content includes hateful, abusive, offensive, or cyberbullying behavior, it is classified as toxic speech, posing a significant threat to the online ecosystem’s integrity and safety. While manual content moderation is still prevalent, the overwhelming volume of content and the psychological strain on human moderators underscore the need for automated toxic speech detection. Previously proposed detection methods often rely on large annotated datasets; however, acquiring such datasets is both costly and challenging in practice. To address this issue, we propose an uncertainty-guided firewall for toxic speech in few-shot scenarios, U-GIFT, that utilizes self-training to enhance detection performance even when labeled data is limited. Specifically, U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data, prioritizing the selection of pseudo-labels with higher confidence for training based on uncertainty estimates derived from model predictions. Extensive experiments demonstrate that U-GIFT significantly outperforms competitive baselines in few-shot detection scenarios. In the 5-shot setting, it achieves a 14.92% performance improvement over the basic model. Importantly, U-GIFT is user-friendly and adaptable to various pre-trained language models (PLMs). It also exhibits robust performance in scenarios with sample imbalance and cross-domain settings, while showcasing strong generalization across various language applications. We believe that U-GIFT provides an efficient solution for few-shot toxic speech detection, offering substantial support for automated content moderation in cyberspace, thereby acting as a firewall to promote advancements in cybersecurity.
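Uncertainty-guided pseudo-label selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: it assumes that several stochastic forward passes (as a Bayesian neural network would provide) are already available per unlabeled sample, and it uses the entropy of the mean predictive distribution as the uncertainty measure. The threshold, the sample data, and the function names are hypothetical.

```python
import math


def predictive_entropy(prob_samples):
    """Mean class distribution over stochastic passes, and the entropy of that mean."""
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean = [sum(p[c] for p in prob_samples) / n for c in range(num_classes)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy


def select_pseudo_labels(unlabeled, max_entropy):
    """Keep samples whose predictive entropy is low; return them most-confident first."""
    selected = []
    for sample_id, prob_samples in unlabeled.items():
        mean, entropy = predictive_entropy(prob_samples)
        if entropy <= max_entropy:
            label = max(range(len(mean)), key=mean.__getitem__)
            selected.append((sample_id, label, entropy))
    return sorted(selected, key=lambda item: item[2])


# Hypothetical outputs of three stochastic passes per sample: [P(non-toxic), P(toxic)].
unlabeled = {
    "a": [[0.95, 0.05], [0.90, 0.10], [0.97, 0.03]],  # consistently non-toxic
    "b": [[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]],  # passes disagree: high uncertainty
    "c": [[0.02, 0.98], [0.05, 0.95], [0.03, 0.97]],  # consistently toxic
}
picked = select_pseudo_labels(unlabeled, max_entropy=0.3)
print([(sample_id, label) for sample_id, label, _ in picked])
```

Sample "b" is excluded because its passes disagree, while "a" and "c" receive pseudo-labels that could be added to the training set in the next self-training round.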
[NLP-48] Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization
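[Quick read]: This paper addresses how to construct coherent timelines from large volumes of event-related content in fast-changing information domains. The task is especially complex when related documents must be aggregated into a meaningful event graph around a central topic. The proposed solution is CHRONOS (Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning). Its key idea is to have Large Language Models (LLMs) iteratively question themselves: they reflect on how events are linked and pose new questions about a specific news topic to gather information online or from an offline knowledge base, then produce and update chronological summaries based on the documents retrieved in each round. The paper also curates Open-TLS, a new dataset for evaluating open-domain timeline summarization, where information overload makes it impossible to find comprehensive relevant documents on the web. Experiments show that CHRONOS performs well on open-domain timeline summarization and rivals existing state-of-the-art systems in closed-domain applications.
Link: https://arxiv.org/abs/2501.00888
Authors: Weiqi Wu,Shen Huang,Yong Jiang,Pengjun Xie,Fei Huang,Hai Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: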
Abstract:In the fast-changing realm of information, the capacity to construct coherent timelines from extensive event-related content has become increasingly significant and challenging. The complexity arises in aggregating related documents to build a meaningful event graph around a central topic. This paper proposes CHRONOS - Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning, which offers a fresh perspective on the integration of Large Language Models (LLMs) to tackle the task of Timeline Summarization (TLS). By iteratively reflecting on how events are linked and posing new questions regarding a specific news topic to gather information online or from an offline knowledge base, LLMs produce and refresh chronological summaries based on documents retrieved in each round. Furthermore, we curate Open-TLS, a novel dataset of timelines on recent news topics authored by professional journalists to evaluate open-domain TLS where information overload makes it impossible to find comprehensive relevant documents from the web. Our experiments indicate that CHRONOS is not only adept at open-domain timeline summarization, but it also rivals the performance of existing state-of-the-art systems designed for closed-domain applications, where a related news corpus is provided for summarization.
[NLP-49] Representation in large language models
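[Quick read]: The core question of this paper is whether the behavior of large language models (LLMs) is driven partly by representation-based information processing, similar to that found in biological cognition, or entirely by memorization and stochastic table look-up. This is a question about what kind of algorithm LLMs implement, and the answer bears on higher-level questions such as whether these systems have beliefs, intentions, concepts, knowledge, and understanding. The paper argues that LLM behavior is indeed partly driven by representation-based information processing, and it describes a set of practical techniques for investigating and explaining these representations, laying the groundwork for future theorizing about language models and their successors.
Link: https://arxiv.org/abs/2501.00885
Authors: Cameron C. Yetman
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Draft of paper under review. 27 pages, 2 figures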
Abstract:The extraordinary success of recent Large Language Models (LLMs) on a diverse array of tasks has led to an explosion of scientific and philosophical theorizing aimed at explaining how they do what they do. Unfortunately, disagreement over fundamental theoretical issues has led to stalemate, with entrenched camps of LLM optimists and pessimists often committed to very different views of how these systems work. Overcoming stalemate requires agreement on fundamental questions, and the goal of this paper is to address one such question, namely: is LLM behavior driven partly by representation-based information processing of the sort implicated in biological cognition, or is it driven entirely by processes of memorization and stochastic table look-up? This is a question about what kind of algorithm LLMs implement, and the answer carries serious implications for higher level questions about whether these systems have beliefs, intentions, concepts, knowledge, and understanding. I argue that LLM behavior is partially driven by representation-based information processing, and then I describe and defend a series of practical techniques for investigating these representations and developing explanations on their basis. The resulting account provides a groundwork for future theorizing about language models and their successors.
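[NLP-50] TrustRAG: Enhancing Robustness and Trustworthiness in RAG
[Quick read]: This paper addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to corpus poisoning attacks, which inject malicious content and can significantly degrade the performance of large language models (LLMs). It proposes the TrustRAG framework, whose core is a two-stage defense that filters compromised and irrelevant content. First, K-means clustering over the semantic embeddings of retrieved documents identifies potential attack patterns and isolates suspicious content. Second, cosine similarity and ROUGE metrics are used to detect malicious documents, and a self-assessment process resolves discrepancies between the model's internal knowledge and external information. TrustRAG is a plug-and-play, training-free module that can be integrated with any language model, and the authors report substantial improvements in retrieval accuracy, efficiency, and attack resistance.
Link: https://arxiv.org/abs/2501.00879
Authors: Huichi Zhou,Kin-Hei Lee,Zhonghao Zhan,Yue Chen,Zhenhao Li
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: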
Abstract:Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user queries. However, these systems remain vulnerable to corpus poisoning attacks that can significantly degrade LLM performance through the injection of malicious content. To address these challenges, we propose TrustRAG, a robust framework that systematically filters compromised and irrelevant content before it reaches the language model. Our approach implements a two-stage defense mechanism: first, it employs K-means clustering to identify potential attack patterns in retrieved documents based on their semantic embeddings, effectively isolating suspicious content. Second, it leverages cosine similarity and ROUGE metrics to detect malicious documents while resolving discrepancies between the model’s internal knowledge and external information through a self-assessment process. TrustRAG functions as a plug-and-play, training-free module that integrates seamlessly with any language model, whether open or closed-source, maintaining high contextual relevance while strengthening defenses against attacks. Through extensive experimental validation, we demonstrate that TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches across multiple model architectures and datasets. We have made TrustRAG available as open-source software at this https URL.
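To make the similarity-based screening idea in the abstract concrete, here is a minimal sketch of a pairwise cosine-similarity check that flags retrieved passages whose embeddings are near-duplicates of one another. It is not TrustRAG's implementation: the function names, the toy 3-d vectors, and the 0.95 threshold are all hypothetical, and the K-means and ROUGE stages are not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_suspicious(embeddings, threshold=0.95):
    """Return indices of passages that are near-duplicates of another passage.

    threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    flagged = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                flagged.update((i, j))
    return sorted(flagged)

# Toy "embeddings": passages 1 and 2 point in almost the same direction.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1], [0.0, 1.0, 0.12], [0.5, 0.0, 0.9]]
suspicious = flag_suspicious(docs)  # indices of the near-duplicate pair
```

The underlying assumption, which may not hold for every attack, is that injected passages are written to answer the same query and therefore sit unusually close together in embedding space.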
[NLP-51] LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
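[Quick read]: This paper addresses the limitations of LLM-based embedding models on multilingual tasks. These models currently focus mainly on English, and their multilingual embedding capabilities remain largely unexplored. The paper proposes LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models to multilingual tasks without multilingual supervision. The key is its architecture, which combines a multilingual encoder acting as a language-universal learner with an LLM-based embedding model optimized for embedding tasks. The two components are joined by a minimal set of trainable parameters that act as a connector, transferring the multilingual encoder's language understanding to the specialized embedding model. The paper also introduces a new benchmark covering 5 primary embedding tasks, 123 diverse datasets, and 14 languages. Experiments show that LUSIFER significantly improves multilingual embedding performance, particularly for medium- and low-resource languages, without explicit multilingual training data.
Link: https://arxiv.org/abs/2501.00874
Authors: Hieu Man,Nghia Trung Ngo,Viet Dac Lai,Ryan A. Rossi,Franck Dernoncourt,Thien Huu Nguyen
Affiliations: unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: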
Abstract:Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER’s architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder’s language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.
[NLP-52] Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation AAAI2025
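[Quick read]: This paper addresses simultaneous generation, where a model must decide when to emit output while it is still reading streaming input. Existing simultaneous generation methods usually adopt a traditional encoder-decoder architecture and learn both generation and policy-making through complex dynamic programming techniques. Although large language models (LLMs) excel at text generation, they are hard to train as policy-makers with traditional methods, which has limited their use in simultaneous generation. The paper proposes a novel LLM-driven Simultaneous Generation (LSG) framework that lets an off-the-shelf LLM decide the generation timing and produce output at the same time. The key is to take the latency-minimizing generation policy as a baseline policy and have the LLM devise an improved policy relative to it, one that better balances latency and generation quality. The method achieves state-of-the-art performance on simultaneous translation and streaming automatic speech recognition.
Link: https://arxiv.org/abs/2501.00868
Authors: Shoutao Guo,Shaolei Zhang,Zhengrui Ma,Yang Feng
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at AAAI 2025. 13 pages, 7 tables, 10 figures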
Abstract:Simultaneous generation models write generation results while reading streaming inputs, necessitating a policy-maker to determine the appropriate output timing. Existing simultaneous generation methods generally adopt the traditional encoder-decoder architecture and learn the generation and policy-making capabilities through complex dynamic programming techniques. Although LLMs excel at text generation, they face challenges in taking on the role of policy-makers through traditional training methods, limiting their exploration in simultaneous generation. To overcome these limitations, we propose a novel LLM-driven Simultaneous Generation (LSG) framework, which allows the off-the-shelf LLM to decide the generation timing and produce output concurrently. Specifically, LSG selects the generation policy that minimizes latency as the baseline policy. Referring to the baseline policy, LSG enables the LLM to devise an improved generation policy that better balances latency and generation quality, and writes generation results accordingly. Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method can achieve state-of-the-art performance utilizing the open-source LLMs and demonstrate practicality in real-world scenarios.
[NLP-53] Negative to Positive Co-learning with Aggressive Modality Dropout
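[Quick read]: This paper addresses negative co-learning (NCL) in multimodal co-learning, that is, the phenomenon in which a multimodal model's performance degrades during co-learning. The key solution is aggressive modality dropout. With this technique, the authors reverse negative co-learning into positive co-learning (PCL) and observe a 20% accuracy gain in some experiments. Aggressive modality dropout can also be used to prepare a multimodal model for unimodal deployment, substantially improving performance under negative co-learning. Its effect under positive co-learning is less pronounced than under negative co-learning, but it still improves co-learning overall.
Link: https://arxiv.org/abs/2501.00865
Authors: Nicholas Magal,Minh Tran,Riku Arakawa,Suzanne Nie
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: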
Abstract:This paper aims to document an effective way to improve multimodal co-learning by using aggressive modality dropout. We find that by using aggressive modality dropout we are able to reverse negative co-learning (NCL) to positive co-learning (PCL). Aggressive modality dropout can be used to “prep” a multimodal model for unimodal deployment, and dramatically increases model performance during negative co-learning, where during some experiments we saw a 20% gain in accuracy. We also benchmark our modality dropout technique against PCL to show that it improves co-learning during PCL, although the effect is not as substantial as it is during NCL. Github: this https URL
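The general idea of modality dropout can be sketched as follows: during training, one modality's features are zeroed out with some probability so the model cannot lean on them. This is only an illustration under stated assumptions. The choice of audio as the dropped modality, the concatenation fusion, the function name, and the rate of 0.8 are all hypothetical, since the abstract does not specify the paper's exact scheme.

```python
import random

def modality_dropout(text_feat, audio_feat, p_drop=0.8, rng=random):
    """With probability p_drop, zero out the auxiliary (audio) modality.

    p_drop=0.8 is an illustrative "aggressive" rate, not a value from the paper.
    """
    if rng.random() < p_drop:
        audio_feat = [0.0] * len(audio_feat)
    return text_feat + audio_feat  # simple concatenation fusion (assumed)

# Apply the dropout to the same toy sample 1000 times with a fixed seed.
rng = random.Random(0)
fused = [modality_dropout([0.2, 0.7], [0.5, 0.1], rng=rng) for _ in range(1000)]
dropped = sum(1 for f in fused if f[2:] == [0.0, 0.0])
```

With a high drop rate, most training steps see only the text features, which is one plausible reading of how such a model gets prepared for unimodal deployment.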
[NLP-54] DiffETM: Diffusion Process Enhanced Embedded Topic Model ICASSP2025
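[Quick read]: This paper addresses an oversimplification in the traditional Embedded Topic Model (ETM), which assumes that the document-topic distribution follows a logistic normal distribution. That assumption does not match the real document-topic distribution and limits model performance. The paper proposes a novel method that introduces a diffusion process into the sampling of the document-topic distribution, overcoming this limitation while keeping optimization simple. Extensive experiments on two mainstream datasets validate the method's effectiveness in improving topic modeling performance.
Link: https://arxiv.org/abs/2501.00862
Authors: Wei Shao,Mingyang Liu,Linqi Song
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, Accepted by ICASSP 2025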
Abstract:The embedded topic model (ETM) is a widely used approach that assumes the sampled document-topic distribution conforms to the logistic normal distribution for easier optimization. However, this assumption oversimplifies the real document-topic distribution, limiting the model’s performance. In response, we propose a novel method that introduces the diffusion process into the sampling process of document-topic distribution to overcome this limitation and maintain an easy optimization process. We validate our method through extensive experiments on two mainstream datasets, proving its effectiveness in improving topic modeling performance.
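For readers unfamiliar with the baseline assumption, the sketch below draws topic proportions from a logistic normal distribution, that is, a softmax applied to a diagonal Gaussian sample. It illustrates the assumption the abstract criticizes, not the diffusion-based sampler the paper proposes; the function name and toy parameters are hypothetical.

```python
import math
import random

def sample_logistic_normal(mu, sigma, rng=random):
    """Draw topic proportions: softmax of a diagonal Gaussian sample."""
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    z_max = max(z)  # subtract the max for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [v / total for v in exp_z]

# Three topics with toy means and standard deviations.
theta = sample_logistic_normal([0.0, 1.0, -1.0], [0.5, 0.5, 0.5],
                               rng=random.Random(0))
```

Every sample is a valid probability vector over topics, which is what makes this parameterization convenient to optimize.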
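[NLP-55] LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions
[Quick read]: This paper addresses the limitations of large language models (LLMs) on complex action reasoning tasks, which typically require systematic search. Although LLMs have made significant progress on many intelligent tasks, they still perform poorly here. The paper proposes "LLM+AL", a method that combines the strengths of LLMs in natural language understanding and commonsense knowledge generation with the strengths of action languages in automated reasoning over encoded knowledge. The key is to use the LLM for semantic parsing and commonsense knowledge generation while the action language carries out the systematic reasoning. On benchmarks for complex reasoning about actions, LLM+AL outperforms state-of-the-art LLMs and consistently reaches correct answers with relatively few human corrections, whereas standalone LLMs fail to improve even with human feedback. LLM+AL also contributes to the automated generation of action languages.
Link: https://arxiv.org/abs/2501.00830
Authors: Adam Ishay,Joohyung Lee
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 42 pages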
Abstract:Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we propose a method that bridges the natural language understanding capabilities of LLMs with the symbolic reasoning strengths of action languages. Our approach, termed “LLM+AL,” leverages the LLM’s strengths in semantic parsing and commonsense knowledge generation alongside the action language’s proficiency in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that, although all methods exhibit errors, LLM+AL, with relatively minimal human corrections, consistently leads to correct answers, whereas standalone LLMs fail to improve even with human feedback. LLM+AL also contributes to automated generation of action languages.
[NLP-56] Embedding Style Beyond Topics: Analyzing Dispersion Effects Across Different Language Models COLING2025
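[Quick read]: This paper examines how writing style affects the dispersion of embedding vectors across several state-of-the-art language models. Early Transformer models were primarily aligned with topic modeling, whereas this study focuses on the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, the authors compare the sensitivity of French and English language models to style. The key is to analyze the specific impact of style on embedding dispersion in order to better understand how language models process stylistic information, which in turn contributes to their overall interpretability.
Link: https://arxiv.org/abs/2501.00828
Authors: Benjamin Icard,Evangelia Zve,Lila Sainero,Alice Breton,Jean-Gabriel Ganascia
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi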
Abstract:This paper analyzes how writing style affects the dispersion of embedding vectors across multiple, state-of-the-art language models. While early transformer models primarily aligned with topic modeling, this study examines the role of writing style in shaping embedding spaces. Using a literary corpus that alternates between topics and styles, we compare the sensitivity of language models across French and English. By analyzing the particular impact of style on embedding dispersion, we aim to better understand how language models process stylistic information, contributing to their overall interpretability.
[NLP-57] Decoupling Knowledge and Reasoning in Transformers: A Modular Architecture with Generalized Cross-Attention
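[Quick read]: This paper addresses the challenges that the traditional Transformer architecture poses for interpretability, adaptability, and scalability. Because the architecture is monolithic, knowledge and reasoning are hard to separate. The paper proposes a novel modular Transformer architecture that explicitly decouples knowledge from reasoning through a generalized cross-attention mechanism to a shared knowledge base designed for effective knowledge retrieval. The key result is a rigorous mathematical derivation showing that the Feed-Forward Network (FFN) in a standard Transformer is a special case (a closure) of generalized cross-attention, which reveals the FFN's role in implicit knowledge retrieval and supports the proposed design. This theoretical framework offers a new lens for understanding FFNs and lays a foundation for future research on interpretability, adaptability, and scalability, as well as richer interplay with external knowledge bases and other systems.
Link: https://arxiv.org/abs/2501.00823
Authors: Zhenyu Guo,Wenguang Chen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: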
Abstract:Transformers have achieved remarkable success across diverse domains, but their monolithic architecture presents challenges in interpretability, adaptability, and scalability. This paper introduces a novel modular Transformer architecture that explicitly decouples knowledge and reasoning through a generalized cross-attention mechanism to a shared knowledge base, specifically designed for effective knowledge retrieval. Critically, we provide a rigorous mathematical derivation demonstrating that the Feed-Forward Network (FFN) in a standard Transformer is a specialized case (a closure) of this generalized cross-attention, revealing its role in implicit knowledge retrieval and validating our design. This theoretical framework provides a new lens for understanding FFNs and lays the foundation for future research exploring enhanced interpretability, adaptability, and scalability, enabling richer interplay with external knowledge bases and other systems.
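The commonly noted correspondence between an FFN and attention over a fixed key-value memory can be written down in a few lines. The notation below is an assumption made for illustration ($x \in \mathbb{R}^{1 \times d}$, $W_1 \in \mathbb{R}^{d \times m}$, $W_2 \in \mathbb{R}^{m \times d}$, elementwise activation $\sigma$), and the paper's own derivation of generalized cross-attention may differ in its details.

```latex
\begin{aligned}
\mathrm{FFN}(x) &= \sigma(x W_1 + b_1)\, W_2 + b_2, \\
\mathrm{Attn}(q, K, V) &= \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right) V, \\
\mathrm{FFN}(x) &= \sigma\!\left(x K^{\top} + b_1\right) V + b_2
\quad \text{with } K = W_1^{\top},\; V = W_2 .
\end{aligned}
```

Read this way, the rows of $K$ act as fixed keys, the rows of $V$ as fixed values, and the FFN differs from attention in using an elementwise activation in place of the scaled softmax.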
[NLP-58] Reasoning-Oriented and Analogy-Based Methods for Locating and Editing in Zero-Shot Event-Relational Reasoning
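[Quick read]: This paper addresses two main problems with existing methods for zero-shot event-relational reasoning: training prefixes consumes large computational resources and lacks interpretability, and existing methods exploit the connections between tasks inefficiently. It proposes two methods. Reasoning-Oriented Locating and Editing (ROLE) locates and edits the key modules of the language model involved in reasoning about event relations, improving interpretability and reducing computational cost. Analogy-Based Locating and Editing (ABLE) exploits the similarities and differences between tasks to optimize zero-shot reasoning. Experiments show that ROLE improves reasoning performance and interpretability at lower computational cost, and that ABLE achieves state-of-the-art (SOTA) results in zero-shot reasoning.
Link: https://arxiv.org/abs/2501.00803
Authors: Jingyao Tang,Lishuang Li,Liteng Mi,Haiming Wu,Hongbin Lu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: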
Abstract:Zero-shot event-relational reasoning is an important task in natural language processing, and existing methods jointly learn a variety of event-relational prefixes and inference-form prefixes to achieve such tasks. However, training prefixes consumes large computational resources and lacks interpretability. Additionally, learning various relational and inferential knowledge inefficiently exploits the connections between tasks. Therefore, we first propose a method for Reasoning-Oriented Locating and Editing (ROLE), which locates and edits the key modules of the language model for reasoning about event relations, enhancing interpretability and also resource-efficiently optimizing the reasoning ability. Subsequently, we propose a method for Analogy-Based Locating and Editing (ABLE), which efficiently exploits the similarities and differences between tasks to optimize the zero-shot reasoning capability. Experimental results show that ROLE improves interpretability and reasoning performance with reduced computational cost. ABLE achieves SOTA results in zero-shot reasoning.
[NLP-59] Navigating Nuance: In Quest for Political Truth
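[Quick read]: This paper addresses the identification of political bias, in particular the detection of political leaning in media content. It evaluates the Llama-3 (70B) language model on the Media Bias Identification Benchmark (MBIB) using a novel prompting technique that incorporates subtle rationales for identifying political leaning. With this framework, the authors reach performance comparable to the supervised, fully fine-tuned ConvBERT model, the current state of the art for the political bias task on MBIB. The findings also highlight the difficulty of detecting political bias and the potential of transfer learning methods to improve future models. The work contributes to more robust tools for mitigating the spread of misinformation and polarization.
Link: https://arxiv.org/abs/2501.00782
Authors: Soumyadeep Sar,Dwaipayan Roy
Affiliations: unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at JCDL 2024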
Abstract:This study investigates the several nuanced rationales for countering the rise of political bias. We evaluate the performance of the Llama-3 (70B) language model on the Media Bias Identification Benchmark (MBIB), based on a novel prompting technique that incorporates subtle reasons for identifying political leaning. Our findings underscore the challenges of detecting political bias and highlight the potential of transfer learning methods to enhance future models. Through our framework, we achieve a comparable performance with the supervised and fully fine-tuned ConvBERT model, which is the state-of-the-art model, performing best among other baseline models for the political bias task on MBIB. By demonstrating the effectiveness of our approach, we contribute to the development of more robust tools for mitigating the spread of misinformation and polarization. Our codes and dataset are made publicly available in github.
[NLP-60] Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
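[Quick read]: This paper addresses the complex dependencies and the difficulty of validating causal links in long-sequence causal reasoning, in particular the challenge of capturing intricate emotional causality in extended dialogues. Existing large-scale language models such as GPT-4 are limited in handling long-sequence emotional causality. The paper proposes the CauseMotion framework, which is built on Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods that rely only on text, CauseMotion enriches semantic representations with audio-derived features (vocal emotion, emotional intensity, and speech rate) and combines RAG with a sliding window mechanism to retrieve and use contextually relevant dialogue segments, allowing it to infer complex emotional causal chains that span multiple conversational turns. Experiments show that the framework substantially improves the emotional understanding and causal reasoning of large-scale language models and achieves state-of-the-art results on several evaluation metrics.
Link: https://arxiv.org/abs/2501.00778
Authors: Yuxuan Zhang,Yulong Li,Zichen Yu,Feilong Tang,Zhixiang Lu,Chong Li,Kang Dang,Jionglong Su
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 7 pages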
Abstract:Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. A GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
[NLP-61] FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation
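[Quick read]: This paper addresses the automated generation of counterfactual examples in natural language processing (NLP) and explainable artificial intelligence (XAI). Although large language models (LLMs) perform well on many tasks, generating high-quality counterfactual examples remains difficult. The paper proposes two solutions. First, ZeroCF uses important words extracted by feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, the FitCF framework further verifies these counterfactuals through label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming two state-of-the-art baselines. Ablation studies confirm the importance of each core component of FitCF for counterfactual quality, measured by flip rate, perplexity, and similarity. The paper also shows that LIME and Integrated Gradients are effective backbone attribution methods for FitCF, and that the number of demonstrations has the largest effect on performance. Finally, it reveals a strong correlation between the faithfulness of feature attribution scores and the quality of the generated counterfactuals.
Link: https://arxiv.org/abs/2501.00777
Authors: Qianli Wang,Nils Feldhus,Simon Ostermann,Luis Felipe Villa-Arenas,Sebastian Möller,Vera Schmitt
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: In submission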
Abstract:Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming two state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF’s core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals.
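Of the evaluation measures named in the abstract, flip rate is the simplest to state: the fraction of counterfactuals for which the classifier's predicted label differs from its prediction on the original input. A minimal sketch follows, assuming the labels have already been predicted by some classifier; the function name and the toy labels are hypothetical, and the paper may define the measure with additional details.

```python
def flip_rate(original_labels, counterfactual_labels):
    """Fraction of counterfactuals whose predicted label differs from the original."""
    if len(original_labels) != len(counterfactual_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        return 0.0
    flips = sum(o != c for o, c in zip(original_labels, counterfactual_labels))
    return flips / len(original_labels)

# Toy predictions: three of the four counterfactuals change the label.
rate = flip_rate(["pos", "neg", "pos", "neg"], ["neg", "neg", "neg", "pos"])
```

The same comparison, applied per example instead of in aggregate, is one way to read the label flip verification step: a candidate is kept only if its predicted label actually changes.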
[NLP-62] Enhancing Transformers for Generalizable First-Order Logical Entailment
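[Quick read]: This paper studies and improves the generalizable first-order logical reasoning ability of Transformer models. It assesses this ability through first-order logical entailment, measured by performance on knowledge graph query answering. The key contributions are: (1) connecting the distribution shifts studied in out-of-distribution generalization with the unseen knowledge and query settings of knowledge graph query answering, which allows fine-grained generalizability to be characterized; (2) analyzing experimentally how input query syntax, token embedding, and Transformer architecture affect reasoning ability, and finding a mismatch between positional encoding and other design choices in previously used Transformer architectures; and (3) proposing TEGA, a more sophisticated, logic-aware architecture that strengthens Transformers on generalizable first-order logical entailment.
Link: https://arxiv.org/abs/2501.00759
Authors: Tianshi Zheng,Jiazheng Wang,Zihao Wang,Jiaxin Bai,Hang Yin,Zheye Deng,Yangqiu Song,Jianxin Li
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages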
Abstract:Transformers, as a fundamental deep learning architecture, have demonstrated remarkable capabilities in reasoning. This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and explores ways to improve it. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) the unseen knowledge and query settings discussed in the task of knowledge graph query answering, enabling a characterization of fine-grained generalizability. Results on our comprehensive dataset show that transformers outperform previous methods specifically designed for this task and provide detailed empirical evidence on the impact of input query syntax, token embedding, and transformer architectures on the reasoning capability of transformers. Interestingly, our findings reveal a mismatch between positional encoding and other design choices in transformer architectures employed in prior practices. This discovery motivates us to propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
[NLP-63] DIVE: Diversified Iterative Self-Improvement
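[Quick read]: This paper addresses the loss of output diversity that occurs when large language models (LLMs) are trained continuously on self-generated data during Iterative Self-Improvement (ISI). The problem is especially critical in reasoning tasks, where diverse solution paths are essential. The proposed solution is the DIVE (Diversified Iterative Self-Improvement) framework, which has two core components: Sample Pool Expansion, for broader exploration of solutions, and Data Selection, for balancing diversity and quality in preference pairs. Experiments on the MATH and GSM8k datasets show that, compared with vanilla ISI, DIVE achieves a 10% to 45% relative increase in output diversity metrics while maintaining performance quality. Ablation studies confirm the importance of both components.
Link: https://arxiv.org/abs/2501.00747
Authors: Yiwei Qin,Yixiu Liu,Pengfei Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: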
Abstract:Recent advances in large language models (LLMs) have demonstrated the effectiveness of Iterative Self-Improvement (ISI) techniques. However, continuous training on self-generated data leads to reduced output diversity, a limitation particularly critical in reasoning tasks where diverse solution paths are essential. We present DIVE (Diversified Iterative Self-Improvement), a novel framework that addresses this challenge through two key components: Sample Pool Expansion for broader solution exploration, and Data Selection for balancing diversity and quality in preference pairs. Experiments on MATH and GSM8k datasets show that DIVE achieves a 10% to 45% relative increase in output diversity metrics while maintaining performance quality compared to vanilla ISI. Our ablation studies confirm both components’ significance in achieving these improvements. Code is available at this https URL.
[NLP-64] Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines
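[Quick read]: This paper addresses ranking manipulation attacks against search engines based on large language models (LLMs). In these attacks, adversaries craft webpage content to manipulate the LLM's ranking and unfairly promote specific content. The paper models the problem as an Infinitely Repeated Prisoners' Dilemma in which multiple players choose between cooperating and attacking. It identifies the key factors that influence player behavior, including attack costs, discount rates, attack success rates, and trigger strategies, and examines the conditions under which cooperation can be sustained. Cooperation is more likely to persist when players are forward-looking. From a defense perspective, however, simply reducing the attack success probability can in some cases incentivize attacks, and defensive measures that cap the attack success rate may be ineffective in some scenarios. These findings highlight the complexity of securing LLM-based systems, provide a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, and underline the importance of adaptive security strategies and thoughtful ecosystem design.
Link: https://arxiv.org/abs/2501.00745
Authors: Xiyang Hu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Theoretical Economics (econ.TH)
Comments: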
Abstract:The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM’s ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners’ Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
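The claim that cooperation is easier to sustain when players are forward-looking has a familiar textbook form. In a standard two-player repeated Prisoner's Dilemma under a grim trigger strategy, cooperation is an equilibrium when the discount factor is at least (T - R) / (T - P), where T, R, and P are the temptation, reward, and punishment payoffs. The sketch below computes that threshold. It is the textbook condition, not the paper's model, which additionally involves attack costs and success rates; the payoff values are illustrative.

```python
def min_discount_for_cooperation(temptation, reward, punishment):
    """Smallest discount factor at which grim trigger sustains cooperation.

    Textbook condition: reward/(1-d) >= temptation + d*punishment/(1-d),
    which rearranges to d >= (temptation - reward)/(temptation - punishment).
    """
    return (temptation - reward) / (temptation - punishment)

def cooperation_sustainable(discount, temptation=5.0, reward=3.0, punishment=1.0):
    """True if players this patient prefer cooperating to a one-off attack."""
    return discount >= min_discount_for_cooperation(temptation, reward, punishment)

# Illustrative payoffs T=5, R=3, P=1 (not taken from the paper).
threshold = min_discount_for_cooperation(5.0, 3.0, 1.0)
```

A higher discount factor means future payoffs matter more, so the one-time gain from attacking is outweighed by the loss of all future cooperation.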
[NLP-65] On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages
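[Quick read]: This paper examines the effectiveness of layer pruning for building more efficient BERT models for low-resource languages, evaluating whether pruned BERT models can maintain high performance while reducing model size and complexity. The authors apply different pruning strategies to several BERT variants, including MahaBERT-v2 and Google-Muril, and compare them with smaller models trained from scratch, such as MahaBERT-Small and MahaBERT-Smaller. Despite having fewer layers, the pruned models perform comparably to their full-depth counterparts and consistently outperform scratch-trained models of similar size. Pruning from the middle of the model is reported as the most effective strategy, with performance competitive with pruning from the top and bottom, though no strategy is a clear winner across all model and dataset combinations. Monolingual BERT models also outperform multilingual ones in these experiments. By reducing computational demands, the approach offers a faster and more efficient alternative to training small models from scratch, making advanced NLP models more accessible for low-resource languages without sacrificing classification accuracy.
Link: https://arxiv.org/abs/2501.00733
Authors: Mayur Shirke,Amey Shembade,Madhushri Wagh,Pavan Thorat,Raviraj Joshi
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at FIRE 2024: 16th meeting of the Forum for Information Retrieval Evaluation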
Abstract:This study explores the effectiveness of layer pruning for developing more efficient BERT models tailored to specific downstream tasks in low-resource languages. Our primary objective is to evaluate whether pruned BERT models can maintain high performance while reducing model size and complexity. We experiment with several BERT variants, including MahaBERT-v2 and Google-Muril, applying different pruning strategies and comparing their performance to smaller, scratch-trained models like MahaBERT-Small and MahaBERT-Smaller. We fine-tune these models on Marathi datasets, specifically Short Headlines Classification (SHC), Long Paragraph Classification (LPC) and Long Document Classification (LDC), to assess their classification accuracy. Our findings demonstrate that pruned models, despite having fewer layers, achieve comparable performance to their fully-layered counterparts while consistently outperforming scratch-trained models of similar size. Notably, pruning layers from the middle of the model proves to be the most effective strategy, offering performance competitive with pruning from the top and bottom. However, there is no clear winner, as different pruning strategies perform better in different model and dataset combinations. Additionally, monolingual BERT models outperform multilingual ones in these experiments. This approach, which reduces computational demands, provides a faster and more efficient alternative to training smaller models from scratch, making advanced NLP models more accessible for low-resource languages without compromising classification accuracy.
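The three pruning strategies mentioned in the abstract differ only in which layer indices are removed. The sketch below computes the retained indices for each strategy. It assumes that "top" means the layers closest to the output, "bottom" the layers closest to the input, and "middle" a contiguous centred block; the function name and these conventions are assumptions, not details confirmed by the abstract.

```python
def layers_to_keep(num_layers, num_prune, strategy):
    """Return indices of layers retained after pruning num_prune layers.

    strategy: "top" removes the last layers, "bottom" the first,
    "middle" a contiguous block centred in the stack (assumed conventions).
    """
    if not 0 <= num_prune < num_layers:
        raise ValueError("num_prune must be in [0, num_layers)")
    if strategy == "top":
        removed = range(num_layers - num_prune, num_layers)
    elif strategy == "bottom":
        removed = range(0, num_prune)
    elif strategy == "middle":
        start = (num_layers - num_prune) // 2
        removed = range(start, start + num_prune)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    removed = set(removed)
    return [i for i in range(num_layers) if i not in removed]

# Prune 4 of 12 layers from the middle of the stack.
kept = layers_to_keep(12, 4, "middle")
```

In practice the retained indices would be used to select encoder layers from a pretrained checkpoint before fine-tuning, a step that depends on the model library and is not shown here.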
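[NLP-66] eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
[Quick read]: This paper addresses how an automated writing evaluation (AWE) system can help students revise their essays in response to feedback and thereby improve their writing. The proposed solution is eRevise+RF, a system that assesses the quality of student essay revisions (for example, changes made to improve an essay in response to feedback) and provides revision feedback. The system assesses how students use evidence in their essays, extracts evidence and reasoning revisions across essays, and determines whether a revision successfully responds to the feedback. A deployment in real classrooms confirmed its effectiveness in helping students improve their argumentative writing.
Link: https://arxiv.org/abs/2501.00715
Authors: Zhexiong Liu,Diane Litman,Elaine Wang,Tianwen Li,Mason Gobat,Lindsay Clare Matsumura,Richard Correnti
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: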
Abstract:The ability to revise essays in response to feedback is important for students’ writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.
[NLP-67] CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages COLING
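[Quick read]: This paper addresses robust counterspeech generation against hate speech in multilingual settings, with a focus on low-resource languages. The key to the solution is a context-aware model that uses a simulated annealing algorithm fine-tuned on multilingual datasets to generate factually accurate responses. The system performs well in all four languages (Basque, English, Italian, and Spanish) and is strongest in Basque, where it took all three top positions. Evaluation used traditional metrics (BLEU, ROUGE, BERTScore, Novelty) together with the LLM-based JudgeLM.
Link: https://arxiv.org/abs/2501.00713
Authors: Michael Bennie, Bushi Xiao, Chryseis Xinyi Liu, Demi Zhang, Jian Meng, Alayo Tripp
Affiliations: Unknown
Categories: Computation and Language (cs.CL)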
Comments: to be published in MCG-COLING’s 2025 conference proceedings; submitted 1 Jan 2025 (v1)
Abstract:This paper introduces a context-aware model for robust counterspeech generation, which achieved significant success in the MCG-COLING-2025 shared task. Our approach particularly excelled in low-resource language settings. By leveraging a simulated annealing algorithm fine-tuned on multilingual datasets, the model generates factually accurate responses to hate speech. We demonstrate state-of-the-art performance across four languages (Basque, English, Italian, and Spanish), with our system ranking first for Basque, second for Italian, and third for both English and Spanish. Notably, our model swept all three top positions for Basque, highlighting its effectiveness in low-resource scenarios. Evaluation of the shared task employs both traditional metrics (BLEU, ROUGE, BERTScore, Novelty) and JudgeLM based on LLM. We present a detailed analysis of our results, including an empirical evaluation of the model performance and comprehensive score distributions across evaluation metrics. This work contributes to the growing body of research on multilingual counterspeech generation, offering insights into developing robust models that can adapt to diverse linguistic and cultural contexts in the fight against online hate speech.
[NLP-68] Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
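[Quick read]: This paper addresses the limitations of existing positional encoding techniques in Transformers: many methods impose rigid patterns that limit the modeling of long-range dependencies and adaptation to different tasks, and most positional encodings are learned as general biases without specialization for individual instances in a dataset. The paper proposes TAPE (conTextualized equivariAnt Position Embedding), a framework that introduces dynamic, context-aware positional encodings. Its key idea is to generate positional encodings from sequence content while enforcing permutation and orthogonal equivariance, which keeps the encodings stable during updates and improves robustness and adaptability. TAPE can be integrated into pre-trained Transformers for parameter-efficient fine-tuning and outperforms existing positional encoding techniques on language modeling, arithmetic reasoning, and long-context retrieval tasks.
Link: https://arxiv.org/abs/2501.00712
Authors: Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code is available at this https URL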
Abstract:Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Embedding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
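[NLP-69] PANDA – Paired Anti-hate Narratives Dataset from Asia: Using an LLM-as-a-Judge to Create the First Chinese Counterspeech Dataset COLING2025
[Quick read]: This paper addresses the scarcity of counterspeech (CS) resources for Modern Standard Chinese, specifically for combating hate speech (HS) in Mainland China. Despite the language's global prevalence, such resources are virtually nonexistent. The proposed approach generates counterspeech using an LLM-as-a-Judge, simulated annealing, zero-shot CN generation by LLMs, and a round-robin algorithm, followed by manual verification for quality and contextual relevance. The paper details how to create effective counterspeech in Chinese and accounts for cultural and linguistic patterns specific to non-Eurocentric languages, in particular which groups are maligned and which discourse markers are programmatically marked as hate speech. Analysis of the generated corpus provides strong evidence that open-source, properly labeled Chinese hate speech data is lacking and shows the limitations of using an LLM as a judge for scoring in Chinese. The corpus is the first counterspeech corpus based on an East Asian language and a resource for future research on counterspeech generation and evaluation.
Link: https://arxiv.org/abs/2501.00697
Authors: Michael Bennie, Demi Zhang, Bushi Xiao, Jing Cao, Chryseis Xinyi Liu, Jian Meng, Alayo Tripp
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: to be published in MCG-COLING 2025’s conference proceedings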
Abstract:Despite the global prevalence of Modern Standard Chinese language, counterspeech (CS) resources for Chinese remain virtually nonexistent. To address this gap in East Asian counterspeech research we introduce a corpus of Modern Standard Mandarin counterspeech that focuses on combating hate speech in Mainland China. This paper proposes a novel approach of generating CS by using an LLM-as-a-Judge, simulated annealing, LLMs zero-shot CN generation and a round-robin algorithm. This is followed by manual verification for quality and contextual relevance. This paper details the methodology for creating effective counterspeech in Chinese and other non-Eurocentric languages, including unique cultural patterns of which groups are maligned and linguistic patterns in what kinds of discourse markers are programmatically marked as hate speech (HS). Through analysis of the generated corpora, we provide strong evidence for the lack of open-source, properly labeled Chinese hate speech data and the limitations of using an LLM-as-Judge to score possible answers in Chinese. Moreover, the present corpus serves as the first East Asian language based CS corpus and provides an essential resource for future research on counterspeech generation and evaluation.
[NLP-70] Adjoint sharding for very long context training of state space models
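[Quick read]: This paper addresses the challenge of efficiently training large language models (LLMs) on very long contexts. Existing methods typically train with short contexts (at most a few thousand tokens) and rely on inference-time techniques for long contexts (context windows above 1M tokens). Training on very long inputs is constrained by GPU memory and prohibitively long training times, yet many real applications need long-context training or fine-tuning on specific tasks. The paper proposes adjoint sharding, which shards the gradient calculation during training to sharply reduce memory requirements and make very-long-context training tractable. The technique is based on the adjoint method and computes gradients equivalent to backpropagation; a truncated variant speeds up the algorithm while maintaining performance, and distributed and parallel versions speed up training further. Experiments show up to 3X lower memory usage for a 1.27B-parameter model at 1M context length, which raises the maximum training context length from 35K tokens to above 100K tokens on five AWS P4 instances.
Link: https://arxiv.org/abs/2501.00692
Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: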
Abstract:Despite very fast progress, efficiently training large language models (LLMs) in very long contexts remains challenging. Existing methods fall back to training LLMs with short contexts (a maximum of a few thousands tokens in training) and use inference time techniques when evaluating on long contexts (above 1M tokens context window at inference). As opposed to long-context-inference, training on very long context input prompts is quickly limited by GPU memory availability and by the prohibitively long training times it requires on state-of-the-art hardware. Meanwhile, many real-life applications require not only inference but also training/fine-tuning with long context on specific tasks. Such applications include, for example, augmenting the context with various sources of raw reference information for fact extraction, fact summarization, or fact reconciliation tasks. We propose adjoint sharding, a novel technique that comprises sharding gradient calculation during training to reduce memory requirements by orders of magnitude, making training on very long context computationally tractable. Adjoint sharding is based on the adjoint method and computes equivalent gradients to backpropagation. We also propose truncated adjoint sharding to speed up the algorithm while maintaining performance. We provide a distributed version, and a paralleled version of adjoint sharding to further speed up training. Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B parameter large language model on 1M context length training. This allows to increase the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
[NLP-71] Labels Generated by Large Language Model Helps Measuring Peoples Empathy in Vitro
[Quick read]: This paper addresses noisy labels in crowdsourced datasets for empathy computing, which can misrepresent the underlying empathy. The key idea is to use labels generated by large language models (LLMs) to support the supervised training of mainstream models in two ways: (1) noisy label correction, where LLM-generated labels correct erroneous labels in existing datasets; and (2) training data augmentation with LLM-generated labels. Training pre-trained language models (PLMs) such as RoBERTa this way yields statistically significant accuracy improvements over baselines and a state-of-the-art Pearson correlation coefficient of 0.648 on the NewsEmp benchmarks.
Link: https://arxiv.org/abs/2501.00691
Authors: Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Large language models (LLMs) have revolutionised numerous fields, with LLM-as-a-service (LLMSaaS) having a strong generalisation ability that offers accessible solutions directly without the need for costly training. In contrast to the widely studied prompt engineering for task solving directly (in vivo), this paper explores its potential in in-vitro applications. These involve using LLM to generate labels to help the supervised training of mainstream models by (1) noisy label correction and (2) training data augmentation with LLM-generated labels. In this paper, we evaluate this approach in the emerging field of empathy computing – automating the prediction of psychological questionnaire outcomes from inputs like text sequences. Specifically, crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. By leveraging LLM-generated labels to train pre-trained language models (PLMs) like RoBERTa, we achieve statistically significant accuracy improvements over baselines, achieving a state-of-the-art Pearson correlation coefficient of 0.648 on NewsEmp benchmarks. In addition, we bring insightful discussions, including current challenges in empathy computing, data biases in training data and evaluation metric selection. Code and LLM-generated data are available at this https URL (available once the paper is accepted).
[NLP-72] IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently
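[Quick read]: This paper addresses the poor performance of modern large language models (LLMs) on arithmetic tasks, which they find difficult even though arithmetic is a basic skill. The proposed solution is the Integrated Gated Calculator (IGC), a module that emulates a calculator on the GPU so that the LLM performs arithmetic internally, without producing intermediate tokens or relying on external tools. IGC is computationally efficient and interpretable, and it avoids side effects on tasks that do not require arithmetic. A Llama model fine-tuned with IGC beats the state of the art on the BigBench Arithmetic benchmark, including models almost two orders of magnitude larger, and reaches 98% to 99% accuracy on all subtasks, including the previously unsolved multiplication subtask.
Link: https://arxiv.org/abs/2501.00684
Authors: Florian Dietz, Dietrich Klakow
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: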
Abstract:Solving arithmetic tasks is a simple and fundamental skill, yet modern Large Language Models (LLMs) have great difficulty with them. We introduce the Integrated Gated Calculator (IGC), a module that enables LLMs to perform arithmetic by emulating a calculator on the GPU. We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it beats the State of the Art, outperforming all models on the benchmark, including models almost two orders of magnitude larger. Our approach takes only a single iteration to run and requires no external tools. It performs arithmetic operations entirely inside the LLM without the need to produce intermediate tokens. It is computationally efficient, interpretable, and avoids side-effects on tasks that do not require arithmetic operations. It reliably achieves 98% to 99% accuracy across multiple training runs and for all subtasks, including the substantially harder subtask of multiplication, which was previously unsolved.
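[NLP-73] Titans: Learning to Memorize at Test Time
[Quick read]: This paper addresses how to model dependencies effectively over long context windows. Recurrent models compress data into a fixed-size hidden state, while attention captures direct dependencies among all tokens but at a cost that grows quadratically with context length, which limits models to a fixed-length context. The paper proposes a neural long-term memory module that learns to memorize historical context and helps attention attend to the current context while using information from the distant past. The module supports fast, parallelizable training and fast inference. Building on it, the paper introduces a family of architectures called Titans, with three variants that explore how to incorporate memory into the architecture. Experiments show that Titans are more effective than Transformers and modern linear recurrent models on language modeling, common-sense reasoning, genomics, and time series tasks, and that they scale to context windows larger than 2M tokens with higher accuracy on needle-in-haystack tasks than baselines.
Link: https://arxiv.org/abs/2501.00663
Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: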
Abstract:Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.
[NLP-74] Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? Revisiting a Petroglyph
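[Quick read]: This paper examines whether autoregressive Transformer language models need explicit positional encodings (PEs). For multi-layer autoregressive Transformers they are not required, because such models can distinguish input sequences with permuted tokens. This property has been known since GPT-2-era work but was not widely disseminated and was even rediscovered recently. The paper revisits and explains the phenomenon: multi-layer models capture order information without explicit PEs, whereas one-layer models need PEs to discern the order of their input tokens. By reviewing this long-neglected explanation, the paper aims to re-establish the result as common knowledge in the field.
Link: https://arxiv.org/abs/2501.00659
Authors: Kazuki Irie
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: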
Abstract:Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is “no” as long as they have more than one layer – they can distinguish sequences with permuted tokens without requiring explicit PEs. This property has been known since early efforts (those contemporary with GPT-2) adopting the Transformer for language modeling. However, this result does not appear to have been well disseminated and was even rediscovered recently. This may be partially due to a sudden growth of the language modeling community after the advent of GPT-2, but perhaps also due to the lack of a clear explanation in prior publications, despite being commonly understood by practitioners in the past. Here we review this long-forgotten explanation why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models require PEs to discern order information of their input tokens). We also review the origin of this result, and hope to re-establish it as a common knowledge.
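A toy computation can make the one-layer versus multi-layer contrast concrete. The sketch below is a deliberate simplification and not the paper's construction: it assumes uniform attention weights (each position averages over itself and all earlier positions), one-hot token embeddings, and no learned projections or residual connections. The names `embed` and `causal_mean_layer` are hypothetical.

```python
from fractions import Fraction


def embed(tokens, vocab):
    """One-hot embeddings; no positional information is added."""
    return [[Fraction(int(tok == v)) for v in vocab] for tok in tokens]


def causal_mean_layer(xs):
    """Uniform causal attention: position t outputs the mean of inputs 0..t."""
    running = [Fraction(0)] * len(xs[0])
    out = []
    for t, x in enumerate(xs, start=1):
        running = [r + v for r, v in zip(running, x)]
        out.append([r / t for r in running])
    return out


vocab = "abc"
seq_1 = embed("abc", vocab)
seq_2 = embed("bac", vocab)  # same tokens, first two swapped

one_layer_1 = causal_mean_layer(seq_1)
one_layer_2 = causal_mean_layer(seq_2)
two_layer_1 = causal_mean_layer(one_layer_1)
two_layer_2 = causal_mean_layer(one_layer_2)

# After one layer the last position only sees the multiset of tokens,
# so swapping the two earlier tokens leaves its output unchanged.
print(one_layer_1[-1] == one_layer_2[-1])
# After two layers the last position averages prefix summaries, which
# depend on order, so the two sequences become distinguishable.
print(two_layer_1[-1] == two_layer_2[-1])
```

In this toy setting the causal mask is the only source of order information: each first-layer output summarizes a different prefix, and the second layer mixes those prefix summaries.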
[NLP-75] 2 OLMo 2 Furious
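[Quick read]: This paper aims to improve the performance and efficiency of the fully open language model OLMo and to address shortcomings in training stability, compute efficiency, and downstream performance. The solution has three parts. First, a modified architecture and training recipe improve training stability and per-token efficiency. Second, a new pretraining data mix, Dolmino Mix 1124, significantly improves downstream performance when introduced through late-stage curriculum training. Third, best practices from Tülu 3 are used to develop OLMo 2-Instruct, with a focus on permissive data and reinforcement learning with verifiable rewards (RLVR). With these changes, OLMo 2 sits at the Pareto frontier of performance versus compute, often matching or outperforming open-weight models such as Llama 3.1 and Qwen 2.5, while keeping its training data, code, and recipe fully transparent.
Link: https://arxiv.org/abs/2501.00656
Authors: Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Model demo available at this http URL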
Abstract:We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly – models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.
[NLP-76] ICONS: Influence Consensus for Vision-Language Data Selection
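[Quick read]: This paper addresses the large amount of vision-language training data that visual instruction tuning requires, much of it redundant, which raises computational cost without a proportional performance gain. The authors propose ICONS, a gradient-driven Influence CONsensus approach to vision-language data Selection that picks a compact training set for efficient multi-task training. Its core is cross-task influence consensus: majority voting across task-specific influence matrices identifies samples that are consistently valuable across tasks, so that data optimizing overall performance is prioritized. Models trained on the selected data (20% of LLaVA-665K) reach 98.6% of the relative performance obtained with the full dataset. The authors also release LLaVA-ICONS-133K, a compact, informative subset of LLaVA-665K that preserves high-impact training data.
Link: https://arxiv.org/abs/2501.00654
Authors: Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, 19 figures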
Abstract:Visual Instruction Tuning typically requires a large amount of vision-language training data. This data often contains redundant information that increases computational costs without proportional performance gains. In this work, we introduce ICONS, a gradient-driven Influence CONsensus approach for vision-language data Selection that selects a compact training dataset for efficient multi-task training. The key element of our approach is cross-task influence consensus, which uses majority voting across task-specific influence matrices to identify samples that are consistently valuable across multiple tasks, allowing us to effectively prioritize data that optimizes for overall performance. Experiments show that models trained on our selected data (20% of LLaVA-665K) achieve 98.6% of the relative performance obtained using the full dataset. Additionally, we release this subset, LLaVA-ICONS-133K, a compact yet highly informative subset of LLaVA-665K visual instruction tuning data, preserving high impact training data for efficient vision-language model development.
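The cross-task voting step described in the abstract can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: the abstract does not say how a per-task vote is cast, so here each task simply votes for its `votes_per_task` highest-influence samples, and the function name `consensus_select`, its parameters, and the toy scores are all hypothetical.

```python
def consensus_select(influence, votes_per_task, budget):
    """Pick `budget` samples by majority voting across tasks.

    influence[t][i] is a (hypothetical) influence score of training
    sample i on task t. Assumption: each task votes for its top
    `votes_per_task` samples; ties in vote count go to the lower index.
    """
    num_samples = len(influence[0])
    votes = [0] * num_samples
    for task_scores in influence:
        ranked = sorted(range(num_samples),
                        key=lambda i: task_scores[i], reverse=True)
        for i in ranked[:votes_per_task]:
            votes[i] += 1
    order = sorted(range(num_samples), key=lambda i: (-votes[i], i))
    return order[:budget], votes


if __name__ == "__main__":
    # Three tasks, four candidate training samples (toy numbers).
    influence = [
        [0.9, 0.1, 0.8, 0.2],
        [0.7, 0.6, 0.1, 0.2],
        [0.8, 0.1, 0.9, 0.3],
    ]
    selected, votes = consensus_select(influence, votes_per_task=2, budget=2)
    print(selected, votes)
```

In the toy data, sample 0 is in every task's top two and sample 2 in two of three, so those two are kept; a sample that scores highly for only one task (sample 1) loses out to samples that are useful more broadly.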
[NLP-77] Efficient Standardization of Clinical Notes using Large Language Models
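[Quick read]: This paper addresses inconsistencies in clinician notes, including varied writing styles, colloquialisms, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder the extraction of meaningful data from electronic health records (EHRs) and affect quality improvement, population health, precision medicine, decision support, and research. The paper presents a large language model approach to standardizing 1,618 clinical notes. On average per note, standardization corrected 4.9 grammatical errors and 3.3 spelling errors, converted 3.1 non-standard terms to standard terminology, and expanded 15.8 abbreviations and acronyms. Notes were also reorganized into canonical sections with standardized headings. This prepares notes for key concept extraction, mapping to medical ontologies, and conversion to interoperable data formats such as FHIR. Expert review of randomly sampled notes found no significant data loss. This proof-of-concept study shows that standardization can improve the readability, consistency, and usability of clinical notes and ease their conversion to interoperable data formats.
Link: https://arxiv.org/abs/2501.00644
Authors: Daniel B. Hier, Michael D. Carrithers, Thanh Son Do, Tayo Obafemi-Ajayi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)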
Comments: MSC classes: 92; ACM classes: J.3; I.2
Abstract:Clinician notes are a rich source of patient information but often contain inconsistencies due to varied writing styles, colloquialisms, abbreviations, medical jargon, grammatical errors, and non-standard formatting. These inconsistencies hinder the extraction of meaningful data from electronic health records (EHRs), posing challenges for quality improvement, population health, precision medicine, decision support, and research. We present a large language model approach to standardizing a corpus of 1,618 clinical notes. Standardization corrected an average of 4.9 +/- 1.8 grammatical errors, 3.3 +/- 5.2 spelling errors, converted 3.1 +/- 3.0 non-standard terms to standard terminology, and expanded 15.8 +/- 9.1 abbreviations and acronyms per note. Additionally, notes were re-organized into canonical sections with standardized headings. This process prepared notes for key concept extraction, mapping to medical ontologies, and conversion to interoperable data formats such as FHIR. Expert review of randomly sampled notes found no significant data loss after standardization. This proof-of-concept study demonstrates that standardization of clinical notes can improve their readability, consistency, and usability, while also facilitating their conversion into interoperable data formats.
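[NLP-78] Toward Corpus Size Requirements for Training and Evaluating Depression Risk Models Using Spoken Language
[Quick read]: This paper studies how training and test set sizes affect model performance in mental health risk prediction. Using more than 65K labeled data points, it evaluates different train/test size combinations in a fully crossed design for two model types, one based on language (NLP) and one on speech acoustics. Key findings: (1) test sets below 1K samples give noisy results, even with larger training sets; (2) training sets of at least 2K samples are needed for stable results; (3) NLP and acoustic models behave similarly as train/test sizes vary; and (4) an age-mismatched test set shows the same patterns as the matched test set. The paper also discusses label priors, model strength and pre-training, unique speakers, and data lengths, and it concludes that appropriately sized train and test sets are needed in mental health risk prediction research.
Link: https://arxiv.org/abs/2501.00617
Authors: Tomek Rutowski, Amir Harati, Elizabeth Shriberg, Yang Lu, Piotr Chlebek, Ricardo Oliveira
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: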
Abstract:Mental health risk prediction is a growing field in the speech community, but many studies are based on small corpora. This study illustrates how variations in test and train set sizes impact performance in a controlled study. Using a corpus of over 65K labeled data points, results from a fully crossed design of different train/test size combinations are provided. Two model types are included: one based on language and the other on speech acoustics. Both use methods current in this domain. An age-mismatched test set was also included. Results show that (1) test sizes below 1K samples gave noisy results, even for larger training set sizes; (2) training set sizes of at least 2K were needed for stable results; (3) NLP and acoustic models behaved similarly with train/test size variations, and (4) the mismatched test set showed the same patterns as the matched test set. Additional factors are discussed, including label priors, model strength and pre-training, unique speakers, and data lengths. While no single study can specify exact size requirements, results demonstrate the need for appropriately sized train and test sets for future studies of mental health risk prediction from speech and language.
[NLP-79] Optimizing Speech-Input Length for Speaker-Independent Depression Classification
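[Quick read]: This paper studies how speech-input length affects model performance in speech-based depression classification, a question that remains poorly understood despite growing work in the area. Using a corpus of more than 1400 hours of speech, it analyzes speaker-independent depression classification with two natural language processing (NLP) systems that differ in overall performance. Performance depends on natural length, elapsed length, and the ordering of the response within a session. The two systems share a minimum length threshold but differ in their response saturation threshold, which is higher for the better system. At saturation, posing a new question is better for classification than continuing the current response. The findings suggest how applications can be designed to elicit and process optimal input lengths for depression classification.
Link: https://arxiv.org/abs/2501.00608
Authors: Tomasz Rutowski, Amir Harati, Yang Lu, Elizabeth Shriberg
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)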
Comments: Journal reference: Proceedings Interspeech, 2019
Abstract:Machine learning models for speech-based depression classification offer promise for health care applications. Despite growing work on depression classification, little is understood about how the length of speech-input impacts model performance. We analyze results for speaker-independent depression classification using a corpus of over 1400 hours of speech from a human-machine health screening application. We examine performance as a function of response input length for two NLP systems that differ in overall performance. Results for both systems show that performance depends on natural length, elapsed length, and ordering of the response within a session. Systems share a minimum length threshold, but differ in a response saturation threshold, with the latter higher for the better system. At saturation it is better to pose a new question to the speaker, than to continue the current response. These and additional reported results suggest how applications can be better designed to both elicit and process optimal input lengths for depression classification.
[NLP-80] “Dialogue” vs “Dialog” in NLP and AI research: Statistics from a Confused Discourse
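[Quick read]: This paper examines the use of two spellings, "dialogue" and "dialog", in computing research and the reasons behind the split. Analyzing thousands of research papers, the author finds that among top-venue publications using the term in the title or abstract, 72% use "dialogue", 24% use "dialog", and 5% use both in the same title and abstract. This split is more common in computing than in any other academic discipline. Over roughly 20 years of natural language processing (NLP) and artificial intelligence (AI) research, there is no clear evidence of a shift over time. Author nationality is weakly correlated with spelling choice but does not explain the mixed use. Methods such as syntactic parses and language model embeddings show that context has limited influence on spelling. Drawing these results together, the paper discusses theories that might explain the divergence.
Link: https://arxiv.org/abs/2501.00598
Authors: David Gros
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: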
Abstract:Within computing research, there are two spellings for an increasingly important term - dialogue and dialog. We analyze thousands of research papers to understand this “dialog(ue) debacle”. Among publications in top venues that use “dialog(ue)” in the title or abstract, 72% use “dialogue”, 24% use “dialog”, and 5% use both in the same title and abstract. This split distribution is more common in Computing than any other academic discipline. We investigate trends over ~20 years of NLP/AI research, not finding clear evidence of a shift over time. Author nationality is weakly correlated with spelling choice, but far from explains the mixed use. Many prolific authors publish papers with both spellings. We use several methods (such as syntactic parses and LM embeddings) to study how dialog(ue) context influences spelling, finding limited influence. Combining these results together, we discuss different theories that might explain the dialog(ue) divergence.
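The headline statistic (shares of papers using "dialogue", "dialog", or both) comes down to classifying each title-plus-abstract string by the spellings it contains. The sketch below shows one way such a count could be made; it is an assumption-laden illustration, not the author's pipeline, and the names `classify_spelling` and `spelling_shares` as well as the regular expression are hypothetical. The reported figures (72%, 24%, 5%) sum to 101%, presumably because of rounding.

```python
import re
from collections import Counter

# Matches dialog, dialogs, dialogue, dialogues as whole words.
_PATTERN = re.compile(r"\bdialog(ue)?s?\b", re.IGNORECASE)


def classify_spelling(text):
    """Return 'dialogue', 'dialog', 'both', or None if the term is absent."""
    found = set()
    for match in _PATTERN.finditer(text):
        found.add("dialogue" if match.group(1) else "dialog")
    if not found:
        return None
    if len(found) == 2:
        return "both"
    return found.pop()


def spelling_shares(texts):
    """Share of each label among the texts that use the term at all."""
    counts = Counter(
        label for label in map(classify_spelling, texts) if label is not None
    )
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


if __name__ == "__main__":
    papers = [
        "Task-oriented dialogue systems",
        "A Dialog Agent for Booking",
        "Dialog state tracking in spoken dialogue",
        "Neural machine translation",
    ]
    print([classify_spelling(p) for p in papers])
    print(spelling_shares(papers))
```

Texts that never mention the term are excluded from the denominator, matching the abstract's framing of "publications that use dialog(ue) in the title or abstract".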
[NLP-81] Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation
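[Quick read]: This paper addresses the difficulty of evaluating how well large language models (LLMs) understand and generate language in resource-limited languages such as Turkish. The authors introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive framework for assessing the linguistic and conceptual capabilities of LLMs in Turkish. At its core is a curated dataset of 6200 multiple-choice questions across 62 sections, selected from a pool of 280000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. The benchmark offers a transparent, reproducible, and culturally relevant tool for evaluating model performance and serves as a standard framework for Turkish natural language processing (NLP) research, supporting the development of more robust and accurate language models.
Link: https://arxiv.org/abs/2501.00593
Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Banu Diri, Savaş Yıldırım, Öner Aytaş
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 2 tables, submitted to arXiv for review. Includes a comprehensive evaluation framework for Turkish NLP tasks and state-of-the-art LLM evaluations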
Abstract:Language models have made remarkable advancements in understanding and generating human language, achieving notable success across a wide array of applications. However, evaluating these models remains a significant challenge, particularly for resource-limited languages such as Turkish. To address this gap, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is constructed from a carefully curated dataset comprising 6200 multiple-choice questions across 62 sections, selected from a pool of 280000 questions spanning 67 disciplines and over 800 topics within the Turkish education system. This benchmark provides a transparent, reproducible, and culturally relevant tool for evaluating model performance. It serves as a standard framework for Turkish NLP research, enabling detailed analyses of LLMs’ capabilities in processing Turkish text and fostering the development of more robust and accurate language models. In this study, we evaluate state-of-the-art LLMs on TR-MMLU, providing insights into their strengths and limitations for Turkish-specific tasks. Our findings reveal critical challenges, such as the impact of tokenization and fine-tuning strategies, and highlight areas for improvement in model design. By setting a new standard for evaluating Turkish language models, TR-MMLU aims to inspire future innovations and support the advancement of Turkish NLP research.
[NLP-82] Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders
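[Quick read]: This paper addresses the challenge of aligning the behavior of large language models (LLMs) with human values in critical applications. Current methods such as Reinforcement Learning from Human Feedback (RLHF) typically focus on a limited set of values and are resource-intensive, and the correlations between values have been largely overlooked and underutilized. The key to the proposed solution is mining a causal graph that reveals the implicit relationships among the values within an LLM. Guided by this graph, the authors implement two lightweight value-steering mechanisms, prompt template steering and Sparse Autoencoder feature steering, and analyze how changing one value dimension affects the others. Extensive experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of these steering methods.
Link: https://arxiv.org/abs/2501.00581
Authors: Yipeng Kang,Junqi Wang,Yexin Li,Fangwei Zhong,Xue Feng,Mengmeng Wang,Wenming Tu,Quansen Wang,Hengli Li,Zilong Zheng
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: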
Abstract:As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often focus on a limited set of values and can be resource-intensive. Furthermore, the correlation between values has been largely overlooked and remains underutilized. Our framework addresses this limitation by mining a causal graph that elucidates the implicit relationships among various values within the LLMs. Leveraging the causal graph, we implement two lightweight mechanisms for value steering: prompt template steering and Sparse Autoencoder feature steering, and analyze the effects of altering one value dimension on others. Extensive experiments conducted on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our steering methods.
[NLP-83] KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities
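[Quick read]: This paper addresses two main problems in document-level relation extraction (Doc-RE): existing methods lack the ability to use external knowledge for comprehensive reasoning, especially on long documents, and they typically optimize a single reasoning ability rather than the complex cross-sentence interactions among entities, context, and external general knowledge. The proposed knowledge retrieval augmented method, KnowRA, works in three steps. First, it constructs a document graph for semantic encoding and integrates a co-reference resolution model to strengthen co-reference reasoning. Second, it expands the document graph into a document knowledge graph by retrieving from an external knowledge base and introduces an axis attention mechanism; these improve common-sense and logical reasoning, respectively. Finally, a knowledge filtering method in the common-sense and co-reference reasoning modules filters out irrelevant knowledge. Experiments on two datasets show that the method outperforms state-of-the-art baselines.
Link: https://arxiv.org/abs/2501.00571
Authors: Chengcheng Mai,Yuxiang Wang,Ziyu Gong,Hanxiang Wang,Yihua Huang
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: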
Abstract:Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Therefore, Doc-RE requires more comprehensive reasoning abilities like humans, involving complex cross-sentence interactions between entities, contexts, and external general knowledge, compared to the sentence-level RE. However, most existing Doc-RE methods focus on optimizing single reasoning ability, but lack the ability to utilize external knowledge for comprehensive reasoning on long documents. To solve these problems, a knowledge retrieval augmented method, named KnowRA, was proposed with comprehensive reasoning to autonomously determine whether to accept external knowledge to assist DocRE. Firstly, we constructed a document graph for semantic encoding and integrated the co-reference resolution model into KnowRA to augment the co-reference reasoning ability. Then, we further expanded the document graph into a document knowledge graph by retrieving the external knowledge base and introduced the axis attention mechanism into KnowRA to improve its common-sense and logical reasoning abilities, respectively. Finally, a knowledge filtering method was presented in the common-sense and co-reference reasoning module to filter out irrelevant knowledge. Extensive experiments conducted on two datasets verified the effectiveness of our method compared to the state-of-the-art baselines. Our code is available at this https URL.
[NLP-84] An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems
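[Quick read]: This paper discusses how Large Language Models (LLMs) can be used to tackle open-ended problems that traditional methods handle poorly. Traditional implementation generation methods require algorithmic specifications and can use only static domain knowledge, such as performance metrics and libraries of basic building blocks. LLMs could instead support a broader range of problem-solving activities, including problem framing, exploring possible solving approaches, feature elaboration and combination, more advanced implementation assessment, and handling unexpected situations. The report summarizes current work on LLMs, including model prompting, Reinforcement Learning, and Retrieval-Augmented Generation, and discusses future research requirements. The key idea is to use the generative capabilities of LLMs and their dynamic integration of knowledge to support more flexible and innovative problem-solving strategies.
Link: https://arxiv.org/abs/2501.00562
Authors: Hashmath Shaik,Alex Doboli
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: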
Abstract:Large Language Models offer new opportunities to devise automated implementation generation methods that can tackle problem solving activities beyond traditional methods, which require algorithmic specifications and can use only static domain knowledge, like performance metrics and libraries of basic building blocks. Large Language Models could support creating new methods to support problem solving activities for open-ended problems, like problem framing, exploring possible solving approaches, feature elaboration and combination, more advanced implementation assessment, and handling unexpected situations. This report summarized the current work on Large Language Models, including model prompting, Reinforcement Learning, and Retrieval-Augmented Generation. Future research requirements were also discussed.
[NLP-85] Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
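[Quick read]: This paper studies how to automatically evaluate and rank different large language models (LLMs) in order to better understand their performance and alignment with human preferences. Because human evaluation is costly and time-consuming, an automatic LLM bencher is indispensable. A bencher consists of four components: the input set (e.g., user instructions), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). Previous work has not thoroughly explored how to select these components or how their combinations affect the results. Through controlled experiments, the paper offers a series of recommendations on choosing each component to better automate LLM evaluation. It also finds that bencher performance declines sharply when the LLMs being evaluated have similar performance, exposing the limitations of current benchers and calling for future work. Finally, an evaluation model's instance-level performance (e.g., its accuracy in selecting the best output) does not always align with its effectiveness as a bencher component, which highlights the importance of dedicated system-level evaluation.
Link: https://arxiv.org/abs/2501.00560
Authors: Mingqi Gao,Yixin Liu,Xinyu Hu,Xiaojun Wan,Jonathan Bragg,Arman Cohan
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: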
Abstract:Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models’ performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.
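To make the aggregation step concrete, here is a minimal sketch of how pairwise judgments could be folded into an Elo-style ranking. It is not the paper's implementation: the K-factor of 32, the starting rating of 1000, the model names, and the judgments themselves are all illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A beats B, given both ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_rank(
    comparisons: Iterable[Tuple[str, str, float]],
    k: float = 32.0,          # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Fold (system_a, system_b, score_a) judgments into ratings.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for a, b, score_a in comparisons:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: what A gains, B loses.
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return ratings


# Hypothetical judgments from an evaluation model (names are made up).
judgments = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 1.0),
    ("model_x", "model_z", 1.0),
]
ratings = elo_rank(judgments)
ranking: List[str] = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because sequential Elo updates depend on the order of the comparisons, systems with similar strength can swap places between runs, which is one plausible reason a ranking of closely matched LLMs is harder to get right.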
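[NLP-86] AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects
[Quick read]: This paper addresses the lack of knowledge evaluation for large language models (LLMs) in non-English languages. Existing benchmarks predominantly focus on English, yet many LLMs are multilingual, so English-only benchmarks are insufficient. The authors introduce AraSTEM, an Arabic multiple-choice question dataset for evaluating LLMs' knowledge in STEM (science, technology, engineering, and mathematics) subjects. The dataset spans a range of topics at different levels and requires models to demonstrate a deep understanding of scientific Arabic. The findings show that publicly available models of varying sizes struggle with this dataset, underscoring the need for more localized language models. The dataset is freely accessible on Hugging Face.
Link: https://arxiv.org/abs/2501.00559
Authors: Ahmad Mustapha,Hadi Al-Khansa,Hadi Al-Mubasher,Aya Mourad,Ranam Hamoud,Hasan El-Husseini,Marwah Al-Sakkaf,Mariette Awad
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: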
Abstract:Large Language Models (LLMs) have shown remarkable capabilities, not only in generating human-like text, but also in acquiring knowledge. This highlights the need to go beyond the typical Natural Language Processing downstream benchmarks and assess the various aspects of LLMs including knowledge and reasoning. Numerous benchmarks have been developed to evaluate LLMs' knowledge, but they predominantly focus on the English language. Given that many LLMs are multilingual, relying solely on benchmarking English knowledge is insufficient. To address this issue, we introduce AraSTEM, a new Arabic multiple-choice question dataset aimed at evaluating LLMs' knowledge in STEM subjects. The dataset spans a range of topics at different levels which requires models to demonstrate a deep understanding of scientific Arabic in order to achieve high accuracy. Our findings show that publicly available models of varying sizes struggle with this dataset, underscoring the need for more localized language models. The dataset is freely accessible on Hugging Face.
[NLP-87] MCP-Solver: Integrating Language Models with Constraint Programming Systems
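[Quick read]: This paper addresses the weakness of Large Language Models (LLMs) in precise formal reasoning and rigorous problem specification. Although LLMs perform exceptionally well at natural language tasks, they struggle with tasks that demand strict logical reasoning and formalization. The paper presents MCP-Solver, a prototype implementation of the Model Context Protocol that aims at systematic integration between LLMs and constraint programming systems. The key is the set of interfaces MCP-Solver provides for creating, editing, and validating a constraint model; an item-based editing approach with integrated validation ensures model consistency at every modification step and enables structured iterative refinement. The system also handles concurrent solving sessions and maintains a persistent knowledge base of modeling insights. Initial experiments suggest that this integration can effectively combine LLMs' natural language understanding with constraint-solving capabilities, a first step toward principled integration of natural language processing with constraint-based reasoning.
Link: https://arxiv.org/abs/2501.00539
Authors: Stefan Szeider
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: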
Abstract:While Large Language Models (LLMs) perform exceptionally well at natural language tasks, they often struggle with precise formal reasoning and the rigorous specification of problems. We present MCP-Solver, a prototype implementation of the Model Context Protocol that demonstrates the potential for systematic integration between LLMs and constraint programming systems. Our implementation provides interfaces for the creation, editing, and validation of a constraint model. Through an item-based editing approach with integrated validation, the system ensures model consistency at every modification step and enables structured iterative refinement. The system handles concurrent solving sessions and maintains a persistent knowledge base of modeling insights. Initial experiments suggest that this integration can effectively combine LLMs’ natural language understanding with constraint-solving capabilities. Our open-source implementation is proof of concept for integrating formal reasoning systems with LLMs through standardized protocols. While further research is needed to establish comprehensive formal guarantees, this work takes a first step toward principled integration of natural language processing with constraint-based reasoning.
[NLP-88] Superposition in Transformers: A Novel Way of Building Mixture of Experts
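[Quick read]: This paper addresses catastrophic forgetting when adapting large language models (LLMs) to new tasks or domains. Conventional fine-tuning often overwrites existing knowledge, degrading performance on the original tasks. The proposed solution, Superposition in Transformers, is a novel architecture that uses autoencoders to superimpose the hidden representations of a base model and a fine-tuned model within a shared parameter space. Using B-spline-based blending coefficients and autoencoders that adaptively reconstruct hidden states based on the input data distribution, the method mitigates catastrophic forgetting and enables a new paradigm of "in-model" superposition. It preserves the original model's capabilities while allowing compact domain-specific expertise to be added, and it supports dynamic switching between model states during inference.
Link: https://arxiv.org/abs/2501.00530
Authors: Ayoub Ben Chaliah,Hela Dellagi
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: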
Abstract:Catastrophic forgetting remains a major challenge when adapting large language models (LLMs) to new tasks or domains. Conventional fine-tuning often overwrites existing knowledge, causing performance degradation on original tasks. We introduce Superposition in Transformers, a novel architecture that leverages autoencoders to superimpose the hidden representations of a base model and a fine-tuned model within a shared parameter space. By using B-spline-based blending coefficients and autoencoders that adaptively reconstruct hidden states based on the input data distribution, our method effectively mitigates catastrophic forgetting and enables a new paradigm of “in-model” superposition. This approach preserves original model capabilities while allowing compact domain-specific expertise to be added, and it supports dynamic switching between model states during inference.
[NLP-89] Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
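[Quick read]: This paper starts from the observation that, for reasons of convenience and limited tech literacy, speakers of low-resource languages such as Sinhala widely write in Romanized form (using Roman letters in place of the native script) instead of using localization tools. The study focuses on transliterating Romanized Sinhala. Two methods are proposed: a rule-based method that serves as the baseline, and a second method that treats transliteration as a sequence-to-sequence task akin to Neural Machine Translation (NMT), implemented as a Transformer-based Encoder-Decoder. The study finds that the Transformer-based method captures ad-hoc patterns in Romanized text better than the rule-based method.
Link: https://arxiv.org/abs/2501.00529
Authors: Yomal De Mel,Kasun Wickramasinghe,Nisansa de Silva,Surangika Ranathunga
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 7 tables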
Abstract:Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is eminently prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: Our baseline is a rule-based method, which is then compared against our second method where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We witnessed that the Transformer-based method could grab many ad-hoc patterns within the Romanized scripts compared to the rule-based method. The code base associated with this paper is available on GitHub - this https URL
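For intuition about the rule-based baseline, the sketch below shows a greedy longest-match transliterator. It is only an illustration under stated assumptions: the four-entry rule table is a toy stand-in rather than the authors' rule set, and the function name `transliterate` is hypothetical.

```python
from typing import Dict, List

# Toy rule table (an assumption for illustration): Romanized chunk -> Sinhala letter.
RULES: Dict[str, str] = {
    "ka": "ක",
    "ma": "ම",
    "la": "ල",
    "a": "අ",
}


def transliterate(text: str, rules: Dict[str, str] = RULES) -> str:
    """Greedy longest-match transliteration; unknown characters pass through."""
    max_len = max(len(key) for key in rules)
    out: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest chunk first so "ka" wins over "a".
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("kamala"))  # -> කමල
```

A fixed table like this cannot cope with writers who spell the same sound in several ad-hoc ways, which is the gap the sequence-to-sequence model is reported to close.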
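[NLP-90] TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment
[Quick read]: This paper addresses the high cost of training language models (LMs) and their application agents, which stems from large datasets and models and makes test failures expensive. It proposes simplified language environments that reduce the noise and complexity of language datasets while preserving the essential characteristics of the text distribution, with the aim of improving LM learning efficiency and reducing the model size and data volume needed for training and evaluation. The key is a pipeline that refines text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., books, conversation, code). It yields a leaner suite of training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue, and Leaner-Eval. Experiments show that leaner pre-training boosts LM learning efficiency, and that tiny LMs trained on these datasets outperform those trained on the original datasets in instruction following.
Link: https://arxiv.org/abs/2501.00522
Authors: Ke Yang,Volodymyr Kindratenko,ChengXiang Zhai
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: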
Abstract:Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset’s alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at this https URL. 
[NLP-91] Fotheidil: an Automatic Transcription System for the Irish Language COLING2025
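[Quick read]: This paper addresses automatic transcription for the Irish language, in particular automatic speech recognition (ASR) and capitalisation and punctuation restoration. The key is Fotheidil, a web-based transcription system that integrates several speech-related AI technologies. It combines off-the-shelf pre-trained voice activity detection and speaker diarisation models with models trained specifically for Irish ASR and for capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements on out-of-domain test sets and on dialects underrepresented in the supervised training set. For capitalisation and punctuation restoration, a novel approach based on sequence-to-sequence models is compared with a conventional classification model and also shows substantial improvements. The system will be made freely available to the public, and the ASR model is expected to improve incrementally in a cyclical, community-driven fashion.
Link: https://arxiv.org/abs/2501.00509
Authors: Liam Lonergan,Ibon Saratxaga,John Sloan,Oscar Maharog,Mengjie Qian,Neasa Ní Chiaráin,Christer Gobl,Ailbhe Ní Chasaide
Institutions: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to the 5th Celtic Language Technology Workshop within COLING 2025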
Abstract:This paper sets out the first web-based transcription system for the Irish language - Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. The system includes both off-the-shelf pre-trained voice activity detection and speaker diarisation models and models trained specifically for Irish automatic speech recognition and capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements for out-of-domain test sets and dialects that are underrepresented in the supervised training set. A novel approach to capitalisation and punctuation restoration involving sequence-to-sequence models is compared with the conventional approach using a classification model. Experimental results show here also substantial improvements in performance. The system will be made freely available for public use, and represents an important resource to researchers and others who transcribe Irish language materials. Human-corrected transcriptions will be collected and included in the training dataset as the system is used, which should lead to incremental improvements to the ASR model in a cyclical, community-driven fashion.
[NLP-92] Two Cases of Deduction with Non-referring Descriptions
[Quick read]: This paper addresses formal reasoning with non-denoting terms, especially non-referring descriptions such as "the King of France". Existing work mostly uses free logic and sequent calculus; this paper offers an alternative framed in partial type theory with natural deduction in sequent style. Using a Montague- and Tichý-style formalization of natural language, it handles deduction with intensional transitives whose complements are non-referring descriptions, and derives Strawsonian rules for the existential presuppositions of sentences containing such descriptions. The key is the use of partial type theory with sequent-style natural deduction in place of free logic and sequent calculus.
Link: https://arxiv.org/abs/2501.00485
Authors: Jiří Raclavský (Masaryk University, Brno, Czech Republic)
Institutions: Unknown
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
Comments: In Proceedings NCL’24, arXiv:2412.20053
Abstract:Formal reasoning with non-denoting terms, esp. non-referring descriptions such as “the King of France”, is still an under-investigated area. The recent exception being a series of papers e.g. by Indrzejczak, Zawidzki and Krbis. The present paper offers an alternative to their approach since instead of free logic and sequent calculus, it’s framed in partial type theory with natural deduction in sequent style. Using a Montague- and Tichý-style formalization of natural language, the paper successfully handles deduction with intensional transitives whose complements are non-referring descriptions, and derives Strawsonian rules for existential presuppositions of sentences with such descriptions.
[NLP-93] Differentiable Prompt Learning for Vision Language Models
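[Quick read]: This paper asks how to automate the design of deep continuous prompts so that large-scale pre-trained foundation models perform better on downstream tasks. Manually designed deep continuous prompts substantially improve over the zero-shot pre-trained model, but whether the manual design is optimal remains underexplored. The proposed method, differentiable prompt learning (DPL), formulates prompt design as an optimization problem that automatically determines the context length of the prompt to add to each layer, with the objective of maximizing downstream performance. Using only limited data, DPL can find a deep continuous prompt configuration with high confidence, and because it only sets the prompt configuration it is compatible with existing baseline methods and can further boost their performance. On 11 datasets, DPL raises average test accuracy by 2.60% compared to baseline methods.
Link: https://arxiv.org/abs/2501.00457
Authors: Zhenhan Huang,Tejaswini Pedapati,Pin-Yu Chen,Jianxi Gao
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: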
Abstract:Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundational models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only in the input but also in the intermediate hidden representations. Manually designed deep continuous prompts exhibit a remarkable improvement compared to the zero-shot pre-trained model on downstream tasks. How to automate the continuous prompt design is an underexplored area, and a fundamental question arises, is manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). The DPL method is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer, where the objective is to maximize the performance. We test the DPL method on the pre-trained CLIP. We empirically find that by using only limited data, our DPL method can find deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. Besides, our method focuses only on the prompt configuration (i.e. context length for each layer), which means that our method is compatible with the baseline methods that have sophisticated designs to boost the performance. The DPL method can be deployed to large language models or computer vision models at no cost.
[NLP-94] Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning ECCV2024
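[Quick read]: This paper addresses semantic misalignment between synthetic images and text in zero-shot image captioning, where only text data is available for training. Synthetic images generated by pre-trained text-to-image diffusion models contain defective details in salient regions, which makes the image and text semantically inconsistent and degrades captioning quality.
The key to the solution is a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism that adaptively mitigates unfaithful content in synthetic images in a fine-grained manner. PCM-Net first detects the salient visual concepts of the input image in CLIP space, then selectively fuses the patch-wise visual features of the image with the textual features of those concepts, producing a mixed-up feature map with less defective content. A visual-semantic encoder refines the feature map, which is then fed into the sentence decoder to generate the caption. To facilitate training on synthetic data, the paper also devises a CLIP-weighted cross-entropy loss that prioritizes high-quality image-text pairs. Experiments on MSCOCO and Flickr30k show that PCM-Net outperforms state-of-the-art VLM-based approaches and ranks first in both in-domain and cross-domain zero-shot image captioning.
Link: https://arxiv.org/abs/2501.00437
Authors: Jianjie Luo,Jingwen Chen,Yehao Li,Yingwei Pan,Jianlin Feng,Hongyang Chao,Ting Yao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: ECCV 2024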
Abstract:Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion model presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most of encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at this https URL.
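The abstract does not spell out the form of the CLIP-weighted cross-entropy loss. As a rough sketch of the general idea only, the code below weights each caption's token-level cross-entropy by a softmax over per-pair image-text similarity scores, so better-aligned pairs contribute more. The softmax normalization, the temperature, and the function names are assumptions, not the paper's formulation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a batch of similarity scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(
    token_log_probs: Sequence[Sequence[float]],
    similarities: Sequence[float],
    temperature: float = 1.0,  # assumed hyperparameter
) -> float:
    """Similarity-weighted caption loss for one batch.

    token_log_probs[i] holds the log-probabilities the decoder assigned to
    the reference tokens of caption i; similarities[i] is an image-text
    similarity score for pair i (e.g. from a CLIP-like model).
    """
    # Mean negative log-likelihood per caption.
    per_sample = [-sum(lp) / len(lp) for lp in token_log_probs]
    # Higher-similarity pairs receive larger weights; weights sum to 1.
    weights = softmax(similarities, temperature)
    return sum(w * loss for w, loss in zip(weights, per_sample))


# Illustrative batch: one well-fit caption and one poorly-fit caption.
log_probs = [[math.log(0.9)] * 2, [math.log(0.1)] * 2]
print(weighted_cross_entropy(log_probs, [0.9, 0.1], temperature=0.1))
```

With equal similarity scores the weights are uniform and the function reduces to the plain batch-mean cross-entropy.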
[NLP-95] Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents
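[Quick read]: This paper addresses the insufficient accuracy and degeneration of thought that large language models (LLMs) exhibit on complex scientific reasoning tasks. The authors propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) framework. Its key is a multi-path reasoning mechanism in which each path pairs a reactive agent with a reflection agent that collaborate to prevent the degeneration of thought inherent in relying on a single agent. RR-MP requires no additional training; it uses multiple dialogue instances for each reasoning path and a separate summarizer that consolidates insights from all paths. In zero-shot and few-shot evaluations on moral scenarios, college-level physics, and mathematics, RR-MP outperforms baseline approaches.
Link: https://arxiv.org/abs/2501.00430
Authors: Chengbo He,Bochao Zou,Xin Li,Jiansheng Chen,Junliang Xing,Huimin Ma
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: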
Abstract:Agents have demonstrated their potential in scientific reasoning tasks through large language models. However, they often face challenges such as insufficient accuracy and degeneration of thought when handling complex reasoning tasks, which impede their performance. To overcome these issues, we propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) Framework, aimed at enhancing the reasoning capabilities of LLMs. Our approach improves scientific reasoning accuracy by employing a multi-path reasoning mechanism where each path consists of a reactive agent and a reflection agent that collaborate to prevent degeneration of thought inherent in single-agent reliance. Additionally, the RR-MP framework does not require additional training; it utilizes multiple dialogue instances for each reasoning path and a separate summarizer to consolidate insights from all paths. This design integrates diverse perspectives and strengthens reasoning across each path. We conducted zero-shot and few-shot evaluations on tasks involving moral scenarios, college-level physics, and mathematics. Experimental results demonstrate that our method outperforms baseline approaches, highlighting the effectiveness and advantages of the RR-MP framework in managing complex scientific reasoning tasks.
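The control flow described above can be sketched as follows. This is a hedged illustration, not the RR-MP implementation: the agents are plain callables standing in for LLM dialogue instances, the prompt strings are invented, and a majority vote replaces the paper's separate summarizer.

```python
from collections import Counter
from typing import Callable, List, Tuple

# An "agent" here is any function from a prompt to a reply (stand-in for an LLM).
Agent = Callable[[str], str]


def run_path(question: str, reactive: Agent, reflect: Agent, rounds: int = 1) -> str:
    """One reasoning path: the reactive agent answers, the reflection agent
    critiques, and the reactive agent revises using that critique."""
    answer = reactive(question)
    for _ in range(rounds):
        critique = reflect(f"Question: {question}\nAnswer: {answer}")
        answer = reactive(f"{question}\nCritique: {critique}")
    return answer


def summarize(answers: List[str]) -> str:
    """Stand-in summarizer (assumption): pick the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def multi_path(question: str, paths: List[Tuple[Agent, Agent]], rounds: int = 1) -> str:
    """Run every (reactive, reflection) pair independently, then consolidate."""
    return summarize([run_path(question, r, f, rounds) for r, f in paths])


# Stub agents for demonstration only; real agents would call a language model.
def careless(prompt: str) -> str:
    return "4" if "Critique" in prompt else "5"

def steady(prompt: str) -> str:
    return "4"

def stubborn(prompt: str) -> str:
    return "3"

def critic(prompt: str) -> str:
    return "Re-check the arithmetic."

print(multi_path("What is 2 + 2?", [(careless, critic), (steady, critic), (stubborn, critic)]))
```

Keeping each path in its own dialogue instance means one path's mistake cannot contaminate another before the consolidation step.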
[NLP-96] Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages
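[Quick read]: This paper addresses Speech-to-Text and Automatic Speech Recognition (ASR) in low-resource languages, focusing on Arabic, Russian, and Portuguese, which have many dialects and accents. Because validated datasets are scarce and dialects are diverse, ASR models perform markedly worse on these languages. The paper introduces an end-to-end framework that uses data augmentation techniques to enhance ASR systems fine-tuned on Wav2Vec2. Evaluated on the Arabic, Russian, and Portuguese datasets from Mozilla's Common Voice project, the framework outperforms the pre-trained Wav2Vec2 and Whisper ASR models, with an average relative improvement of 33.9% in Word Error Rate (WER) and a 53.2% relative improvement in Character Error Rate (CER). The key is using data augmentation to make the model more robust on low-resource languages, particularly across dialects and pronunciation variants.
Link: https://arxiv.org/abs/2501.00425
Authors: Or Haim Anidjar,Revital Marbel,Roi Yozevitch
Institutions: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 15 pages, 3 figures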
Abstract:Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties, being low-resource languages due to the many dialects of these languages across different continents worldwide. Moreover, the variety of accents and pronunciations of such languages complicate ASR models’ success. With the increasing popularity of Deep Learning and Transformers, acoustic models like the renowned Wav2Vec2 have achieved superior performance in the Speech Recognition field compared to state-of-the-art approaches. However, despite Wav2Vec2’s improved efficiency over traditional methods, its performance significantly declines for under-represented languages, even though it requires significantly less labeled data. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate our framework’s effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla’s Common Voice project in Arabic, Russian, and Portuguese. Additionally, the framework presented in this paper demonstrates robustness to different diacritics. Ultimately, our approach outperforms two previous baseline models, which are the pre-trained Wav2Vec2 and the well-known Whisper ASR model, resulting in an average relative improvement of 33.9% in Word Error Rate and a 53.2% relative improvement in Character Error Rate.
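Since the headline numbers are relative improvements in WER and CER, a short sketch of how these quantities are conventionally computed may help. It uses the standard edit-distance definitions; the example sentences and error rates are made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which an error rate dropped relative to the baseline."""
    return (baseline - new) / baseline


print(wer("the cat sat on the mat", "the cat sit on mat"))
# Illustrative numbers only: a drop from 30% to 20% WER is a ~33% relative gain.
print(relative_improvement(0.30, 0.20))
```

Note that a relative improvement of 33.9% means the error rate fell by about a third of its baseline value, not by 33.9 percentage points.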
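[NLP-97] TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
[Quick read]: This paper aims to improve the zero-shot audio classification performance of audio-language models (ALMs). In zero-shot classification, a model classifies previously unseen audio clips at test time, typically by using descriptive natural language prompts. The proposed TSPE (Task-Specific Prompt Ensemble) is a training-free hard prompting method that customizes context-rich prompts for different audio classification tasks. Rather than using generic template prompts such as "Sound of a car", TSPE uses label information to identify suitable sound attributes (e.g., "loud" and "feeble") and sound sources (e.g., "tunnel" and "street") and incorporates them into the prompts. To enhance audio-text alignment, TSPE also ensembles the generated task-specific prompts. On 12 diverse audio classification datasets, TSPE yields an absolute improvement of 1.23-16.36% over vanilla zero-shot evaluation.
Link: https://arxiv.org/abs/2501.00398
Authors: Nishit Anand,Ashish Seth,Ramani Duraiswami,Dinesh Manocha
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages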
Abstract:Audio-language models (ALMs) excel in zero-shot audio classification, a task where models classify previously unseen audio clips at test time by leveraging descriptive natural language prompts. We introduce TSPE (Task-Specific Prompt Ensemble), a simple, training-free hard prompting method that boosts ALMs’ zero-shot performance by customizing prompts for diverse audio classification tasks. Rather than using generic template-based prompts like “Sound of a car”, we generate context-rich prompts, such as “Sound of a car coming from a tunnel”. Specifically, we leverage label information to identify suitable sound attributes, such as “loud” and “feeble”, and appropriate sound sources, such as “tunnel” and “street”, and incorporate this information into the prompts used by Audio-Language Models (ALMs) for audio classification. Further, to enhance audio-text alignment, we perform prompt ensemble across TSPE-generated task-specific prompts. When evaluated on 12 diverse audio classification datasets, TSPE improves performance across ALMs by showing an absolute improvement of 1.23-16.36% over vanilla zero-shot evaluation.
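A prompt ensemble of this kind might look like the sketch below. It rests on several assumptions: the prompt templates are modeled on the two examples quoted in the abstract, the ensemble is taken as the mean of L2-normalized text embeddings (the abstract does not say how prompts are combined), and `toy_embed` is a letter-count placeholder for a real audio-language text encoder.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def build_prompts(label: str, attributes: Sequence[str], sources: Sequence[str]) -> List[str]:
    """Generic template plus attribute- and source-enriched variants."""
    prompts = [f"Sound of a {label}"]
    prompts += [f"Sound of a {attr} {label}" for attr in attributes]
    prompts += [f"Sound of a {label} coming from a {src}" for src in sources]
    return prompts


def normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def ensemble_embedding(prompts: Sequence[str], embed_text: Callable[[str], Vector]) -> Vector:
    """Average the normalized prompt embeddings, then renormalize (assumed scheme)."""
    vectors = [normalize(embed_text(p)) for p in prompts]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(mean)


def classify(audio_vec: Sequence[float], class_vecs: Dict[str, Vector]) -> str:
    """Pick the class whose ensembled text vector best matches the audio vector."""
    audio = normalize(audio_vec)
    scores = {label: sum(a * b for a, b in zip(audio, vec)) for label, vec in class_vecs.items()}
    return max(scores, key=scores.get)


def toy_embed(text: str) -> Vector:
    """Placeholder encoder (assumption): letter counts instead of a real model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


car_prompts = build_prompts("car", ["loud", "feeble"], ["tunnel", "street"])
car_vector = ensemble_embedding(car_prompts, toy_embed)
print(car_prompts)
```

In practice `toy_embed` would be replaced by the text tower of an audio-language model, and `classify` would receive the matching audio embedding.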
[NLP-98] Efficient Relational Context Perception for Knowledge Graph Completion
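[Quick read]: This paper addresses the incompleteness of Knowledge Graphs (KGs), specifically inferring missing facts through knowledge graph completion (KGC). Existing knowledge graph embedding models are limited in capturing expressive features, especially compared with deeper, multi-layer models, and they assign a single static embedding to each entity and relation, ignoring that entities and relations behave differently in different graph contexts. Methods that do model context rely on complex non-linear encoders such as transformers, at high computational cost. The paper proposes the Triple Receptance Perception (TRP) architecture, which models sequential information to learn the dynamic context of entities and relations, and then uses tensor decomposition to compute triple scores for robust relational decoding. Link prediction and triple classification experiments on YAGO3-10, UMLS, FB15k, and FB13 show that the method outperforms several state-of-the-art models.
Link: https://arxiv.org/abs/2501.00397
Authors: Wenkai Tu,Guojia Wan,Zhengchun Shang,Bo Du
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: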
Abstract:Knowledge Graphs (KGs) provide a structured representation of knowledge but often suffer from challenges of incompleteness. To address this, link prediction or knowledge graph completion (KGC) aims to infer missing new facts based on existing facts in KGs. Previous knowledge graph embedding models are limited in their ability to capture expressive features, especially when compared to deeper, multi-layer models. These approaches also assign a single static embedding to each entity and relation, disregarding the fact that entities and relations can exhibit different behaviors in varying graph contexts. Because the context around a fact triple in a KG is complex, existing methods have to leverage complex non-linear context encoders, such as transformers, to project entities and relations into low-dimensional representations, resulting in high computation cost. To overcome these limitations, we propose the Triple Receptance Perception (TRP) architecture to model sequential information, enabling the learning of dynamic context of entities and relations. Then we use tensor decomposition to calculate triple scores, providing robust relational decoding capabilities. This integration allows for more expressive representations. Experiments on benchmark datasets such as YAGO3-10, UMLS, FB15k, and FB13 in link prediction and triple classification tasks demonstrate that our method performs better than several state-of-the-art models, proving the effectiveness of the integration.
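The abstract does not say which tensor decomposition is used for scoring. As a rough illustration only, the sketch below uses the simplest member of that family, a DistMult/CP-style trilinear product, which is an assumption rather than the paper's decoder. The function names and the toy embeddings are hypothetical.

```python
def trilinear_score(head, relation, tail):
    """DistMult/CP-style triple score: sum_i head[i] * relation[i] * tail[i]."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("embeddings must share one dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_tails(head, relation, candidates):
    """Sort candidate tail entities by descending score, as in link prediction."""
    return sorted(
        candidates,
        key=lambda name: trilinear_score(head, relation, candidates[name]),
        reverse=True,
    )
```

In a context-aware model such as the one described, `head` and `relation` would be the context-dependent representations produced by the encoder rather than static lookup-table embeddings.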
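[NLP-99] Trajectories of Change: Approaches for Tracking Knowledge Evolution
[Quick read]: This paper explores the local versus global evolution of knowledge systems through the framework of socio-epistemic networks (SEN), which combines three interconnected layers (social, semiotic (material), and semantic) into a multilayered approach to understanding the structural development of knowledge. The approach has two parts. First, information-theoretic measures based on relative entropy detect semantic shifts, assess their significance, and identify key driving features. Second, variations in document embedding densities reveal changes in semantic neighbourhoods, tracking whether concentrations of similar documents increase, remain stable, or disperse. Together these methods trace document trajectories based on content (topics) or metadata (authorship, institution). Case studies of Joseph Silk and Hans-Jürgen Treder illustrate the applications, limitations, and further potential of the approach in general relativity and gravitation research.
Link: https://arxiv.org/abs/2501.00391
Authors: Raphael Schlattmann,Malte Vogl
Affiliations: Unknown
Categories: Computation and Language (cs.CL); History and Philosophy of Physics (physics.hist-ph)
Comments: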
Abstract:We explore local vs. global evolution of knowledge systems through the framework of socio-epistemic networks (SEN), applying two complementary methods to a corpus of scientific texts. The framework comprises three interconnected layers (social, semiotic (material), and semantic), proposing a multilayered approach to understanding structural developments of knowledge. To analyse diachronic changes on the semantic layer, we first use information-theoretic measures based on relative entropy to detect semantic shifts, assess their significance, and identify key driving features. Second, variations in document embedding densities reveal changes in semantic neighbourhoods, tracking how concentrations of similar documents increase, remain stable, or disperse. This enables us to trace document trajectories based on content (topics) or metadata (authorship, institution). Case studies of Joseph Silk and Hans-Jürgen Treder illustrate how individual scholars’ work aligns with broader disciplinary shifts in general relativity and gravitation research, demonstrating the applications, limitations, and further potential of this approach.
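The relative-entropy step can be illustrated with a small sketch. It assumes unigram word distributions with additive smoothing over a shared vocabulary, which may differ from the features the authors actually use. The per-word terms of the Kullback-Leibler divergence indicate which words drive a shift between two time slices.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_contributions(later_tokens, earlier_tokens, alpha=1.0):
    """Per-word terms p(w) * log2(p(w) / q(w)) of D_KL(later || earlier)."""
    vocabulary = set(later_tokens) | set(earlier_tokens)
    p = word_distribution(later_tokens, vocabulary, alpha)
    q = word_distribution(earlier_tokens, vocabulary, alpha)
    return {w: p[w] * math.log2(p[w] / q[w]) for w in vocabulary}


def relative_entropy(later_tokens, earlier_tokens, alpha=1.0):
    """Total divergence in bits; zero when both slices have the same distribution."""
    return sum(kl_contributions(later_tokens, earlier_tokens, alpha).values())
```

Sorting the output of `kl_contributions` in descending order gives a simple list of candidate "driving features"; assessing their significance, as the abstract mentions, would need an additional statistical test that this sketch omits.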
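[NLP-100] RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
[Quick read]: This paper addresses two main limitations of current Retrieval-Augmented Generation (RAG) methods: they cover only a limited range of RAG scenarios, and they suffer from limited task diversity because no general RAG dataset exists. The authors propose RAG-Instruct, a general method for synthesizing diverse, high-quality RAG instruction data from any source corpus. It relies on (1) five RAG paradigms that cover diverse query-document relationships and (2) instruction simulation, which draws on the strengths of existing instruction datasets to improve instruction diversity and quality. Using this method, the authors build a 40K-instruction dataset from Wikipedia that comprehensively covers diverse RAG scenarios and tasks. Experiments show that RAG-Instruct effectively enhances the RAG capabilities of large language models (LLMs), achieving strong zero-shot performance and significantly outperforming existing RAG baselines across a range of tasks.
Link: https://arxiv.org/abs/2501.00353
Authors: Wanlong Liu,Junying Chen,Ke Ji,Li Zhou,Wenyu Chen,Benyou Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: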
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only limited RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs’ RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at this https URL.
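[NLP-101] Chunk-Distilled Language Modeling
[Quick read]: This paper targets two problems in current large language models (LLMs): the inefficiency of token-level generation and the difficulty of adapting to new data and knowledge. It proposes Chunk-Distilled Language Modeling (CD-LM), which combines a deep-network LLM with a simple retrieval module so that multi-token text chunks can be generated in a single decoding step. The retrieval framework allows flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models or incorporating expert insights from human-annotated corpora. This adaptability gives finer control over the language model's distribution without additional training. Experiments show that CD-LM improves language model performance and efficiency across a variety of downstream tasks.
Link: https://arxiv.org/abs/2501.00343
Authors: Yanhong Li,Karen Livescu,Jiawei Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: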
Abstract:We introduce Chunk-Distilled Language Modeling (CD-LM), an approach to text generation that addresses two challenges in current large language models (LLMs): the inefficiency of token-level generation, and the difficulty of adapting to new data and knowledge. Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model’s distribution without necessitating additional training. We present the CD-LM formulation along with performance metrics demonstrating its ability to improve language model performance and efficiency across a diverse set of downstream tasks. Code and data will be made publicly available.
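[NLP-102] Rethinking Layer Removal: Preserving Critical Components with Task-Aware Singular Value Decomposition
[Quick read]: This paper addresses the loss of internal consistency and the performance degradation caused by direct layer removal when compressing large language models (LLMs). Layer removal reduces model size and speeds up inference, but because redundancy varies across model architectures, it often leads to unstable behavior and weaker task performance. The proposed solution, Taco-SVD, is a task-aware framework that retains task-critical singular value directions, compressing the model while preserving internal consistency. Its key idea is to use gradient-based attribution methods to align singular values with downstream task objectives, preserving task-critical transformations at minimal computational overhead. Experiments show that Taco-SVD outperforms existing methods in perplexity and task performance across different architectures.
Link: https://arxiv.org/abs/2501.00339
Authors: Kainan Liu,Yong Zhang,Ning Cheng,Zhitao Li,Shaojun Wang,Jing Xiao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: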
Abstract:Layer removal has emerged as a promising approach for compressing large language models (LLMs) by leveraging redundancy within layers to reduce model size and accelerate inference. However, this technique often compromises internal consistency, leading to performance degradation and instability, with varying impacts across different model architectures. In this work, we propose Taco-SVD, a task-aware framework that retains task-critical singular value directions, preserving internal consistency while enabling efficient compression. Unlike direct layer removal, Taco-SVD preserves task-critical transformations to mitigate performance degradation. By leveraging gradient-based attribution methods, Taco-SVD aligns singular values with downstream task objectives. Extensive evaluations demonstrate that Taco-SVD outperforms existing methods in perplexity and task performance across different architectures while ensuring minimal computational overhead.
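[NLP-103] Loss-Aware Curriculum Learning for Chinese Grammatical Error Correction ICASSP2025
[Quick read]: This paper addresses a weakness of existing Chinese Grammatical Error Correction (CGEC) methods: they ignore that correction difficulty varies across samples, which makes model learning harder. The key to the solution is a multi-granularity Curriculum Learning (CL) framework. The framework first computes the correction difficulty of each sample and feeds samples to the model batch by batch from easy to hard; Instance-Level CL then regulates the loss function automatically to help the model optimize in an appropriate direction. Experiments on multiple datasets demonstrate the effectiveness of the method.
Link: https://arxiv.org/abs/2501.00334
Authors: Ding Zhang,Yangning Li,Lichen Bai,Hao Zhang,Yinghui Li,Haiye Lin,Hai-Tao Zheng,Xin Su,Zifei Shan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICASSP 2025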
Abstract:Chinese grammatical error correction (CGEC) aims to detect and correct errors in the input Chinese sentences. Recently, Pre-trained Language Models (PLMs) have been employed to improve the performance. However, current approaches ignore that correction difficulty varies across different instances and treat these samples equally, which makes model learning harder. To address this problem, we propose a multi-granularity Curriculum Learning (CL) framework. Specifically, we first calculate the correction difficulty of these samples and feed them into the model from easy to hard batch by batch. Then Instance-Level CL is employed to help the model optimize in the appropriate direction automatically by regulating the loss function. Extensive experimental results and comprehensive analyses of various datasets prove the effectiveness of our method.
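The easy-to-hard batching step can be sketched as follows. The abstract does not define how correction difficulty is computed (the title suggests a loss-based measure), so `edit_ratio` below is a hypothetical proxy used only to make the example runnable. The instance-level loss regulation is omitted.

```python
def edit_ratio(sample):
    """Hypothetical difficulty proxy: share of character positions that differ."""
    source, target = sample
    mismatches = sum(a != b for a, b in zip(source, target))
    mismatches += abs(len(source) - len(target))  # unmatched tail counts as edits
    return mismatches / max(len(source), len(target), 1)


def curriculum_batches(samples, difficulty, batch_size):
    """Yield batches ordered from the easiest sample to the hardest."""
    ordered = sorted(samples, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

Any scalar difficulty function can be passed as `difficulty`, for example the per-sample loss of a reference model, which would be closer to the loss-aware idea in the title.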
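[NLP-104] MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
[Quick read]: This paper addresses the tendency of large language models (LLMs) to generate outdated or inaccurate information, and in particular the way low-quality retrieved documents (irrelevant or noisy ones) degrade performance, increase computational overhead, and undermine response reliability in Retrieval-Augmented Generation (RAG) systems. It proposes Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a framework in which multiple LLM agents collaboratively filter and score retrieved documents. An adaptive filtering mechanism dynamically adjusts the relevance threshold based on score distributions, minimizing noise while maintaining high recall. MAIN-RAG also relies on inter-agent consensus for robust document selection, without additional training data or fine-tuning. Across four question-answering benchmarks, MAIN-RAG outperforms traditional RAG methods, improving answer accuracy by 2-11% while reducing the number of irrelevant retrieved documents.
Link: https://arxiv.org/abs/2501.00332
Authors: Chia-Yuan Chang,Zhimeng Jiang,Vineeth Rakesh,Menghai Pan,Chin-Chia Michael Yeh,Guanchu Wang,Mingzhi Hu,Zhichao Xu,Yan Zheng,Mahashweta Das,Na Zou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: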
Abstract:Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, the existing RAG systems frequently struggle with the quality of retrieval documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
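A threshold that adapts to the score distribution can be sketched as below. The rule used here (mean minus a multiple of the standard deviation) is an assumption chosen for illustration; the abstract only says the threshold is adjusted "based on score distributions". The relevance scores are taken as given, whereas in the paper they come from LLM agents.

```python
from statistics import mean, pstdev


def adaptive_filter(scored_docs, n_std=0.0):
    """Keep documents scoring at or above mean - n_std * std of all scores.

    scored_docs is a non-empty list of (document, relevance_score) pairs.
    Returns the kept pairs, best first, together with the threshold used.
    """
    scores = [score for _, score in scored_docs]
    threshold = mean(scores) - n_std * pstdev(scores)
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept, threshold
```

Raising `n_std` lowers the threshold and favours recall; `n_std=0.0` keeps only documents scoring at or above the average.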
[NLP-105] Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion ICASSP2025
[Quick read]: This paper examines the limitations of multimodal large language models (MLLMs) in extracting implicit semantic information, using the Multi-modal Entity Set Expansion (MESE) task. MESE aims to expand a handful of seed entities with new entities of the same semantic class, with multi-modal information provided for each entity. The paper introduces LUSAR, a listwise ranking method that maps local scores to global rankings. LUSAR significantly improves MLLM performance on the MESE task; the work marks the first use of a generative MLLM for entity set expansion and extends the applicability of listwise ranking.
Link: https://arxiv.org/abs/2501.00330
Authors: Hebin Wang,Yangning Li,Yinghui Li,Hai-Tao Zheng,Wenhao Jiang,Hong-Gee Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: ICASSP 2025
Abstract:The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, MLLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, and multi-modal information is provided with each entity. We explore the capabilities of MLLMs to understand implicit semantic information at the entity-level granularity through the MESE task, introducing a listwise ranking method LUSAR that maps local scores to global rankings. Our LUSAR demonstrates significant improvements in MLLMs’ performance on the MESE task, marking the first use of generative MLLM for ESE tasks and extending the applicability of listwise ranking.
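[NLP-106] VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition ICASSP2025
[Quick read]: This paper addresses the vulnerability of speaker recognition to variation between enrolment and test utterances, particularly the multi-genre phenomenon in which the utterances belong to different speech genres. Existing Vietnamese speaker recognition resources are either limited in size or do not focus on genre diversity, so multi-genre effects remain underexplored. The paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition, with over 187,000 utterances from 1,406 speakers, together with an automated pipeline for building such a dataset at scale from public sources. Experiments show that models trained on a single-genre dataset perform poorly under the multi-genre phenomenon, and that adding VoxVietnam to training significantly improves performance. The key to the solution is multi-genre training data, which strengthens robustness and recognition in multi-genre conditions.
Link: https://arxiv.org/abs/2501.00328
Authors: Hoang Long Vu,Phuong Tuan Dat,Pham Thao Nhi,Nguyen Song Hao,Nguyen Thi Thu Trang
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)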
Abstract:Recent research in speaker recognition aims to address vulnerabilities due to variations between enrolment and test utterances, particularly in the multi-genre phenomenon where the utterances are in different speech genres. Previous resources for Vietnamese speaker recognition are either limited in size or do not focus on genre diversity, leaving studies in multi-genre effects unexplored. This paper introduces VoxVietnam, the first multi-genre dataset for Vietnamese speaker recognition with over 187,000 utterances from 1,406 speakers and an automated pipeline to construct a dataset on a large scale from public sources. Our experiments show the challenges posed by the multi-genre phenomenon to models trained on a single-genre dataset, and demonstrate a significant increase in performance upon incorporating the VoxVietnam into the training process. Our experiments are conducted to study the challenges of the multi-genre phenomenon in speaker recognition and the performance gain when the proposed dataset is used for multi-genre training.
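[NLP-107] MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
[Quick read]: This paper addresses the underexplored ability of foundation models to reason over maps and locations, a capability that matters for optimizing navigation, resource discovery, and logistics. Although foundation models have made notable progress in autonomous tool use and reasoning, their geo-spatial reasoning has not been studied systematically. To fill this gap, the authors propose MapEval, a benchmark that evaluates how models handle complex map-based user queries. MapEval comprises three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning. A comprehensive evaluation of 28 prominent foundation models finds that, although Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro perform competitively overall, all models still fall short of human performance by more than 20% on average and struggle with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's role in advancing general-purpose foundation models with stronger geo-spatial understanding.
Link: https://arxiv.org/abs/2501.00316
Authors: Mahir Labib Dihan,Md Tanvir Hassan,Md Tanvir Parvez,Md Hasebul Hasan,Md Almash Alam,Muhammad Aamir Cheema,Mohammed Eunus Ali,Md Rizwan Parvez
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 40 pages, 21 figures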
Abstract:Recent advancements in foundation models have enhanced AI systems’ capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models’ ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more amplified when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval’s critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
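[NLP-108] Retrieval-Augmented Generation with Graphs (GraphRAG)
[Quick read]: This paper addresses how to apply graph-structured data effectively in Retrieval-Augmented Generation (RAG). Because graph data is heterogeneous and relational, the conventional RAG approach of designing the retriever, generator, and external data sources uniformly in the neural-embedding space does not transfer directly to graphs. The paper proposes a holistic GraphRAG framework defined by its key components: query processor, retriever, organizer, generator, and data source. It then reviews GraphRAG techniques tailored to individual domains, whose graphs exhibit distinct relational patterns and domain-specific knowledge. Finally, it discusses current research challenges and future cross-disciplinary directions for GraphRAG.
Link: https://arxiv.org/abs/2501.00309
Authors: Haoyu Han,Yu Wang,Harry Shomer,Kai Guo,Jiayuan Ding,Yongjia Lei,Mahantesh Halappanavar,Ryan A. Rossi,Subhabrata Mukherjee,Xianfeng Tang,Qi He,Zhigang Hua,Bo Long,Tong Zhao,Neil Shah,Amin Javari,Yinglong Xia,Jiliang Tang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: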
Abstract:Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic “nodes connected by edges” nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at this https URL.
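[NLP-109] LLM-Rubric: A Multidimensional Calibrated Approach to Automated Evaluation of Natural Language Texts DATE
[Quick read]: This paper addresses the automated evaluation of natural language texts, in particular how to predict human judges' ratings accurately across multiple dimensions. The approach starts from a manually constructed rubric: a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. Although LLM predictions often disagree with human judges, the multiple LLM distributions can be combined by training a small feed-forward neural network, with both judge-specific and judge-independent parameters, to predict each judge's annotations on all questions, including a summary question that assesses overall quality or relevance. When evaluating dialogue systems in a human-AI information-seeking task, LLM-Rubric with 9 questions (covering dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1 to 4, with an RMS error of 0.5, a 2× improvement over the uncalibrated baseline.
Link: https://arxiv.org/abs/2501.00274
Authors: Helia Hashemi,Jason Eisner,Corby Rosset,Benjamin Van Durme,Chris Kedzie
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Updated version of 17 June 2024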
Abstract:This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges – indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be combined to predict each human judge’s annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges’ assessment of overall user satisfaction, on a scale of 1–4, with RMS error 0.5, a 2× improvement over the uncalibrated baseline.
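[NLP-110] Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs
[Quick read]: This paper examines the diversity of creative content generated by current large language models (LLMs), specifically whether these models offer ideas diverse enough to bolster collective creativity. Analyzing story generation by GPT-4 and LLaMA-3, the study finds that LLM-generated stories often contain plot elements that recur across generations. To quantify this, the authors introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines generated by the same LLM. An evaluation on 100 short stories shows that LLM-generated stories often contain combinations of idiosyncratic plot elements that are echoed frequently across generations, whereas original human-written stories are rarely recreated or even partially echoed. Human evaluation further shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgments of surprise, even though the score is computed fully automatically. The key contribution is the Sui Generis score itself, which quantifies the diversity of LLM output and exposes its limits in creative generation.
Link: https://arxiv.org/abs/2501.00273
Authors: Weijia Xu,Nebojsa Jojic,Sudha Rao,Chris Brockett,Bill Dolan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: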
Abstract:With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster the collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across a number of generations. To quantify this phenomenon, we introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines generated by the same LLM. Evaluating on 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations, while the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgment of surprise level, even though score computation is completely automatic without relying on human judgment.
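The abstract does not give the formula for the Sui Generis score. The sketch below is a simplified stand-in under stated assumptions: it takes, for one plot element, a list of flags saying whether each alternative continuation from the same LLM echoes that element, and returns the negative log of the smoothed echo frequency. How an echo is detected is left out, and the function name is hypothetical.

```python
import math


def sui_generis_sketch(echo_flags, smoothing=1.0):
    """Negative log of the smoothed share of alternatives that echo the element.

    Higher values mean the plot element is rarer among alternative storylines.
    """
    n = len(echo_flags)
    echoed = sum(1 for flag in echo_flags if flag)
    probability = (echoed + smoothing) / (n + 2 * smoothing)
    return -math.log(probability)
```

The add-one style smoothing keeps the score finite when an element is never echoed, which would otherwise require taking the logarithm of zero.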
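[NLP-111] A review of faithfulness metrics for hallucination assessment in Large Language Models
[Quick read]: This review examines how faithfulness has been evaluated in open-ended summarization, question-answering, and machine translation tasks. It finds that using large language models (LLMs) as faithfulness evaluators is commonly the metric most highly correlated with human judgement. It also discusses how other studies have mitigated hallucinations: both Retrieval Augmented Generation (RAG) and prompting-framework approaches have been linked with superior faithfulness. The review stresses that faithfulness research is integral to the continued widespread use of LLMs, since unfaithful responses can pose major risks. Evaluating open-ended generation also gives a more comprehensive measure of LLM performance than multiple-choice benchmarking, which can help build trust in LLMs.
Link: https://arxiv.org/abs/2501.00269
Authors: Ben Malin,Tatiana Kalganova,Nikoloas Boulgouris
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 6 tables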
Abstract:This review examines the means by which faithfulness has been evaluated across open-ended summarization, question-answering and machine translation tasks. We find that the use of LLMs as a faithfulness evaluator is commonly the metric that is most highly correlated with human judgement. The means by which other studies have mitigated hallucinations are discussed, with both retrieval augmented generation (RAG) and prompting framework approaches having been linked with superior faithfulness, whilst other recommendations for mitigation are provided. Research into faithfulness is integral to the continued widespread use of LLMs, as unfaithful responses can pose major risks to many areas whereby LLMs would otherwise be suitable. Furthermore, evaluating open-ended generation provides a more comprehensive measure of LLM performance than commonly used multiple-choice benchmarking, which can help in advancing the trust that can be placed within LLMs.
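[NLP-112] EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta
[Quick read]: This paper addresses shortcomings in how large language models (LLMs) are evaluated on open-ended reasoning tasks, in particular the fluency bias of existing methods and their heavy reliance on multiple-choice formats when assessing factual accuracy and complex reasoning. To close these gaps, it proposes the EQUATOR evaluation framework (Evaluation of Question Answering Thoroughness in Open-ended Reasoning), which combines deterministic scoring with a focus on factual accuracy and reasoning assessment. EQUATOR uses a vector database to pair open-ended questions with human-evaluated answers, enabling more precise and scalable evaluation. The paper also introduces an automated evaluation process based on smaller, locally hosted LLMs (LLaMA 3.2B in the authors' setup), which further reduces reliance on human evaluators and improves scalability. The key to the solution is combining automated scoring with human-evaluated reference answers.
Link: https://arxiv.org/abs/2501.00257
Authors: Raymond Bernard,Shaina Raza(PhD),Subhabrata Das(PhD),Rahul Murugan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: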
Abstract:Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning effectively. LLMs thus frequently generate factually inaccurate responses, especially in complex reasoning tasks, highlighting two prominent challenges: (1) the inadequacy of existing methods to evaluate reasoning and factual accuracy effectively, and (2) the reliance on human evaluators for nuanced judgment, as illustrated by Williams and Huckle (2024)[1], who found manual grading indispensable despite automated grading advancements. To address evaluation gaps in open-ended reasoning tasks, we introduce the EQUATOR Evaluator (Evaluation of Question Answering Thoroughness in Open-ended Reasoning). This framework combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. In practice, EQUATOR significantly reduces reliance on human evaluators for scoring and improves scalability compared to Williams and Huckle’s (2024)[1] methods. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards. Additionally, we introduce an automated evaluation process leveraging smaller, locally hosted LLMs. We used LLaMA 3.2B, running on the Ollama binaries to streamline our assessments. This work establishes a new paradigm for evaluating LLM performance, emphasizing factual accuracy and reasoning ability, and provides a robust methodological foundation for future research.
Subjects: Computation and Language (cs.CL). MSC classes: 68T20. ACM classes: I.2.7; I.2.6; H.3.3. Cite as: arXiv:2501.00257 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2501.00257
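[NLP-113] Automatically Planning Optimal Parallel Strategy for Large Language Models
[Quick read]: This paper addresses how to use computing resources efficiently for parallel training of large Transformer-based language models as both parameter counts and cluster sizes grow. The key to the solution is an automatic parallel algorithm that plans the parallel strategy with maximum throughput from model and hardware information. By decomposing training time into computation, communication, and overlap, the authors build a training-duration simulation model, and then use it to prune the space of parallel strategies and shorten the search. Multi-node experiments show that the algorithm estimates parallel training duration in real time with an average accuracy of 96%, and that in the authors' tests its recommended strategy was always globally optimal.
Link: https://arxiv.org/abs/2501.00254
Authors: Zongbiao Li(1),Xiezhao Li(1),Yinghao Cui(1),Yijun Chen(1),Zhixuan Gu(1),Yuxuan Liu(1),Wenbo Zhu(1),Fei Jia(1),Ke Liu(1),Qifeng Li(1),Junyao Zhan(1),Jiangtao Zhou(1),Chenxi Zhang(1),Qike Liu(1) ((1) HUAWEI)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: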
Abstract:The number of parameters in large-scale language models based on transformers is gradually increasing, and the scale of computing clusters is also growing. The technology of quickly mobilizing large amounts of computing resources for parallel computing is becoming increasingly important. In this paper, we propose an automatic parallel algorithm that automatically plans the parallel strategy with maximum throughput based on model and hardware information. By decoupling the training time into computation, communication, and overlap, we established a training duration simulation model. Based on this simulation model, we prune the parallel solution space to shorten the search time required. The multi-node experiment results show that the algorithm can estimate the parallel training duration in real time with an average accuracy of 96%. In our test, the recommendation strategy provided by the algorithm is always globally optimal.
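The search described above can be sketched in a few lines. Everything here is an illustrative assumption rather than the paper's model: the step time is taken as computation plus communication minus the overlapped part, strategies are (tensor, pipeline, data) parallel degrees whose product equals the device count, pruning is a simple cap on the tensor-parallel degree, and `toy_cost_model` uses made-up numbers.

```python
from itertools import product


def step_time(compute, communication, overlap):
    """Simulated step time: overlapped communication is not paid for twice."""
    return compute + communication - min(overlap, compute, communication)


def candidate_strategies(num_devices, max_tensor_parallel):
    """Enumerate (tp, pp, dp) with tp * pp * dp == num_devices, pruning large tp."""
    for tp, pp in product(range(1, num_devices + 1), repeat=2):
        if tp > max_tensor_parallel or num_devices % (tp * pp) != 0:
            continue  # pruned: tp above the cap, or devices do not divide evenly
        yield tp, pp, num_devices // (tp * pp)


def toy_cost_model(tp, pp, dp):
    """Hypothetical cost model with made-up constants, for illustration only."""
    compute = 100.0 / (tp * dp) + 5.0 * (pp - 1)      # work split plus pipeline bubble
    communication = 4.0 * (tp - 1) + 2.0 * (dp - 1)   # tensor and data-parallel traffic
    overlap = 0.5 * communication                     # half of the traffic is hidden
    return compute, communication, overlap


def best_strategy(num_devices, max_tensor_parallel, cost_model):
    """Return the candidate strategy with the smallest simulated step time."""
    return min(
        candidate_strategies(num_devices, max_tensor_parallel),
        key=lambda strategy: step_time(*cost_model(*strategy)),
    )
```

A realistic cost model would be fitted to measured kernel and collective-communication timings on the target hardware; the point of the sketch is only the structure of simulate, prune, and select.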
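[NLP-114] Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
[Quick read]: This paper addresses the lack of factual accuracy in text generated by large language models (LLMs) and examines how well the Structural Knowledge Prompting (SKP) paradigm generalizes. SKP integrates external knowledge into LLMs through structural representations and achieves state-of-the-art results on many knowledge-intensive tasks, but existing methods tend to focus on specific problems and lack a comprehensive exploration of SKP's generalization and capability boundaries. The paper evaluates and rethinks SKP's generalization from four perspectives: Granularity, Transferability, Scalability, and Universality. For a thorough evaluation, the authors introduce SUBARU, a multi-granular, multi-level benchmark of 9 tasks with varying granularity and difficulty.
Link: https://arxiv.org/abs/2501.00244
Authors: Yichi Zhang,Zhuo Chen,Lingbing Guo,Yajing Xu,Shaokai Chen,Mengshu Sun,Binbin Hu,Zhiqiang Zhang,Lei Liang,Wen Zhang,Huajun Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Work in progress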
Abstract:Large language models (LLMs) have demonstrated exceptional performance in text generation within current NLP research. However, the lack of factual accuracy is still a dark cloud hanging over the LLM skyscraper. Structural knowledge prompting (SKP) is a prominent paradigm to integrate external knowledge into LLMs by incorporating structural representations, achieving state-of-the-art results in many knowledge-intensive tasks. However, existing methods often focus on specific problems, lacking a comprehensive exploration of the generalization and capability boundaries of SKP. This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.
[NLP-115] Exploring Variability in Fine-Tuned Models for Text Classification with DistilBERT
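Quick read: This paper evaluates fine-tuning strategies for text classification with the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, it examines how hyperparameters such as learning rate, batch size, and number of epochs affect accuracy, F1-score, and loss. The key approach is polynomial regression analysis, which captures the foundational and incremental impacts of these hyperparameters, with a focus on fine-tuning adjustments relative to a baseline model. The results show that hyperparameter configurations cause notable variability in the metrics and reveal trade-offs among them. For example, a higher learning rate reduces loss in the relative analysis (p=0.027) but makes accuracy improvements harder; batch size significantly affects accuracy and F1-score (p=0.028 and p=0.005) but has limited influence on loss optimization (p=0.170). In addition, the interaction between epochs and batch size maximizes F1-score (p=0.001), underscoring the importance of hyperparameter interplay. These findings indicate that fine-tuning strategies need to account for non-linear hyperparameter interactions to balance performance across metrics.
Link: https://arxiv.org/abs/2501.00241
Authors: Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar, Donald Cowan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: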
Abstract:This study evaluates fine-tuning strategies for text classification using the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, we examine the influence of hyperparameters such as learning rate, batch size, and epochs on accuracy, F1-score, and loss. Polynomial regression analyses capture foundational and incremental impacts of these hyperparameters, focusing on fine-tuning adjustments relative to a baseline model. Results reveal variability in metrics due to hyperparameter configurations, showing trade-offs among performance metrics. For example, a higher learning rate reduces loss in relative analysis (p=0.027) but challenges accuracy improvements. Meanwhile, batch size significantly impacts accuracy and F1-score in absolute regression (p=0.028 and p=0.005) but has limited influence on loss optimization (p=0.170). The interaction between epochs and batch size maximizes F1-score (p=0.001), underscoring the importance of hyperparameter interplay. These findings highlight the need for fine-tuning strategies addressing non-linear hyperparameter interactions to balance performance across metrics. Such variability and metric trade-offs are relevant for tasks beyond text classification, including NLP and computer vision. This analysis informs fine-tuning strategies for large language models and promotes adaptive designs for broader model applicability.
[NLP-116] Zero-Shot Strategies for Length-Controllable Summarization
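Quick read: This paper addresses the difficulty large language models (LLMs) have in precisely controlling the length of generated text in zero-shot settings. Experiments with LLaMA 3 show stark differences in length adherence across length measures, along with inherent model biases. To address this, the paper proposes a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. Combined, these methods substantially improve length compliance while maintaining or improving summary quality, yielding effective zero-shot strategies that require no fine-tuning or architectural changes. The work both deepens the understanding of LLM behavior in controlled text generation and paves the way for more reliable and adaptable summarization systems in real-world applications.
Link: https://arxiv.org/abs/2501.00233
Authors: Fabian Retkowski, Alexander Waibel
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: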
Abstract:Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs’ length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model. To address these challenges, we introduce a set of methods: length approximation, target adjustment, sample filtering, and automated revisions. By combining these methods, we demonstrate substantial improvements in length compliance while maintaining or enhancing summary quality, providing highly effective zero-shot strategies for precise length control without the need for model fine-tuning or architectural changes. With our work, we not only advance our understanding of LLM behavior in controlled text generation but also pave the way for more reliable and adaptable summarization systems in real-world applications.
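Two of the method names in the abstract, target adjustment and sample filtering, lend themselves to a small illustration. The sketch below is one plausible reading of those names and is not taken from the paper: the functions, the `observed_ratio` parameter, and the choice of word count as the length measure are all assumptions.

```python
def adjusted_target(target_words, observed_ratio):
    """Target adjustment (assumed form): compensate for a known length bias.

    observed_ratio is the typical (produced length / requested length);
    a model that overshoots by 25% has observed_ratio = 1.25.
    """
    return max(1, round(target_words / observed_ratio))


def pick_closest(candidates, target_words):
    """Sample filtering (assumed form): keep the sample nearest the target."""
    return min(candidates, key=lambda s: abs(len(s.split()) - target_words))
```

In use, one would request `adjusted_target(100, 1.25)` words from the model, draw several samples, and pass them to `pick_closest(samples, 100)`.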
[NLP-117] Generative Emergent Communication: Large Language Model is a Collective World Model
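Quick read: This paper asks how the emergence of language and symbol systems can be understood within a single unified theoretical framework, particularly in multi-agent reinforcement learning (MARL) settings. The key to its solution is a framework called generative emergent communication (generative EmCom), which connects emergent communication, world models, and large language models (LLMs) through the lens of collective predictive coding (CPC). Specifically, the framework formalizes the emergence of language through decentralized Bayesian inference across multiple agents, going beyond conventional emergent-communication approaches based on discriminative models. The paper makes two key contributions. First, it proposes generative EmCom as a new framework for understanding emergent communication, showing how communication emergence in MARL can be derived from control as inference and clarifying its relationship to conventional discriminative approaches. Second, it proposes a mathematical formulation that interprets LLMs as collective world models that integrate multiple agents' experiences through CPC. The framework provides a unified theoretical foundation for understanding how shared symbol systems emerge through collective predictive coding, bridging individual cognitive development and societal language evolution.
Link: https://arxiv.org/abs/2501.00226
Authors: Tadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura, Masahiro Suzuki, Akira Taniguchi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: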
Abstract:This study proposes a unifying theoretical framework called generative emergent communication (generative EmCom) that bridges emergent communication, world models, and large language models (LLMs) through the lens of collective predictive coding (CPC). The proposed framework formalizes the emergence of language and symbol systems through decentralized Bayesian inference across multiple agents, extending beyond conventional discriminative model-based approaches to emergent communication. This study makes the following two key contributions: First, we propose generative EmCom as a novel framework for understanding emergent communication, demonstrating how communication emergence in multi-agent reinforcement learning (MARL) can be derived from control as inference while clarifying its relationship to conventional discriminative approaches. Second, we propose a mathematical formulation showing the interpretation of LLMs as collective world models that integrate multiple agents’ experiences through CPC. The framework provides a unified theoretical foundation for understanding how shared symbol systems emerge through collective predictive coding processes, bridging individual cognitive development and societal language evolution. Through mathematical formulations and discussion on prior works, we demonstrate how this framework explains fundamental aspects of language emergence and offers practical insights for understanding LLMs and developing sophisticated AI systems for improving human-AI interaction and multi-agent systems.
[NLP-118] Extracting effective solutions hidden in large language models via generated comprehensive specialists: case studies in developing electronic devices
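Quick read: This paper addresses how to use large language models (LLMs) to generate effective cross-disciplinary solutions, particularly for complex, interdisciplinary research and development challenges. Existing knowledge related to a problem often does not directly yield a solution, so a systematic way of integrating knowledge from different disciplines is needed to produce breakthrough solutions. The key to the proposed SELLM (Solution Enumeration via comprehensive List and LLM) framework is to combine the broad knowledge of LLMs with lists that follow MECE (Mutually Exclusive, Collectively Exhaustive) principles, such as the International Patent Classification (IPC) and the periodic table of elements, to systematically construct expert agents that generate effective cross-disciplinary solutions. Applied to real challenges, namely improving light extraction efficiency in organic light-emitting diode (OLED) lighting and developing electrodes for next-generation memory materials, SELLM shows a marked advantage in generating effective solutions.
Link: https://arxiv.org/abs/2501.00224
Authors: Hikari Tomita, Nobuhiro Nakamura, Shoichi Ishida, Toshio Kamiya, Kei Terayama
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 4 figures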
Abstract:Recently, many studies have increasingly explored the use of large language models (LLMs) to generate research ideas and scientific hypotheses. However, real-world research and development often require solving complex, interdisciplinary challenges where solutions may not be readily found through existing knowledge related to the problem. Therefore, it is desirable to leverage the vast, comprehensive knowledge of LLMs to generate effective, breakthrough solutions by integrating various perspectives from other disciplines. Here, we propose SELLM (Solution Enumeration via comprehensive List and LLM), a framework leveraging LLMs and structured guidance using MECE (Mutually Exclusive, Collectively Exhaustive) principles, such as International Patent Classification (IPC) and the periodic table of elements. SELLM systematically constructs comprehensive expert agents from the list to generate cross-disciplinary and effective solutions. To evaluate SELLM’s practicality, we applied it to two challenges: improving light extraction in organic light-emitting diode (OLED) lighting and developing electrodes for next-generation memory materials. The results demonstrate that SELLM significantly facilitates the generation of effective solutions compared to cases without specific customization or effort, showcasing the potential of SELLM to enable LLMs to generate effective solutions even for challenging problems.
[NLP-119] An Empirical Evaluation of Large Language Models on Consumer Health Questions
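Quick read: This paper evaluates several large language models (LLMs) on MedRedQA, a dataset of consumer medical questions with answers from verified experts, extracted from the AskDocs subreddit. Although LLMs perform well on clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-oriented medical questions remains unclear. MedRedQA poses distinctive challenges, such as informal language and the need for precise responses to non-specialist queries. To assess performance, the study generates responses with five LLMs (GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash) and uses a cross-evaluation method in which each model judges its own responses as well as those of the other models, to reduce bias. According to four of the five model judges, GPT-4o mini aligns most closely with expert responses, while Mistral-7B scores lowest according to three of the five. The study highlights the potential and limitations of current LLMs for consumer health question answering and points to directions for further development.
Link: https://arxiv.org/abs/2501.00208
Authors: Moaiz Abrar, Yusuf Sermet, Ibrahim Demir
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: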
Abstract:This study evaluates the performance of several Large Language Models (LLMs) on MedRedQA, a dataset of consumer-based medical questions and answers by verified experts extracted from the AskDocs subreddit. While LLMs have shown proficiency in clinical question answering (QA) benchmarks, their effectiveness on real-world, consumer-based, medical questions remains less understood. MedRedQA presents unique challenges, such as informal language and the need for precise responses suited to non-specialist queries. To assess model performance, responses were generated using five LLMs: GPT-4o mini, Llama 3.1: 70B, Mistral-123B, Mistral-7B, and Gemini-Flash. A cross-evaluation method was used, where each model evaluated its responses as well as those of others to minimize bias. The results indicated that GPT-4o mini achieved the highest alignment with expert responses according to four out of the five models’ judges, while Mistral-7B scored lowest according to three out of five models’ judges. This study highlights the potential and limitations of current LLMs for consumer health medical question answering, indicating avenues for further development.
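The cross-evaluation described above can be pictured as a judge-by-candidate score matrix from which each judge's top pick is read off. The sketch below only illustrates that bookkeeping; the model names, the scores, and the function names are invented placeholders, not data or code from the study.

```python
from collections import Counter

# Hypothetical scores: EXAMPLE_SCORES[judge][candidate] is the alignment
# score the judge assigned to the candidate's responses (self-judging included).
EXAMPLE_SCORES = {
    "model_a": {"model_a": 4.0, "model_b": 3.0, "model_c": 2.0},
    "model_b": {"model_a": 4.5, "model_b": 3.5, "model_c": 2.5},
    "model_c": {"model_a": 3.0, "model_b": 3.2, "model_c": 3.1},
}


def top_by_judge(scores):
    """Map each judge to the candidate it scored highest."""
    return {judge: max(row, key=row.get) for judge, row in scores.items()}


def vote_counts(scores):
    """Count how many judges ranked each candidate first."""
    return Counter(top_by_judge(scores).values())
```

A statement such as "best according to four of the five judges" corresponds to a candidate whose entry in `vote_counts(scores)` equals 4 when five judges are present.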
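[NLP-120] GPT-4 on Clinic Depression Assessment: An LLM-Based Pilot Study
Quick read: This paper targets two key obstacles to early detection of clinical depression: the shortage of specialized personnel, and the fact that traditional diagnosis is time-consuming and dependent on experts. To address them, the authors explore using GPT-4 for clinical depression assessment based on transcript analysis. The key to the approach is tuning prompt complexity and temperature settings to optimize GPT-4's classification performance. The results show that GPT-4 achieves higher accuracy and F1-score with complex prompts at low temperature values (0.0-0.2). Beyond a threshold (temperature = 0.3), however, the relationship between randomness and performance becomes unpredictable, and the benefit of prompt complexity diminishes. Prompt engineering and careful calibration of model parameters are therefore essential for consistent results when GPT-4 is used for clinical assessment.
Link: https://arxiv.org/abs/2501.00199
Authors: Giuliano Lorenzoni, Pedro Elkind Velmovitsky, Paulo Alencar, Donald Cowan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: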
Abstract:Depression has impacted millions of people worldwide and has become one of the most prevalent mental disorders. Early mental disorder detection can lead to cost savings for public health agencies and avoid the onset of other major comorbidities. Additionally, the shortage of specialized personnel is a critical issue because clinical depression diagnosis is highly dependent on expert professionals and is time consuming. In this study, we explore the use of GPT-4 for clinical depression assessment based on transcript analysis. We examine the model’s ability to classify patient interviews into binary categories: depressed and not depressed. A comparative analysis is conducted considering prompt complexity (e.g., using both simple and complex prompts) as well as varied temperature settings to assess the impact of prompt complexity and randomness on the model’s performance. Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-Score across configurations, with optimal performance observed at lower temperature values (0.0-0.2) for complex prompts. However, beyond a certain threshold (temperature = 0.3), the relationship between randomness and performance becomes unpredictable, diminishing the gains from prompt complexity. These findings suggest that, while GPT-4 shows promise for clinical assessment, the configuration of the prompts and model parameters requires careful calibration to ensure consistent results. This preliminary study contributes to understanding the dynamics between prompt engineering and large language models, offering insights for future development of AI-powered tools in clinical settings. 
[NLP-121] MLLM-as-a-Judge for Image Safety without Human Labeling
Quick read: This paper asks how pre-trained Multimodal Large Language Models (MLLMs) can be used in a zero-shot setting to detect unsafe images, such as images containing sexual or violent content. Existing methods typically fine-tune MLLMs on human-labeled datasets, which is expensive, labor-intensive, and hard to keep up to date when safety rules change frequently. The key to the proposed solution is to objectify the safety rules, assess the relevance between rules and images, make quick judgments based on debiased token probabilities using logically complete yet simplified precondition chains, and, when necessary, reason more deeply through cascaded chain-of-thought processes. Experimental results show that the method is highly effective on zero-shot image safety judgment tasks.
Link: https://arxiv.org/abs/2501.00192
Authors: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which however brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.
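[NLP-122] The Text Classification Pipeline: Starting Shallow going Deeper
Quick read: This monograph examines the entire text classification (TC) pipeline in depth and evaluates the impact of each component on the overall performance of TC models. Text classification is a core task in Natural Language Processing (NLP) and has advanced considerably in recent years, driven by deep learning. The key contribution is a comprehensive analysis of the TC pipeline, covering state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, current results, and future trends. Alongside detailed descriptions of technical innovations at each stage, the work assesses the strengths and weaknesses of different classification strategies through comparative analyses, case studies, and experimental evaluations. These contributions go beyond a conventional survey and offer a detailed, insightful exploration of TC.
Link: https://arxiv.org/abs/2501.00174
Authors: Marco Siino, Ilenia Tinnirello, Marco La Cascia
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: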
Abstract:Text Classification (TC) stands as a cornerstone within the realm of Natural Language Processing (NLP), particularly when viewed through the lens of computer science and engineering. The past decade has seen deep learning revolutionize TC, propelling advancements in text retrieval, categorization, information extraction, and summarization. The scholarly literature is rich with datasets, models, and evaluation criteria, with English being the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others. The efficacy of TC models relies heavily on their ability to capture intricate textual relationships and nonlinear correlations, necessitating a comprehensive examination of the entire TC pipeline. This monograph provides an in-depth exploration of the TC pipeline, with a particular emphasis on evaluating the impact of each component on the overall performance of TC models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, current results and future trends. Each chapter meticulously examines these stages, presenting technical innovations and significant recent findings. The work critically assesses various classification strategies, offering comparative analyses, examples, case studies, and experimental evaluations. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of TC.
[NLP-123] DeepLL: Considering Linear Logic for the Analysis of Deep Learning Experiments
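Quick read: This paper addresses the correctness of data handling in deep learning experiments and the efficient use of APIs that interact with hardware accelerators. Specifically, software mistakes can contaminate experimental data and lead to incorrect results, while poorly coded APIs can lead to inefficient use of hardware resources and untrustworthy conclusions. The paper proposes analyzing deep learning experiments with Linear Logic, using its primitives and operators to express an abstract representation of an experiment's control flow, the available experimental resources (such as API calls that interact with underlying data structures and hardware), and reasoning rules about the correct consumption of resources during an experiment. The model is lightweight and easy to understand, with both a symbolic and a visual component, and its artifacts are themselves proofs in Linear Logic that off-the-shelf reasoners can verify. The key to the solution is using the formalism of Linear Logic to ensure that experimental resources are managed and used correctly.
Link: https://arxiv.org/abs/2501.00169
Authors: Nick Papoulias
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 8 pages, 3 figures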
Abstract:Deep Learning experiments have critical requirements regarding the careful handling of their datasets as well as the efficient and correct usage of APIs that interact with hardware accelerators. On the one hand, software mistakes during data handling can contaminate experiments and lead to incorrect results. On the other hand, poorly coded APIs that interact with the hardware can lead to sub-optimal usage and untrustworthy conclusions. In this work we investigate the use of Linear Logic for the analysis of Deep Learning experiments. We show that primitives and operators of Linear Logic can be used to express: (i) an abstract representation of the control flow of an experiment, (ii) a set of available experimental resources, such as API calls to the underlying data-structures and hardware as well as (iii) reasoning rules about the correct consumption of resources during experiments. Our proposed model is not only lightweight but also easy to comprehend having both a symbolic and a visual component. Finally, its artifacts are themselves proofs in Linear Logic that can be readily verified by off-the-shelf reasoners.
[NLP-124] Measuring Large Language Models Capacity to Annotate Journalistic Sourcing
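Quick read: This paper examines the application of large language models (LLMs) to journalism, specifically their ability to identify and annotate sourcing in news stories. Sourcing is a crucial pillar of all original journalistic output, yet evaluation scenarios for journalistic sourcing and ethics remain underdeveloped. The paper proposes a five-category schema inspired by journalism studies (Gans, 2004) for evaluating how well LLMs identify and annotate sourcing in news stories. The key to the approach is a systematic benchmarking setup, consisting of a use case, a dataset, and metrics, for assessing how well LLMs identify source types and source justifications. Initial results indicate that LLMs still have room to improve in identifying all sourced statements and matching source types, and that spotting source justifications is harder still.
Link: https://arxiv.org/abs/2501.00164
Authors: Subramaniam Vincent, Phoebe Wang, Zhan Shi, Sahas Koka, Yi Fang
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: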
Abstract:Since the launch of ChatGPT in late 2022, the capacities of Large Language Models and their evaluation have been in constant discussion and evaluation both in academic research and in the industry. Scenarios and benchmarks have been developed in several areas such as law, medicine and math (Bommasani et al., 2023) and there is continuous evaluation of model variants. One area that has not received sufficient scenario development attention is journalism, and in particular journalistic sourcing and ethics. Journalism is a crucial truth-determination function in democracy (Vincent, 2023), and sourcing is a crucial pillar to all original journalistic output. Evaluating the capacities of LLMs to annotate stories for the different signals of sourcing and how reporters justify them is a crucial scenario that warrants a benchmark approach. It offers potential to build automated systems to contrast more transparent and ethically rigorous forms of journalism with everyday fare. In this paper we lay out a scenario to evaluate LLM performance on identifying and annotating sourcing in news stories on a five-category schema inspired by journalism studies (Gans, 2004). We offer the use case, our dataset and metrics as the first step towards systematic benchmarking. Our accuracy findings indicate LLM-based approaches have more catching up to do in identifying all the sourced statements in a story, and equally, in matching the type of sources. An even harder task is spotting source justifications.
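[NLP-125] Temporal reasoning for timeline summarisation in social media
Quick read: This paper explores whether enhancing the temporal reasoning capabilities of Large Language Models (LLMs) can improve the quality of timeline summarization, the task of summarizing long texts that contain sequences of events, such as social media threads. It introduces NarrativeReason, a new dataset focused on temporal relationships among sequential events within narratives, in contrast to existing temporal reasoning datasets, which mainly cover pair-wise event relationships. The key to the approach is combining temporal reasoning with timeline summarization through a knowledge distillation framework: a teacher model is first fine-tuned on temporal reasoning tasks, and its knowledge is then distilled into a student model that is simultaneously trained for timeline summarization. Experiments show that the model performs strongly on mental health-related timeline summarization, particularly on social media threads with repeated events and mixed emotions, highlighting the value of temporal reasoning for timeline summarization.
Link: https://arxiv.org/abs/2501.00152
Authors: Jiayu Song, Mahmud Akhter, Dana Atzil Slonim, Maria Liakata
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: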
Abstract:This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarization, the task of summarising long texts containing sequences of events, particularly social media threads. We introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarization through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarization. Experimental results demonstrate that our model achieves superior performance on mental health-related timeline summarization tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance of leveraging temporal reasoning to improve timeline summarisation.
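Distilling a teacher into a student while also training the student on its own task is commonly written as a weighted sum of a task loss and a teacher-student divergence. The sketch below shows that generic form only; the abstract does not state the paper's actual objective, so the KL term, the `alpha` weight, the temperature, and all function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_loss(student_logits, teacher_logits, task_loss,
               alpha=0.5, temperature=2.0):
    """Generic distillation objective: task loss plus teacher-student KL."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    distill = kl_divergence(p_teacher, p_student)
    return alpha * task_loss + (1 - alpha) * distill
```

Here `task_loss` stands for the student's summarization loss on a batch, and the logits stand for teacher and student outputs on the temporal reasoning input; when the two sets of logits agree, the distillation term vanishes and only the weighted task loss remains.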
[NLP-126] A Data-Centric Approach to Detecting and Mitigating Demographic Bias in Pediatric Mental Health Text: A Case Study in Anxiety Detection
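Quick read: This paper addresses bias in healthcare AI models for pediatric mental health screening that arises from non-biological differences (such as gender) in the training data. Specifically, the study focuses on detecting and mitigating the effect of gender-related linguistic differences on model predictions. The key to the solution is a data-centric de-biasing method that neutralizes biased terms while retaining salient clinical information. Tested on an automatic anxiety detection model, the method significantly reduced gender-related diagnostic bias and improved the model's equity across demographic groups.
Link: https://arxiv.org/abs/2501.00129
Authors: Julia Ive, Paulina Bondaronek, Vishal Yadav, Daniel Santel, Tracy Glauser, Tina Cheng, Jeffrey R. Strawn, Greeshma Agasthya, Jordan Tschida, Sanghyun Choo, Mayanka Chandrashekar, Anuj J. Kapadia, John Pestian
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: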
Abstract:Introduction: Healthcare AI models often inherit biases from their training data. While efforts have primarily targeted bias in structured data, mental health heavily depends on unstructured data. This study aims to detect and mitigate linguistic differences related to non-biological differences in the training data of AI models designed to assist in pediatric mental health screening. Our objectives are: (1) to assess the presence of bias by evaluating outcome parity across sex subgroups, (2) to identify bias sources through textual distribution analysis, and (3) to develop a de-biasing method for mental health text data. Methods: We examined classification parity across demographic groups and assessed how gendered language influences model predictions. A data-centric de-biasing method was applied, focusing on neutralizing biased terms while retaining salient clinical information. This methodology was tested on a model for automatic anxiety detection in pediatric patients. Results: Our findings revealed a systematic under-diagnosis of female adolescent patients, with a 4% lower accuracy and a 9% higher False Negative Rate (FNR) compared to male patients, likely due to disparities in information density and linguistic differences in patient notes. Notes for male patients were on average 500 words longer, and linguistic similarity metrics indicated distinct word distributions between genders. Implementing our de-biasing approach reduced diagnostic bias by up to 27%, demonstrating its effectiveness in enhancing equity across demographic groups. Discussion: We developed a data-centric de-biasing framework to address gender-based content disparities within clinical text. By neutralizing biased language and enhancing focus on clinically essential information, our approach demonstrates an effective strategy for mitigating bias in AI healthcare models trained on text.
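The two measurable pieces of this abstract, the False Negative Rate per subgroup and the neutralization of gendered terms, can be sketched in a few lines. This is a toy illustration under stated assumptions: the `NEUTRAL` word list, the replacement choices, and the function names are invented here and are not the authors' de-biasing method.

```python
import re

# Hypothetical replacement table; a real system would need a curated lexicon
# and grammatical handling (for example, "her" as object versus possessive).
NEUTRAL = {"she": "they", "he": "they", "her": "their", "his": "their",
           "girl": "child", "boy": "child"}

_PATTERN = re.compile(r"\b(" + "|".join(NEUTRAL) + r")\b", re.IGNORECASE)


def neutralize(text):
    """Replace whole-word gendered terms with neutral counterparts."""
    return _PATTERN.sub(lambda m: NEUTRAL[m.group(0).lower()], text)


def false_negative_rate(y_true, y_pred):
    """FNR = missed positives / all true positives (0.0 if no positives)."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_on_positives:
        return 0.0
    return preds_on_positives.count(0) / len(preds_on_positives)
```

Computing `false_negative_rate` separately for each sex subgroup before and after applying `neutralize` to the notes gives the kind of parity comparison the abstract reports.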
[NLP-127] CaseSumm: A Large-Scale Dataset for Long-Context Summarization from U.S. Supreme Court Opinions
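Quick read: This paper addresses the lack of long, complex datasets for evaluating long-context summarization in the legal domain. The authors introduce CaseSumm, a dataset of 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as "syllabuses". CaseSumm is the largest open legal case summarization dataset to date and the first to include summaries of SCOTUS decisions dating back to 1815. The paper also evaluates LLM-generated summaries with both automatic metrics and expert human evaluation, revealing discrepancies between the two. The smaller open-source model Mistral 7b outperforms larger models on most automatic metrics and produces syllabus-like summaries, but human evaluators find hallucinations in its output. By contrast, GPT-4 summaries are rated better for clarity, sensitivity, and specificity. The paper further notes that LLM-based evaluation correlates with human evaluation no better than traditional automatic metrics do. These findings expose the limits of current automatic evaluation for legal summarization and underline the critical role of human evaluation in judging summary quality in complex, high-stakes domains.
Link: https://arxiv.org/abs/2501.00097
Authors: Mourad Heddaya, Kyle MacMillan, Anup Malani, Hongyuan Mei, Chenhao Tan
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: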
Abstract:This paper introduces CaseSumm, a novel dataset for long-context summarization in the legal domain that addresses the need for longer and more complex datasets for summarization evaluation. We collect 25.6K U.S. Supreme Court (SCOTUS) opinions and their official summaries, known as “syllabuses.” Our dataset is the largest open legal case summarization dataset, and is the first to include summaries of SCOTUS decisions dating back to 1815. We also present a comprehensive evaluation of LLM-generated summaries using both automatic metrics and expert human evaluation, revealing discrepancies between these assessment methods. Our evaluation shows Mistral 7b, a smaller open-source model, outperforms larger models on most automatic metrics and successfully generates syllabus-like summaries. In contrast, human expert annotators indicate that Mistral summaries contain hallucinations. The annotators consistently rank GPT-4 summaries as clearer and exhibiting greater sensitivity and specificity. Further, we find that LLM-based evaluations are not more correlated with human evaluations than traditional automatic metrics. Furthermore, our analysis identifies specific hallucinations in generated summaries, including precedent citation errors and misrepresentations of case facts. These findings demonstrate the limitations of current automatic evaluation methods for legal summarization and highlight the critical role of human evaluation in assessing summary quality, particularly in complex, high-stakes domains. CaseSumm is available at this https URL
[NLP-128] Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings COLING2025
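Quick read: This paper investigates how Transformer models with causal attention store and process positional information when no explicit positional encoding is used. It proposes and tests a new hypothesis: a Transformer can reconstruct token positions from the similarity between embeddings. Specifically, nearby embeddings are more similar to each other than faraway embeddings, and this pattern appears in both trained and randomly initialized Transformer models over a range of hyperparameters. The finding suggests that Transformers may encode position implicitly through the local similarity of embeddings, which would let them solve position-dependent tasks without explicit positional encodings.
Link: https://arxiv.org/abs/2501.00073
Authors: Chunsheng Zuo, Pavel Guerzhoy, Michael Guerzhoy
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Forthcoming at the International Conference on Computational Linguistics 2025 (COLING 2025)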
Abstract:Transformers with causal attention can solve tasks that require positional information without using positional encodings. In this work, we propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens. We show that this pattern can occur in both the trained and the randomly initialized Transformer models with causal attention and no positional encodings over a common range of hyperparameters.
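One way to build intuition for why nearby embeddings could end up more similar is to replace causal attention with its simplest stand-in, a uniform average over all earlier positions. That simplification is an assumption made here for illustration, not a claim about the paper's analysis; the dimensions, the seed, and the function names are likewise arbitrary.

```python
import math
import random


def cumulative_means(vectors):
    """Position i receives the mean of vectors 0..i (uniform causal mixing)."""
    running = [0.0] * len(vectors[0])
    outputs = []
    for count, vec in enumerate(vectors, start=1):
        running = [r + v for r, v in zip(running, vec)]
        outputs.append([r / count for r in running])
    return outputs


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
# 100 random token embeddings of dimension 64, with no positional encoding.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(100)]
states = cumulative_means(tokens)

near = cosine(states[49], states[50])  # adjacent positions
far = cosine(states[9], states[49])    # positions 40 steps apart
```

Because adjacent positions average almost the same set of tokens, `near` comes out much larger than `far`, so relative position is recoverable from similarity alone in this toy setting.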
[NLP-129] ICLR: In-Context Learning of Representations
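Quick read: This paper studies how the semantics defined by pretraining data shape the organization of concept representations in a large language model (LLM), and whether, given in-context exemplars, the model alters these pretrained semantics to adopt new, context-specified ones. Specifically, the authors design a "graph tracing" task and analyze whether the model reorganizes its representations according to the new semantics after seeing in-context exemplars. They find that as the amount of context grows, the representations shift suddenly from pretrained semantic representations to in-context representations aligned with the graph structure. When the reference concepts are semantically correlated, the context-specified graph structure is still present in the representations but cannot fully displace the pretrained structure. The paper explains these results by analogy with energy minimization, suggesting that the model may infer context-specified semantics through an implicit optimization process. Overall, the results indicate that scaling context size can flexibly reorganize model representations, possibly unlocking new capabilities.
Link: https://arxiv.org/abs/2501.00070
Authors: Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, Hidenori Tanaka
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint (Under Review)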
Abstract:Recent work has demonstrated that semantics specified by pretraining data influence how representations of different concepts are organized in a large language model (LLM). However, given the open-ended nature of LLMs, e.g., their ability to in-context learn, we can ask whether models alter these pretraining semantics to adopt alternative, context-specified ones. Specifically, if we provide in-context exemplars wherein a concept plays a different role than what the pretraining data suggests, do models reorganize their representations in accordance with these novel semantics? To answer this question, we take inspiration from the theory of conceptual role semantics and define a toy “graph tracing” task wherein the nodes of the graph are referenced via concepts seen during training (e.g., apple, bird, etc.) and the connectivity of the graph is defined via some predefined structure (e.g., a square grid). Given exemplars that indicate traces of random walks on the graph, we analyze intermediate representations of the model and find that as the amount of context is scaled, there is a sudden re-organization from pretrained semantic representations to in-context representations aligned with the graph structure. Further, we find that when reference concepts have correlations in their semantics (e.g., Monday, Tuesday, etc.), the context-specified graph structure is still present in the representations, but is unable to dominate the pretrained structure. To explain these results, we analogize our task to energy minimization for a predefined graph topology, providing evidence towards an implicit optimization process to infer context-specified semantics. Overall, our findings indicate scaling context-size can flexibly re-organize model representations, possibly unlocking novel capabilities.
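The "graph tracing" setup, in which familiar words label the nodes of a square grid and the context consists of random-walk traces, can be mocked up as follows. This is a minimal sketch of how such exemplars might be generated; the word list, grid size, walk length, and function names are assumptions and do not reproduce the authors' data pipeline.

```python
import random


def grid_neighbors(rows, cols):
    """Map each (row, col) cell to its up/down/left/right neighbors."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in moves
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
        for r in range(rows) for c in range(cols)
    }


def name_cells(words, rows, cols):
    """Assign one word to each grid cell in row-major order."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    if len(words) != len(cells):
        raise ValueError("need exactly rows * cols words")
    return dict(zip(cells, words))


def random_walk(words, rows, cols, steps, seed=0):
    """Return the word sequence visited by a random walk on the grid."""
    rng = random.Random(seed)
    neighbors = grid_neighbors(rows, cols)
    names = name_cells(words, rows, cols)
    position = rng.choice(sorted(neighbors))
    trace = [names[position]]
    for _ in range(steps):
        position = rng.choice(neighbors[position])
        trace.append(names[position])
    return trace
```

For example, `random_walk(["apple", "bird", "car", "door", "egg", "fish", "gate", "hat", "ice"], 3, 3, steps=20)` yields a 21-word trace in which consecutive words are always grid neighbors; concatenating many such traces forms the in-context exemplars.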
[NLP-130] Adversarial Negotiation Dynamics in Generative Language Models NEURIPS2024
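Quick read: This paper addresses the competitive robustness and safety of generative language models in adversarial settings during contract drafting and enhancement. When opposing parties deploy different language models against each other, the models may generate biased, harmful, or legally problematic text, raising AI safety and security concerns. The authors simulate real-world contract negotiations to evaluate the performance and vulnerabilities of major open-source language models in head-to-head competition and to reveal potential risks. The key to the approach is adversarial testing that exposes model weaknesses, which informs the development of more secure and reliable models and yields actionable strategies for model selection and optimisation in competitive legal contexts.
Link: https://arxiv.org/abs/2501.00069
Authors: Arinbjörn Kolbeinsson, Benedikt Kolbeinsson
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper at NeurIPS 2024 Workshop on Red Teaming GenAI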
Abstract:Generative language models are increasingly used for contract drafting and enhancement, creating a scenario where competing parties deploy different language models against each other. This introduces not only a game-theory challenge but also significant concerns related to AI safety and security, as the language model employed by the opposing party can be unknown. These competitive interactions can be seen as adversarial testing grounds, where models are effectively red-teamed to expose vulnerabilities such as generating biased, harmful or legally problematic text. Despite the importance of these challenges, the competitive robustness and safety of these models in adversarial settings remain poorly understood. In this small study, we approach this problem by evaluating the performance and vulnerabilities of major open-source language models in head-to-head competitions, simulating real-world contract negotiations. We further explore how these adversarial interactions can reveal potential risks, informing the development of more secure and reliable models. Our findings contribute to the growing body of research on AI safety, offering insights into model selection and optimisation in competitive legal contexts and providing actionable strategies for mitigating risks.
[NLP-131] On Adversarial Robustness of Language Models in Transfer Learning
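Quick read: This paper examines the adversarial robustness of large language models (LLMs) in transfer learning scenarios. It finds that transfer learning, while improving standard performance metrics, often makes models more vulnerable to adversarial attacks. Experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi) reveal a complex interplay between model size, architecture, and adaptation method. The key finding is that larger models are more resilient to this effect, which offers insight into balancing performance and security in transfer learning. The study stresses the need to consider adversarial robustness in transfer learning and provides guidance for developing and deploying LLMs in real-world applications.
Link: https://arxiv.org/abs/2501.00066
Authors: Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: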
Abstract:We investigate the adversarial robustness of LLMs in transfer learning scenarios. Through comprehensive experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and various model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi), we reveal that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation methods. Our work highlights the crucial need for considering adversarial robustness in transfer learning scenarios and provides insights into maintaining model security without compromising performance. These findings have significant implications for the development and deployment of LLMs in real-world applications where both performance and robustness are paramount.
[NLP-132] ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
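Quick read: This paper explores collaboration between bidirectional transformers and large language models (LLMs) for sentiment analysis, specifically ELECTRA and GPT-4o on three-way sentiment classification. The central question is whether feeding ELECTRA's outputs (predicted label, probabilities, and retrieved examples) to a GPT model improves performance. Sharing predictions from the fine-tuned (FT) ELECTRA Base with GPT-4o-mini significantly improved performance (82.74 macro F1) over either model alone (79.29 for ELECTRA Base FT, 79.52 for GPT-4o-mini) and gave the lowest cost/performance ratio ($0.12 per F1 point). However, once the GPT models were fine-tuned, adding ELECTRA's predictions decreased performance. GPT-4o FT-M performed best (86.99 macro F1), with GPT-4o-mini FT close behind (86.77 macro F1) at much lower cost ($0.38 vs. $1.59 per F1 point). The authors conclude that augmenting prompts with predictions from a fine-tuned encoder is an efficient way to boost performance, and that a fine-tuned GPT-4o-mini comes close to GPT-4o FT at far lower cost, an economical option for projects with limited resources.
Link: https://arxiv.org/abs/2501.00062
Authors: James P. Beno
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures. Source code and data available at this https URL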
Abstract:Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLM) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.74 macro F1 vs. 79.29 ELECTRA Base FT, 79.52 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.77) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
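For readers unfamiliar with the metric, the sketch below shows how macro F1 is computed and checks the "76% less cost" figure against the two per-F1-point costs quoted in the abstract. The label lists are invented toy data, and `macro_f1` is an illustrative helper, not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy three-way sentiment labels, invented for illustration only.
y_true = ["neg", "neu", "pos", "pos", "neg", "neu"]
y_pred = ["neg", "pos", "pos", "pos", "neg", "neu"]
score = macro_f1(y_true, y_pred, ["neg", "neu", "pos"])

# Cost figures quoted in the abstract (dollars per F1 point).
cost_mini_ft, cost_4o_ft = 0.38, 1.59
saving = 1 - cost_mini_ft / cost_4o_ft
print(f"macro F1 = {score:.3f}, relative saving = {saving:.0%}")
```

Because macro F1 averages classes without weighting, a weak minority class (here `neu`) pulls the score down as much as a weak majority class would.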
[NLP-133] Large Language Models for Mathematical Analysis
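Quick read: This paper targets a key challenge in AI mathematical problem-solving: proof-based problems in mathematical analysis. Most existing work and datasets for large language models (LLMs) concentrate on computational tasks and neglect analysis problems that demand rigorous proofs and formal reasoning. The authors develop the DEMI-MathAnalysis dataset of proof-based problems on topics such as Sequences and Limits, Infinite Series, and Convex Functions. They also design a guiding framework; fine-tuning LLMs on the dataset and applying the framework significantly improves their ability to generate logical, complete, and elegant proofs. The work addresses a critical gap in mathematical reasoning and advances trustworthy AI capable of handling formalized mathematical language.
Link: https://arxiv.org/abs/2501.00059
Authors: Ziye Chen, Hao Qi
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: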
Abstract:Mathematical problem-solving is a key field in artificial intelligence (AI) and a critical benchmark for evaluating the capabilities of large language models (LLMs). While extensive research has focused on mathematical problem-solving, most existing work and datasets concentrate on computational tasks, leaving gaps in areas like mathematical analysis, which demands rigorous proofs and formal reasoning. We developed the DEMI-MathAnalysis dataset, comprising proof-based problems from mathematical analysis topics such as Sequences and Limits, Infinite Series, and Convex Functions. We also designed a guiding framework to rigorously enhance LLMs’ ability to solve these problems. Through fine-tuning LLMs on this dataset and employing our framework, we observed significant improvements in their capability to generate logical, complete, and elegant proofs. This work addresses critical gaps in mathematical reasoning and contributes to advancing trustworthy AI capable of handling formalized mathematical language. The code is publicly accessible at LLMs for Mathematical Analysis.
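[NLP-134] LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Quick read: This paper addresses the safety of large language models (LLMs) against adversarial queries such as jailbreak attacks. Existing jailbreak attacks rely mainly on opaque optimization techniques (e.g., gradient-based optimization) and heuristic search methods (e.g., LLM refinement), which fall short in transparency, transferability, and computational cost. The authors propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm. It treats jailbreaking as both an evolutionary and a transfer learning problem and uses LLMs as heuristic evolutionary operators to achieve high attack efficiency, transferability, and low time cost. Experiments on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared with existing attack methods.
Link: https://arxiv.org/abs/2501.00055
Authors: Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: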
Abstract:While safety-aligned large language models (LLMs) are increasingly used as the cornerstone for powerful systems such as multi-agent frameworks to solve complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLM and make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g. token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
[NLP-135] AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors
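Quick read: This paper addresses the performance trade-off that text-to-image diffusion models face when unlearning inappropriate concepts. Existing fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors, but they show a considerable trade-off between eliminating undesirable concepts and preserving other concepts. The proposed method, AdvAnchor, generates adversarial anchors to alleviate this trade-off. The anchors are crafted to lie close to the embeddings of the undesirable concepts, which maintains overall model performance, while selectively excluding the defining attributes of those concepts for effective erasure. Experiments show that AdvAnchor outperforms state-of-the-art methods.
Link: https://arxiv.org/abs/2501.00054
Authors: Mengnan Zhao, Lihe Zhang, Xingyi Yang, Tianhang Zheng, Baocai Yin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: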
Abstract:Security concerns surrounding text-to-image diffusion models have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors. However, these techniques exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. In this paper, we systematically analyze the impact of diverse text anchors on unlearning performance. Guided by this analysis, we propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue. These adversarial anchors are crafted to closely resemble the embeddings of undesirable concepts to maintain overall model performance, while selectively excluding defining attributes of these concepts for effective erasure. Extensive experiments demonstrate that AdvAnchor outperforms state-of-the-art methods. Our code is publicly available at this https URL.
[NLP-136] Seq2Seq Model-Based Chatbot with LSTM and Attention Mechanism for Enhanced User Interaction AAAI-2025
Quick read: This paper addresses the vendor lock-in and high costs that arise when chatbots rely on predefined APIs (Application Programming Interfaces). It proposes a chatbot built on a Sequence-to-Sequence (Seq2Seq) model with an encoder-decoder architecture that incorporates attention mechanisms and Long Short-Term Memory (LSTM) cells. Avoiding predefined APIs keeps the system flexible and cost-effective. The chatbot was trained, validated, and tested on a dataset curated for the tourism sector in Draa-Tafilalet, Morocco, reaching accuracies of approximately 99.58% in training, 98.03% in validation, and 94.12% in testing, which indicates that it gives relevant and coherent responses within this domain.
Link: https://arxiv.org/abs/2501.00049
Authors: Lamya Benaddi, Charaf Ouaddi, Adnane Souha, Abdeslam Jakimi, Mohamed Rahouti, Mohammed Aledhari, Diogo Oliveira, Brahim Ouchao
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: The Third Workshop on Deployable AI at AAAI-2025
Abstract:A chatbot is an intelligent software application that automates conversations and engages users in natural language through messaging platforms. Leveraging artificial intelligence (AI), chatbots serve various functions, including customer service, information gathering, and casual conversation. Existing virtual assistant chatbots, such as ChatGPT and Gemini, demonstrate the potential of AI in Natural Language Processing (NLP). However, many current solutions rely on predefined APIs, which can result in vendor lock-in and high costs. To address these challenges, this work proposes a chatbot developed using a Sequence-to-Sequence (Seq2Seq) model with an encoder-decoder architecture that incorporates attention mechanisms and Long Short-Term Memory (LSTM) cells. By avoiding predefined APIs, this approach ensures flexibility and cost-effectiveness. The chatbot is trained, validated, and tested on a dataset specifically curated for the tourism sector in Draa-Tafilalet, Morocco. Key evaluation findings indicate that the proposed Seq2Seq model-based chatbot achieved high accuracies: approximately 99.58% in training, 98.03% in validation, and 94.12% in testing. These results demonstrate the chatbot’s effectiveness in providing relevant and coherent responses within the tourism domain, highlighting the potential of specialized AI applications to enhance user experience and satisfaction in niche markets.
[NLP-137] Cross-Linguistic Examination of Machine Translation Transfer Learning
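Quick read: This paper investigates how effective transfer learning is for machine translation across language families, in particular when models pre-trained on high-resource languages are fine-tuned on low-resource languages. It evaluates five language pairs from different linguistic backgrounds (Semitic, Bantu, Romance, Slavic, and language isolates) and analyzes how hyperparameters such as learning rate, batch size, number of epochs, and weight decay affect performance. The results show that transfer learning is effective across language families, although the impact of hyperparameters varies. A moderate batch size (e.g., 32) is generally more effective, while very high learning rates can disrupt training. The study suggests that consistent hyperparameter settings can simplify multilingual model training and make it more efficient.
Link: https://arxiv.org/abs/2501.00045
Authors: Saughmon Boujkian
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: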
Abstract:This study investigates the effectiveness of transfer learning in machine translation across diverse linguistic families by evaluating five distinct language pairs. Leveraging pre-trained models on high-resource languages, these models were fine-tuned on low-resource languages, examining variations in hyperparameters such as learning rate, batch size, number of epochs, and weight decay. The research encompasses language pairs from different linguistic backgrounds: Semitic (Modern Standard Arabic - Levantine Arabic), Bantu (Hausa - Zulu), Romance (Spanish - Catalan), Slavic (Slovakian - Macedonian), and language isolates (Eastern Armenian - Western Armenian). Results demonstrate that transfer learning is effective across different language families, although the impact of hyperparameters varies. A moderate batch size (e.g., 32) is generally more effective, while very high learning rates can disrupt model training. The study highlights the universality of transfer learning in multilingual contexts and suggests that consistent hyperparameter settings can simplify and enhance the efficiency of multilingual model training.
[NLP-138] Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs
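Quick read: This paper addresses the deployment challenges of large language model (LLM) inference caused by model size and resource requirements. Quantization eases memory pressure, but the group quantization formats in common use add significant compute overhead, especially during dequantization. A large share of compute instructions then does not perform multiplies, which makes it hard to meet the latency required for LLMs deployed on commodity CPUs.
The key to the solution is a set of highly optimized kernels that accelerate LLM inference and exploit the full potential of CPUs, particularly Arm CPUs. The kernels amortize the cost of operand loading and weight unpacking across multiple output rows, and they introduce an optimized interleaved group data layout together with decompression path optimizations that reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations. The paper also presents a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization that better matches the non-uniform patterns in LLM weight distributions, giving higher throughput during token generation with better quality than the state of the art. Applied to 4-bit LLMs on Arm CPUs, these optimizations yield a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding.
Link: https://arxiv.org/abs/2501.00032
Authors: Dibakar Gope, David Mansell, Danny Loh, Ian Bratt
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: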
Abstract:Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies, i.e., real work, rendering them unsuitable for meeting the required latency requirements for LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. This, along with the introduction of an optimized interleaved group data layout for weights and decompression path optimizations to reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to this http URL-based solution. The optimized kernels are available at this https URL.
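To make the terms concrete, here is a minimal pure-Python sketch of group quantization and of a codebook lookup. It illustrates the general idea only: the symmetric per-group scale, the group size of 32, and all function names are assumptions chosen for illustration, not the paper's formats or kernels.

```python
def quantize_group(weights, bits=4):
    """Symmetric uniform quantization of one group with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights; this multiply is the dequantization step."""
    return [c * scale for c in codes]


def quantize(weights, group_size=32, bits=4):
    """Split a flat weight list into groups and quantize each one separately."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def quantize_codebook(weights, codebook):
    """Non-uniform variant: store the index of the nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(w - codebook[k]))
            for w in weights]


# Deterministic toy weights in [-0.5, 0.5], invented for illustration.
weights = [((i * 37) % 101 - 50) / 100 for i in range(64)]
groups = quantize(weights, group_size=32)
indices = quantize_codebook([0.1, -0.9], [-1.0, 0.0, 1.0])
```

With uniform codes the reconstruction error per weight is at most half a scale step; a non-uniform codebook can instead place its entries where the weights are dense.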
[NLP-139] Distilling Large Language Models for Efficient Clinical Information Extraction
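Quick read: This paper addresses the high computational demands that limit practical deployment of large language models (LLMs) for clinical information extraction. The key is knowledge distillation, which transfers knowledge from a large model to a smaller one so that performance stays high while compute cost and inference time drop substantially. The authors use state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers to train distilled BERT models for clinical named entity recognition (NER). The distilled BERT models perform close to their teachers on disease, medication, and symptom extraction, with up to 12x faster inference and up to 101x lower cost. Distillation thus offers a computationally efficient and scalable alternative for clinical information extraction.
Link: https://arxiv.org/abs/2501.00031
Authors: Karthik S. Vedula, Annika Gupta, Akshay Swaminathan, Ivan Lopez, Suhana Bedi, Nigam H. Shah
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 1 figure, 10 tables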
Abstract:Large language models (LLMs) excel at clinical information extraction but their computational demands limit practical deployment. Knowledge distillation–the process of transferring knowledge from larger to smaller models–offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks. We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset. For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptoms: F1 score of 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, 8x faster than GPT-4o, o1-mini, and Gemini Flash respectively) and lower costs (85x, 101x, 2x cheaper than GPT-4o, o1-mini, and Gemini Flash respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom). Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.
[NLP-140] Underutilization of Syntactic Processing by Chinese Learners of English in Comprehending English Sentences Evidenced from Adapted Garden-Path Ambiguity Experiment
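Quick read: The problem addressed is that Chinese learners of English under-utilize syntactic processing when comprehending English sentences. Previous studies stressed the dominant role of semantic processing in sentence comprehension; this study takes a syntactic perspective and shows that Chinese learners neglect syntactic processing.
The key to the approach is an adapted experimental design: the study uses semantically ambiguous but syntactically unambiguous sentences in place of the locally ambiguous but globally unambiguous sentences of the traditional garden-path experiment, which lets the researchers assess learners' syntactic processing more accurately. Descriptive and inferential statistical analyses identify two types of under-utilization, partial and complete, and indicate that trial and error in syntactic processing contributes to both. These findings lay a foundation for a new parsing method that fully integrates syntactic processing into sentence comprehension, with the aim of improving English sentence comprehension for Chinese learners.
Link: https://arxiv.org/abs/2501.00030
Authors: Jiapeng Xu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 18 pages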
Abstract:Many studies have revealed that sentence comprehension relies more on semantic processing than on syntactic processing. However, previous studies have predominantly emphasized the preference for semantic processing, focusing on the semantic perspective. In contrast, this current study highlights the under-utilization of syntactic processing, from a syntactic perspective. Based on the traditional garden-path experiment, which involves locally ambiguous but globally unambiguous sentences, this study’s empirical experiment innovatively crafted an adapted version featuring semantically ambiguous but syntactically unambiguous sentences to meet its specific research objective. This experiment, involving 140 subjects, demonstrates through descriptive and inferential statistical analyses using SPSS, Graph Pad Prism, and Cursor that Chinese learners of English tend to under-utilize syntactic processing when comprehending English sentences. The study identifies two types of parsing under-utilization: partial and complete. Further exploration reveals that trial and error in syntactic processing contributes to both. Consequently, this study lays a foundation for the development of a novel parsing method designed to fully integrate syntactic processing into sentence comprehension, thereby enhancing the level of English sentence comprehension for Chinese learners of English.
[NLP-141] A Breadth-First Catalog of Text Processing Speech Processing and Multimodal Research in South Asian Languages
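Quick read: This paper reviews recent research (January 2022 to October 2024) on text processing, multimodal models, and speech processing for South Asian languages, with a spotlight on 21 low-resource South Asian languages (such as Saraiki, Assamese, and Balochi). It identifies trends, challenges, and future directions using relevance classification and clustering based on large language models (LLMs). The key to the approach is LLM-assisted systematic literature analysis, which gives NLP researchers interested in South Asian language technologies a breadth-first overview of the field.
Link: https://arxiv.org/abs/2501.00029
Authors: Pranav Gupta
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: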
Abstract:We review the recent literature (January 2022 - October 2024) in South Asian languages on text-based language processing, multimodal models, and speech processing, and provide a spotlight analysis focused on 21 low-resource South Asian languages, namely Saraiki, Assamese, Balochi, Bhojpuri, Bodo, Burmese, Chhattisgarhi, Dhivehi, Gujarati, Kannada, Kashmiri, Konkani, Khasi, Malayalam, Meitei, Nepali, Odia, Pashto, Rajasthani, Sindhi, and Telugu. We identify trends, challenges, and future research directions, using a step-wise approach that incorporates relevance classification and clustering based on large language models (LLMs). Our goal is to provide a breadth-first overview of the recent developments in South Asian language technologies to NLP researchers interested in working with South Asian languages.
[NLP-142] NewsHomepages: Homepage Layouts Capture Information Prioritization Decisions
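Quick read: This paper studies information prioritization, which plays an important role in how humans perceive and understand the world, using news homepage layouts as a tangible proxy. It presents NewsHomepages, a large dataset of over 3,000 news website homepages (local, national, and topic-specific outlets) captured twice daily over a three-year period. The key to the approach is training models that make pairwise comparisons between news items to infer their relative significance. The authors also apply the models to assess the "newsworthiness" of local city council policies passed over a ten-year period in San Francisco, which illustrates that modeling organizational hierarchies has broader implications. The findings lay groundwork for using implicit organizational cues to deepen understanding of information prioritization.
Link: https://arxiv.org/abs/2501.00004
Authors: Ben Welsh, Naitian Zhou, Arda Kaz, Michael Vu, Alexander Spangher
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: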
Abstract:Information prioritization plays an important role in how humans perceive and understand the world. Homepage layouts serve as a tangible proxy for this prioritization. In this work, we present NewsHomepages, a large dataset of over 3,000 news website homepages (including local, national and topic-specific outlets) captured twice daily over a three-year period. We develop models to perform pairwise comparisons between news items to infer their relative significance. To illustrate that modeling organizational hierarchies has broader implications, we applied our models to rank-order a collection of local city council policies passed over a ten-year period in San Francisco, assessing their “newsworthiness”. Our findings lay the groundwork for leveraging implicit organizational cues to deepen our understanding of information prioritization.
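One simple way to turn pairwise comparisons into a rank order is to count wins, sketched below. The `prefers` function is a hypothetical stand-in for a learned comparison model, the `PROMINENCE` scores are invented, and win counting is an assumption made for illustration; the abstract does not say how the authors aggregate comparisons.

```python
from itertools import combinations


def rank_by_pairwise_wins(items, prefers):
    """Order items by how many pairwise comparisons each one wins."""
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    return sorted(items, key=lambda item: wins[item], reverse=True)


# Hypothetical stand-in for a learned model: the item with the larger
# placeholder "prominence" score is preferred. All values are invented.
PROMINENCE = {"budget vote": 0.9, "road repair": 0.4, "park naming": 0.2}


def prefers(a, b):
    """Return True if item `a` should be ranked above item `b`."""
    return PROMINENCE[a] > PROMINENCE[b]


ranking = rank_by_pairwise_wins(list(PROMINENCE), prefers)
print(ranking)
```

Win counting needs all n(n-1)/2 comparisons; with a noisy model, a probabilistic aggregator such as Bradley-Terry would be a natural alternative.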
[NLP-143] SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation ICASSP2025
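Quick read: This paper addresses the lack of semantic coherence in speech generated by "textless" speech language models (SLMs) based on speech units. The key is to integrate large language models (LLMs) with SLMs, in a method called SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). An LLM first generates the textual content of a spoken dialogue; the text is then converted into phoneme sequences, and a two-tower transformer-based duration predictor predicts the duration of each phoneme; finally, an SLM conditioned on the phoneme sequences vocalizes the dialogue. Experiments show that the system generates naturalistic spoken dialogue while maintaining high semantic coherence.
Link: https://arxiv.org/abs/2501.00805
Authors: Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by ICASSP 2025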
Abstract:Recently, “textless” speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
[NLP-144] Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing ICASSP2025
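Quick read: This paper addresses how to effectively distinguish the pronunciation correlations between different written texts, an issue in linguistic acoustics. Such correlations are traditionally obtained from manually designed pronunciation lexicons; the paper proposes a data-driven method, automatic text pronunciation correlation (ATPC), that acquires them automatically. The key is that it needs only the supervision used to train end-to-end automatic speech recognition (E2E-ASR) systems, namely speech and the corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm aligns the speech with the annotated text symbols; then a speech encoder converts the speech into speech embeddings; finally, the distances between the speech embeddings of different text symbols are compared to obtain ATPC. Experiments on Mandarin show that ATPC improves E2E-ASR performance in contextual biasing and holds promise for dialects or languages that lack manually built pronunciation lexicons.
Link: https://arxiv.org/abs/2501.00804
Authors: Gaofeng Cheng, Haitian Lu, Chengxu Yang, Xuyang Wang, Ta Li, Yonghong Yan
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: Accepted by ICASSP 2025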
Abstract:Effectively distinguishing the pronunciation correlations between different written texts is a significant issue in linguistic acoustics. Traditionally, such pronunciation correlations are obtained through manually designed pronunciation lexicons. In this paper, we propose a data-driven method to automatically acquire these pronunciation correlations, called automatic text pronunciation correlation (ATPC). The supervision required for this method is consistent with the supervision needed for training end-to-end automatic speech recognition (E2E-ASR) systems, i.e., speech and corresponding text annotations. First, the iteratively-trained timestamp estimator (ITSE) algorithm is employed to align the speech with their corresponding annotated text symbols. Then, a speech encoder is used to convert the speech into speech embeddings. Finally, we compare the speech embeddings distances of different text symbols to obtain ATPC. Experimental results on Mandarin show that ATPC enhances E2E-ASR performance in contextual biasing and holds promise for dialects or languages lacking artificial pronunciation lexicons.
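The final step, comparing speech-embedding distances between text symbols, might look like the sketch below. It assumes that alignment and encoding have already produced a few embeddings per symbol, averages them into one centroid per symbol, and uses cosine distance; the averaging, the distance choice, the symbol names, and the 2-D vectors are all illustrative assumptions, not details from the paper.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 minus cosine similarity; small values mean similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def pronunciation_distances(aligned):
    """`aligned` maps each text symbol to the embeddings of its speech segments."""
    centroids = {sym: mean_vector(vecs) for sym, vecs in aligned.items()}
    symbols = sorted(centroids)
    return {(a, b): cosine_distance(centroids[a], centroids[b])
            for i, a in enumerate(symbols) for b in symbols[i + 1:]}


# Made-up 2-D embeddings; real ones would come from a speech encoder.
aligned = {
    "shi_1": [[0.9, 0.1], [1.0, 0.2]],
    "shi_2": [[0.95, 0.15]],
    "ma": [[0.1, 0.9]],
}
distances = pronunciation_distances(aligned)
```

In this toy data the two `shi` symbols end up close together and far from `ma`, which is the kind of signal a contextual-biasing component could use in place of a hand-built lexicon.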
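[NLP-145] Speech Recognition With LLMs Adapted to Disordered Speech Using Reinforcement Learning ICASSP2025
Quick read: This paper addresses how well a large language model (LLM) that processes speech input adapts to disordered speech. Instead of relying on conventional fine-tuning alone, it tunes the model further with reinforcement learning (RL), specifically reinforcement learning on human preference (RLHF). First, low-frequency text tokens in the LLM's vocabulary are replaced with audio tokens and the model is fine-tuned on speech with transcripts so that it can recognize speech; second, RL with rewards based on syntactic and semantic accuracy measures generalizes the model further to recognize disordered speech. Although the resulting LLM does not outperform existing speech recognition systems, RL tuning with custom rewards performs substantially better than supervised fine-tuning, particularly when adapting to speech in a different setting, which makes it a promising alternative tuning strategy.
Link: https://arxiv.org/abs/2501.00039
Authors: Chirag Nagpal, Subhashini Venugopalan, Jimmy Tobin, Marilyn Ladewig, Katherine Heller, Katrin Tomanek
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at ICASSP 2025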
Abstract:We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in an LLM’s vocabulary with audio tokens and enables the model to recognize speech by fine-tuning it on speech with transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures generalizing the LLM further to recognize disordered speech. While the resulting LLM does not outperform existing systems for speech recognition, we find that tuning with reinforcement learning using custom rewards leads to substantially better performance than supervised fine-tuning of the language model, specifically when adapting to speech in a different setting. This presents a compelling alternative tuning strategy for speech recognition using large language models.
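The abstract does not spell out its reward functions. As a rough illustration of a reward based on a surface-level accuracy measure, the sketch below uses word error rate (WER), a standard speech recognition metric; the `reward` function and its `1 - WER` form are assumptions for illustration, not the authors' definition.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1] / max(len(ref), 1)


def reward(reference, hypothesis):
    """Hypothetical reward: higher when the transcript has fewer word errors."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


wer = word_error_rate("turn on the light", "turn on light")
score = reward("turn on the light", "turn on light")
```

A semantic reward would need a different measure, for example one that compares meaning rather than exact words, so that a transcript with a harmless word swap is not penalized as heavily.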
Computer Vision
[CV-0] GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
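Quick read: This paper addresses the limitations of current 2D vision-language models (VLMs) in 3D spatial understanding, in particular the lack of global-local correspondence between the scene and individual frames. Recent methods have made progress with 3D point clouds and multi-view images as inputs but still depend on complex multimodal inputs. The paper proposes a purely vision-based solution inspired by human perception that relies only on visual cues. The key is GPT4Scene, a visual prompting paradigm that constructs a 3D Bird's Eye View (BEV) image from the video and marks consistent object IDs across the video frames and the BEV image, which helps the model build global-local relationships. This significantly improves 3D spatial understanding of indoor scenes; in zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. The authors also fine-tune open-source VLMs on a processed video dataset with 165K text annotations, achieving state-of-the-art performance on all 3D understanding tasks.
Link: https://arxiv.org/abs/2501.01428
Authors: Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, Hengshuang Zhao
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL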
Abstract:In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-text understanding tasks. However, their performance in 3D spatial comprehension, which is critical for embodied intelligence, remains limited. Recent advances have leveraged 3D point clouds and multi-view images as inputs, yielding promising results. However, we propose exploring a purely vision-based solution inspired by human perception, which merely relies on visual cues for 3D spatial understanding. This paper empirically investigates the limitations of VLMs in 3D spatial knowledge, revealing that their primary shortcoming lies in the lack of global-local correspondence between the scene and individual frames. To address this, we introduce GPT4Scene, a novel visual prompting paradigm in VLM training and inference that helps build the global-local relationship, significantly improving the 3D spatial understanding of indoor scenes. Specifically, GPT4Scene constructs a 3D Bird’s Eye View (BEV) image from the video and marks consistent object IDs across both frames and the BEV image. The model then inputs the concatenated BEV image and video frames with markers. In zero-shot evaluations, GPT4Scene improves performance over closed-source VLMs like GPT-4o. Additionally, we prepare a processed video dataset consisting of 165K text annotation to fine-tune open-source VLMs, achieving state-of-the-art performance on all 3D understanding tasks. Surprisingly, after training with the GPT4Scene paradigm, VLMs consistently improve during inference, even without visual prompting and BEV image as explicit correspondence. It demonstrates that the proposed paradigm helps VLMs develop an intrinsic ability to understand 3D scenes, which paves the way for a noninvasive approach to extending pre-trained VLMs for 3D scene understanding.
[CV-1] VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
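[Quick read]: This paper addresses the challenge of inserting a given object into a video during video generation, in particular how to preserve the appearance details of the reference object while accurately modeling coherent motion. The key to the solution is the VideoAnydoor framework, which achieves high-fidelity detail preservation and precise motion control through the following components. First, an ID extractor injects global identity information, and a box sequence controls the overall motion. Second, a pixel warper takes a reference image with arbitrary key-points and their trajectories as input, warps pixel details according to the trajectories, and fuses the warped features with the diffusion U-Net, which improves detail preservation and lets users manipulate the motion trajectories. In addition, the paper proposes a training strategy that combines videos and static images and uses a reweighted reconstruction loss to further improve insertion quality. VideoAnydoor shows significant advantages across a range of downstream applications without task-specific fine-tuning.
Link: https://arxiv.org/abs/2501.01427
Authors: Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Method for object insertion in videos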
Abstract:Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweight reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
[CV-2] Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions
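[Quick read]: This paper addresses the difficulty of controlling both dynamic objects and camera motion in generated videos. Because datasets with comprehensive motion annotations are lacking, existing algorithms cannot control camera and object motion simultaneously, which limits the controllability of the generated content. To address this, the paper introduces a synthetic dataset, the Synthetic Dataset for Free-Form Motion Control (SynFMC). The dataset contains diverse objects and environments and covers a variety of motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information it provides helps models learn to disentangle the motion effects of objects and the camera in a video. To validate the effectiveness and generalization of SynFMC, the paper further proposes a Free-Form Motion Control (FMC) method. FMC can control object and camera motion independently or simultaneously, produces high-fidelity videos, and is compatible with various personalized text-to-image (T2I) models for different content styles. Experiments show that FMC outperforms existing methods across multiple scenarios.
Link: https://arxiv.org/abs/2501.01425
Authors: Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, Dacheng Tao
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL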
Abstract:Controlling the movements of dynamic objects and the camera within generated videos is a meaningful yet challenging task. Due to the lack of datasets with comprehensive motion annotations, existing algorithms can not simultaneously control the motions of both camera and objects, resulting in limited controllability over generated contents. To address this issue and facilitate the research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse objects and environments and covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information facilitates models learning to disentangle the motion effects from objects and the camera in a video. To validate the effectiveness and generalization of SynFMC, we further propose a method, Free-Form Motion Control (FMC). FMC enables independent or simultaneous control of object and camera movements, producing high-fidelity videos. Moreover, it is compatible with various personalized text-to-image (T2I) models for different content styles. Extensive experiments demonstrate that the proposed FMC outperforms previous methods across multiple scenarios.
[CV-3] Object-level Visual Prompts for Compositional Image Generation
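[Quick read]: This paper addresses the problem of composing object-level visual prompts within a text-to-image diffusion model. The goal is to generate semantically coherent compositions across diverse scenes and styles while keeping the identity of the objects in the input visual prompts unchanged. The key challenge is preserving object identity accurately while still producing diverse compositions. To address it, the paper proposes a new KV-mixed cross-attention mechanism in which keys and values are learned from distinct visual representations: the keys come from an encoder with a small bottleneck and are used for layout control, whereas the values come from an encoder with a larger bottleneck that captures fine-grained appearance details. By mixing keys and values from these complementary sources, the model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. At inference time, the paper also proposes object-level compositional guidance to further improve identity preservation and layout correctness. Experiments show that the method produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
Link: https://arxiv.org/abs/2501.01424
Authors: Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project: this https URL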
Abstract:We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method’s identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
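The idea of taking keys and values from different encoders can be illustrated with a small sketch. This is not the authors' implementation: it is plain single-head scaled dot-product cross-attention in pure Python, and the function name `kv_mixed_cross_attention` and the toy inputs are hypothetical. The only point it shows is that the attention weights depend on the keys (assumed here to come from a small-bottleneck "layout" encoder) while the output content comes from the values (assumed to come from a larger-bottleneck "appearance" encoder).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_cross_attention(queries, keys, values):
    """Single-head cross-attention with keys and values from different sources.

    queries: list of d_k-dim vectors (image-side features).
    keys:    list of d_k-dim vectors, one per visual-prompt token
             (hypothetically from a small-bottleneck encoder).
    values:  list of d_v-dim vectors, one per visual-prompt token
             (hypothetically from a larger-bottleneck encoder).
    keys[i] and values[i] must describe the same prompt token.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same number of tokens")
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Attention weights are computed from the keys only.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted mix of the values only.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)])
    return outputs
```

Note that the key and value dimensions are free to differ, which is what allows the two encoders to use different bottleneck sizes in a design like this.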
[CV-4] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
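[Quick read]: This paper addresses an optimization dilemma that latent diffusion models with Transformer architectures face when generating high-fidelity images. In existing designs, increasing the per-token feature dimension in visual tokenizers improves reconstruction quality but requires larger diffusion models and more training iterations to reach comparable generation performance. As a result, systems either produce visual artifacts due to information loss within the tokenizer or fail to converge fully because of high computation costs. The dilemma stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces.
The key to the solution is a new way of training the visual tokenizer: a variational autoencoder aligned with pre-trained vision foundation models (VA-VAE). VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models and lets Diffusion Transformers (DiT) converge faster in high-dimensional latent spaces. The paper also builds an enhanced DiT baseline, LightningDiT, with improved training strategies and architecture designs. The integrated system achieves an FID score of 1.35 on ImageNet 256x256 generation and reaches an FID score of 2.11 in only 64 epochs, a convergence speedup of more than 21 times over the original DiT.
Link: https://arxiv.org/abs/2501.01423
Authors: Jingfeng Yao, Xinggang Wang
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Models and codes are available at: this https URL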
Abstract:Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs–representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: this https URL.
[CV-5] Multi-Modal Video Feature Extraction for Popularity Prediction
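[Quick read]: This paper aims to predict the popularity of short videos from the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. The key to the solution is combining video-modality features with text-based content understanding. First, video classification models with different architectures and training methods serve as backbone networks to extract video-modality features. Second, the cleaned video captions are placed in a carefully designed prompt framework and fed, together with the video, to video-to-text generation models, which produce detailed text-based descriptions of the video content; a pre-trained BERT model then encodes these texts into vectors. Based on these six sets of vectors, a neural network is trained for each of the four prediction metrics. The study also performs data mining and feature engineering on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Finally, multiple machine learning models are trained, the most stable one, XGBoost, is selected, and the predictions of the neural network and XGBoost models are averaged to obtain the final result.
Link: https://arxiv.org/abs/2501.01422
Authors: Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: INFORMS 2024 Data Challenge Competition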
Abstract:This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. Meanwhile, the cleaned video captions are incorporated into a carefully designed prompt framework, along with the video, as input for video-to-text generation models, which generate detailed text-based video content understanding. These texts are then encoded into vectors using a pre-trained BERT model. Based on the six sets of vectors mentioned above, a neural network is trained for each of the four prediction metrics. Moreover, the study conducts data mining and feature engineering based on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained, and the most stable model, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
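The final step, averaging the neural network and XGBoost predictions, can be sketched as follows. This is an assumed, minimal illustration: the abstract only says the predictions "are averaged", so the unweighted mean, the metric key names in `METRICS`, and the function `average_predictions` are all hypothetical choices, and the inputs stand in for the outputs of already-trained models.

```python
# Hypothetical key names for the four engagement metrics named in the abstract.
METRICS = ("view_count", "like_count", "comment_count", "share_count")


def average_predictions(nn_preds, xgb_preds):
    """Average two models' predictions per metric and per video.

    nn_preds, xgb_preds: dict mapping each metric name to a list of
    predicted values, one per video, in the same video order.
    Returns a dict with the same layout holding the unweighted mean.
    """
    averaged = {}
    for metric in METRICS:
        a, b = nn_preds[metric], xgb_preds[metric]
        if len(a) != len(b):
            raise ValueError(f"prediction lengths differ for {metric}")
        averaged[metric] = [(x + y) / 2.0 for x, y in zip(a, b)]
    return averaged
```

An unweighted mean is the simplest way to combine two regressors; whether the authors weighted the models or averaged in a transformed (for example logarithmic) space is not stated in the abstract.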
[CV-6] R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
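[Quick read]: This paper addresses the limited robustness of scene coordinate regression (SCR) based visual localization on datasets with complex illumination changes or image-level ambiguities. To address this, the authors propose a covisibility-graph-based global encoding learning and data augmentation strategy, combined with a depth-adjusted reprojection loss that facilitates implicit triangulation. They also revisit the network architecture and the local feature extraction module. With these improvements, the method achieves state-of-the-art performance on challenging large-scale datasets without relying on network ensembles or 3D supervision. On the Aachen Day-Night dataset in particular, it is 10 times more accurate than previous SCR methods with similar map sizes, and it requires map sizes at least 5 times smaller than any other SCR method while still delivering higher accuracy.
Link: https://arxiv.org/abs/2501.01421
Authors: Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL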
Abstract:Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, it remains a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10× more accurate than previous SCR methods with similar map sizes and require at least 5× smaller map sizes than any other SCR method while still delivering superior accuracy. Code will be available at: this https URL.
[CV-7] A Multi-task Supervised Compression Model for Split Computing WACV2025
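[Quick read]: This paper addresses the drop in model accuracy and/or significant increase in runtime latency that occurs when existing split computing methods are applied to multi-task problems in resource-constrained edge computing systems. The key to the solution is Ladon, the first multi-task-head supervised compression model for multi-task split computing. While learning compressed representations at its early layers, the model outperforms or rivals strong lightweight baseline models in predictive performance on the ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets, and it significantly reduces end-to-end latency (by up to 95.4%) and the energy consumption of mobile devices (by up to 88.2%).
Link: https://arxiv.org/abs/2501.01420
Authors: Yoshitomo Matsubara, Matteo Mendula, Marco Levorato
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted at WACV 2025. Code and models are available at this https URL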
Abstract:Split computing (≠ split learning) is a promising approach to deep learning models for resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multi-task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance for ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.
[CV-8] Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension AAAI2025
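[Quick read]: This paper addresses the challenging task of Generalized Referring Expression Comprehension (GREC). Unlike classic Referring Expression Comprehension (REC), which focuses on single-target expressions, GREC also covers no-target and multi-target expressions. Existing REC methods struggle with the complex cases in GREC, mainly because of their fixed output and the limitations of their multi-modal representations. To address these issues, the paper proposes a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G). The network introduces a Hierarchical Multi-modal Semantic Alignment (HMSA) module that performs alignment at three levels (word-object, phrase-object, and text-image), which strengthens multi-modal understanding. To handle the varying number of target objects in GREC, the paper also proposes an Adaptive Grounding Counter (AGC) that dynamically determines the number of output targets and uses an auxiliary contrastive loss to improve object-counting ability. Experiments show that HieA2G achieves new state-of-the-art performance on GREC and on four other tasks (REC, phrase grounding, referring expression segmentation, and generalized referring expression segmentation), demonstrating its superiority and generalizability.
Link: https://arxiv.org/abs/2501.01416
Authors: Yaxian Wang, Henghui Ding, Shuting He, Xudong Jiang, Bifan Wei, Jun Liu
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025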
Abstract:In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignments, including word-object, phrase-object, and text-image alignment. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling in multi-modal features with the same counting and pushing away those with different counting. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and also the other 4 tasks, including REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the remarkable superiority and generalizability of the proposed HieA2G.
[CV-9] On Unifying Video Generation and Camera Pose Estimation
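[Quick read]: This paper explores whether video generation models exhibit 3D awareness and investigates how intermediate features from a video generator can support camera pose estimation. Using structure-from-motion (SfM) as a benchmark for 3D tasks, the study feeds intermediate features from the OpenSora video generation model into SfM-prediction modules such as DUSt3R and evaluates how well these features support camera pose estimation. It finds that the intermediate features have only limited inherent 3D awareness, but that task-specific fine-tuning significantly improves their accuracy for camera pose estimation. The paper then proposes a unified model, JOG3R, which produces competitive camera pose estimates without degrading video generation quality. The key to the solution is strengthening the 3D awareness of the video generation model through fine-tuning with task-specific supervision.
Link: https://arxiv.org/abs/2501.01409
Authors: Chun-Hao Paul Huang, Jae Shin Yoon, Hyeonho Jeong, Niloy Mitra, Duygu Ceylan
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: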
Abstract:Inspired by the emergent 3D capabilities in image generators, we explore whether video generators similarly exhibit 3D awareness. Using structure-from-motion (SfM) as a benchmark for 3D tasks, we investigate if intermediate features from OpenSora, a video generation model, can support camera pose estimation. We first examine native 3D awareness in video generation features by routing raw intermediate outputs to SfM-prediction modules like DUSt3R. Then, we explore the impact of fine-tuning on camera pose estimation to enhance 3D awareness. Results indicate that while video generator features have limited inherent 3D awareness, task-specific supervision significantly boosts their accuracy for camera pose estimation, resulting in competitive performance. The proposed unified model, named JOG3R, produces camera pose estimates with competitive quality without degrading video generation quality.
[CV-10] Nested Attention: Semantic-aware Attention Values for Concept Personalization
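[Quick read]: This paper addresses how personalized text-to-image models can preserve the identity of a specific subject while keeping the generated image aligned with the input text prompt. Current methods either represent the subject with a single textual token, which limits expressiveness, or use richer representations that disrupt the model's prior and weaken prompt alignment.
The key to the solution is a new mechanism called Nested Attention. It injects a rich and expressive image representation into the model's existing cross-attention layers and generates query-dependent subject values. These values come from nested attention layers that learn to select the relevant subject features for each region of the generated image, which yields high identity preservation while staying aligned with the input text prompt. The method also preserves the model's prior, so multiple personalized subjects from different domains can be combined in a single image.
Link: https://arxiv.org/abs/2501.01407
Authors: Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page at this https URL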
Abstract:Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
[CV-11] nnY-Net: Swin-NeXt with Cross-Attention for 3D Medical Images Segmentation MICCAI
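[Quick read]: This paper addresses challenges in 3D medical image segmentation, in particular how to use patient features (such as pathology and treatment information) more effectively to improve segmentation accuracy. The key to the solution is a new model structure, nnY-Net, which adds a Cross-Attention module at the bottom of the U-Net to form a Y-shaped structure. The module uses the lowest-level feature map of the encoder as Key and Value and the patient features as Query to compute the attention weights, which strengthens the model's ability to capture key information. The paper also combines the strengths of two recent SOTA models, MedNeXt and SwinUNETR, in a Swin-NeXt structure that uses Swin Transformer as the encoder and ConvNeXt as the decoder. To improve training efficiency, it further proposes a DiceFocalCELoss loss function to deal with the uneven data convergence of voxel classification.
Link: https://arxiv.org/abs/2501.01406
Authors: Haixu Liu, Zerui Tao, Wenzhen Dong, Qiuzhuang Sun
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI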
Abstract:This paper provides a novel 3D medical image segmentation model structure called nnY-Net. This name comes from the fact that our model adds a cross-attention module at the bottom of the U-net structure to form a Y structure. We integrate the advantages of the two latest SOTA models, MedNeXt and SwinUNETR, and use Swin Transformer as the encoder and ConvNeXt as the decoder to innovatively design the Swin-NeXt structure. Our model uses the lowest-level feature map of the encoder as Key and Value and uses patient features such as pathology and treatment information as Query to calculate the attention weights in a Cross Attention module. Moreover, we simplify some pre- and post-processing as well as data enhancement methods in 3D image segmentation based on the dynUnet and nnU-net frameworks. We integrate our proposed Swin-NeXt with Cross-Attention framework into this framework. Last, we construct a DiceFocalCELoss to improve the training efficiency for the uneven data convergence of voxel classification.
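The abstract names a DiceFocalCELoss but does not give its formula. As a rough illustration of what a loss with that name plausibly combines, the sketch below sums a Dice term, a focal term, and a cross-entropy term for binary per-voxel predictions. Everything specific here is an assumption rather than the paper's definition: the binary setting, the equal default weights, `gamma=2.0`, the smoothing constants, and the function names.

```python
import math

EPS = 1e-7  # keeps log() away from 0; an arbitrary small constant


def _clamp(p):
    """Clamp a probability into (0, 1) so its logarithm is finite."""
    return min(max(p, EPS), 1.0 - EPS)


def dice_loss(probs, targets, smooth=1e-5):
    """1 - Dice coefficient between predicted probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def cross_entropy_loss(probs, targets):
    """Mean binary cross-entropy over voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = _clamp(p)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0):
    """Mean binary focal loss: cross-entropy down-weighted on easy voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = _clamp(p if t == 1 else 1.0 - p)  # probability of the true class
        total -= (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_focal_ce_loss(probs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of Dice, focal, and cross-entropy terms (assumed form)."""
    w_dice, w_focal, w_ce = weights
    return (
        w_dice * dice_loss(probs, targets)
        + w_focal * focal_loss(probs, targets)
        + w_ce * cross_entropy_loss(probs, targets)
    )
```

The usual motivation for mixing these terms is that Dice handles class imbalance at the region level while the focal term emphasizes hard voxels, which is consistent with the abstract's mention of uneven convergence, though the paper's exact weighting may differ.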
[CV-12] Learning 3D Garment Animation from Trajectories of A Piece of Cloth NEURIPS2024
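[Quick read]: This paper addresses the problem that learning-based garment animation methods, used in applications such as virtual reality, gaming, and film production, require large amounts of garment data and generalize poorly to unseen cases. The key to the solution is a disentangled learning scheme: an Energy Unit Network (EUNet) learns constitutive behaviors from an observed piece of cloth, and various garments are then animated dynamically through energy optimization. EUNet captures the constitutive relations directly from the observed cloth and uniformly describes, in the form of energy, the changes caused by deformations such as stretching and bending. The disentangled scheme reduces the dependence on large-scale garment data and makes it possible to animate many garments from the dynamics of a single piece of cloth, producing more stable and physically plausible animation.
Link: https://arxiv.org/abs/2501.01393
Authors: Yidi Shao, Chen Change Loy, Bo Dai
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted by NeurIPS2024, 16 pages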
Abstract:Garment animation is ubiquitous in various applications, such as virtual reality, gaming, and film production. Recently, learning-based approaches have obtained compelling performance in animating diverse garments under versatile scenarios. Nevertheless, to mimic the deformations of the observed garments, data-driven methods require large-scale garment data, which are both resource-wise expensive and time-consuming. In addition, forcing models to match the dynamics of observed garment animation may hinder the potential to generalize to unseen cases. In this paper, instead of using garment-wise supervised learning, we adopt a disentangled scheme to learn how to animate observed garments: 1) learning constitutive behaviors from the observed cloth; 2) dynamically animating various garments constrained by the learned constitutive laws. Specifically, we propose the Energy Unit network (EUNet) to model the constitutive relations in the form of energy. Without the priors from analytical physics models and differentiable simulation engines, EUNet is able to directly capture the constitutive behaviors from the observed piece of cloth and uniformly describe the change of energy caused by deformations, such as stretching and bending. We further apply the pre-trained EUNet to animate various garments based on energy optimizations. The disentangled scheme alleviates the need for garment data and enables us to utilize the dynamics of a piece of cloth for animating garments. Experiments show that while EUNet effectively delivers the energy gradients due to the deformations, models constrained by EUNet achieve more stable and physically plausible performance compared with those trained in a garment-wise supervised manner. Code is available at this https URL.
[CV-13] Iris Recognition for Infants
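[Quick read]: This paper addresses non-invasive, efficient, physical-token-less, accurate, and stable identification of newborns and infants, with the goals of preventing baby swapping at birth, limiting baby abductions, and improving post-natal health monitoring across geographies. The key elements of the solution are: (a) collecting iris images from 17 infants with a specially designed near-infrared (NIR) iris sensor; (b) evaluating six iris recognition methods to assess how ready existing technology is for newborns and infants; (c) proposing a new segmentation model that correctly detects iris texture in infant iris images and coupling it with several iris texture encoding approaches to obtain a fully operational infant iris recognition system (the first, to the best of the authors' knowledge); and (d) training a StyleGAN-based model that synthesizes images mimicking infant iris samples, providing privacy-safe infant iris images for research. On the collected infant iris samples, the system achieves an Equal Error Rate (EER) of 3% and an Area Under the ROC Curve (AUC) of 99%, compared with EER ≥ 20% and AUC ≤ 88% for state-of-the-art adult iris recognition systems, which suggests that extracting biometric features from infant irises may be feasible.
Link: https://arxiv.org/abs/2501.01375
Authors: Rasel Ahmed Bhuiyan, Mateusz Trokielewicz, Piotr Maciejewicz, Sherri Bucher, Adam Czajka
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: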
Abstract:Non-invasive, efficient, physical token-less, accurate and stable identification methods for newborns may prevent baby swapping at birth, limit baby abductions and improve post-natal health monitoring across geographies, within the context of both the formal (i.e., hospitals) and informal (i.e., humanitarian and fragile settings) health sectors. This paper explores the feasibility of applying iris recognition to build biometric identifiers for 4-6 week old infants. We (a) collected near infrared (NIR) iris images from 17 infants using a specially-designed NIR iris sensor; (b) evaluated six iris recognition methods to assess readiness of the state-of-the-art iris recognition to be applied to newborns and infants; (c) proposed a new segmentation model that correctly detects iris texture within infant iris images, and coupled it with several iris texture encoding approaches to offer, to the best of our knowledge, the first fully-operational infant iris recognition system; and, (d) trained a StyleGAN-based model to synthesize iris images mimicking samples acquired from infants to deliver to the research community privacy-safe infant iris images. The proposed system, incorporating the specially-designed iris sensor and segmenter, and applied to the collected infant iris samples, achieved Equal Error Rate (EER) of 3% and Area Under ROC Curve (AUC) of 99%, compared to EER ≥ 20% and AUC ≤ 88% obtained for state of the art adult iris recognition systems. This suggests that it may be feasible to design methods that successfully extract biometric features from infant irises.
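For readers unfamiliar with the Equal Error Rate quoted above, the following sketch shows one common way to estimate it from comparison scores. It is a generic illustration, not the paper's evaluation code: it assumes similarity scores where higher means "more likely the same iris" (iris systems often use distances instead, in which case the comparisons flip), and the function name and the threshold sweep over observed scores are illustrative choices.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from similarity scores (higher = better match).

    genuine_scores:  scores for same-identity comparisons.
    impostor_scores: scores for different-identity comparisons.
    Sweeps every observed score as a threshold and returns the mean of
    the false accept rate and false reject rate where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False accept: an impostor pair scores at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False reject: a genuine pair scores below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

With only a handful of subjects the threshold sweep is coarse, so an EER estimated this way has wide uncertainty; that caveat applies to any small evaluation set.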
[CV-14] CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering
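[Quick read]: This paper addresses errors that current Vision-Language Models (VLMs) make on unanswerable Visual Question Answering (VQA) questions, such as giving (wrong) answers to questions about objects that do not appear in the image. The authors propose CLIP-UP (CLIP-based Unanswerable Problem detection), a lightweight method that uses CLIP (Contrastive Language-Image Pretraining) to extract question-image alignment information so that VLMs can recognize unanswerable questions and withhold an answer. The key point is that CLIP-UP trains only a few additional layers and leaves the original VLM weights unchanged. It achieves state-of-the-art unanswerability detection on the MM-UPD benchmark while preserving the model's performance on other tasks.
Link: https://arxiv.org/abs/2501.01371
Authors: Ben Vardi, Oron Nir, Ariel Shamir
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: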
Abstract:Recent Vision-Language Models (VLMs) have demonstrated remarkable capabilities in visual understanding and reasoning, and in particular on multiple-choice Visual Question Answering (VQA). Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image. To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. By leveraging CLIP to extract question-image alignment information, CLIP-UP requires only efficient training of a few additional layers, while keeping the original VLMs’ weights unchanged. Tested across LLaVA models, CLIP-UP achieves state-of-the-art results on the MM-UPD benchmark for assessing unanswerability in multiple-choice VQA, while preserving the original performance on other tasks.
[CV-15] Test-time Controllable Image Generation by Explicit Spatial Constraint Enforcement
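[Quick read]: This paper addresses the poor test-time generalization of existing text-to-image generation methods, especially under complex spatial conditions such as masks, bounding boxes, and key points. Existing methods usually need form-specific annotations to fine-tune the model, or work only with simplified prompts and spatial conditions. The paper proposes a novel, generic test-time controllable generation method that decouples spatial conditions into semantic and geometric conditions and enforces their consistency separately during image generation. The key components are: 1) completing the prompt to bridge the gap between the semantic condition and the text prompt, and removing the influence of distracting words using attention-map statistics and word-space distances; 2) a geometric transform module that identifies regions of interest (RoIs) in the attention maps and translates category-wise latents according to the geometric condition; and 3) a diffusion-based latents-refill method that explicitly removes the influence of latents at the RoIs and reduces artifacts in the generated images. Experiments show a 30% relative improvement on layout consistency metrics over existing training-free methods.
Link: https://arxiv.org/abs/2501.01368
Authors: Z. Zhang, B. Liu, J. Bao, L. Chen, S. Zhu, J. Yu
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: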
Abstract:Recent text-to-image generation favors various forms of spatial conditions, e.g., masks, bounding boxes, and key points. However, the majority of the prior art requires form-specific annotations to fine-tune the original model, leading to poor test-time generalizability. Meanwhile, existing training-free methods work well only with simplified prompts and spatial conditions. In this work, we propose a novel yet generic test-time controllable generation method that aims at natural text prompts and complex conditions. Specifically, we decouple spatial conditions into semantic and geometric conditions and then enforce their consistency during the image-generation process individually. As for the former, we target bridging the gap between the semantic condition and text prompts, as well as the gap between such condition and the attention map from diffusion models. To achieve this, we propose to first complete the prompt w.r.t. the semantic condition, and then remove the negative impact of distracting prompt words by measuring their statistics in attention maps as well as distances in word space w.r.t. this condition. To further cope with the complex geometric conditions, we introduce a geometric transform module, in which Regions of Interest (RoIs) are identified in attention maps and further used to translate category-wise latents w.r.t. the geometric condition. More importantly, we propose a diffusion-based latents-refill method to explicitly remove the impact of latents at the RoI, reducing the artifacts on generated images. Experiments on the COCO-Stuff dataset showcase a 30% relative boost compared to SOTA training-free methods on layout consistency evaluation metrics.
[CV-16] Domain-invariant feature learning in brain MR imaging for content-based image retrieval
[Quick read]: This paper addresses the domain gap in multi-site brain magnetic resonance (MR) imaging studies, which arises from differences in imaging equipment and protocols and degrades the accuracy of content-based image retrieval (CBIR). The authors propose a new low-dimensional representation (LDR) acquisition method called style encoder adversarial domain adaptation (SE-ADA). SE-ADA separates domain-specific information from the LDR and minimizes domain differences through adversarial learning, reducing the domain gap while preserving pathological features. Experiments on eight public brain MR datasets show that SE-ADA effectively removes domain information while preserving key aspects of the original brain structure, and that it achieves the highest disease retrieval accuracy.
Link: https://arxiv.org/abs/2501.01326
Authors: Shuya Tobari, Shuhei Tomoshige, Hayato Muraki, Kenichi Oishi, Hitoshi Iyatomi
Institution: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 6 pages, 1 figure. Accepted at the SPIE Medical Imaging 2025. Journal reference: Proceedings of the SPIE Medical Imaging, 16–20 February, 2025, San Diego, California, US
Abstract:When conducting large-scale studies that collect brain MR images from multiple facilities, the impact of differences in imaging equipment and protocols at each site cannot be ignored, and this domain gap has become a significant issue in recent years. In this study, we propose a new low-dimensional representation (LDR) acquisition method called style encoder adversarial domain adaptation (SE-ADA) to realize content-based image retrieval (CBIR) of brain MR images. SE-ADA reduces domain differences while preserving pathological features by separating domain-specific information from LDR and minimizing domain differences using adversarial learning. In evaluation experiments comparing SE-ADA with recent domain harmonization methods on eight public brain MR datasets (ADNI1/2/3, OASIS1/2/3/4, PPMI), SE-ADA effectively removed domain information while preserving key aspects of the original brain structure and demonstrated the highest disease search accuracy.
[CV-17] SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration DATE
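【Quick read】: This paper addresses the main challenge of video restoration: maintaining fidelity while recovering temporally consistent details from unknown degradations. Despite recent progress in diffusion-based restoration, existing methods are often limited in generation capability and sampling efficiency. The proposed solution, SeedVR, is a diffusion transformer for real-world video restoration at arbitrary length and resolution. Its core design is shifted window attention, which enables effective restoration of long video sequences. SeedVR also supports variable-sized windows near the boundaries of both the spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Combined with contemporary practices such as a causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly competitive performance on synthetic and real-world benchmarks as well as on AI-generated videos, and the experiments show it outperforming existing generic video restoration methods.
Link: https://arxiv.org/abs/2501.01320
Authors: Jianyi Wang,Zhijie Lin,Meng Wei,Yang Zhao,Ceyuan Yang,Chen Change Loy,Lu Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Draft ver., may be updated in the future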
Abstract:Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR’s superiority over existing methods for generic video restoration.
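The abstract does not spell out how the variable-sized boundary windows are formed, so the following is only a minimal one-dimensional sketch of the general idea, under the assumption that a shifted window grid simply truncates its first and last windows at the sequence boundary instead of padding or wrapping. The function name `window_partition` and its parameters are hypothetical and not taken from SeedVR.

```python
def window_partition(length, window, shift=0):
    """Split range(length) into contiguous (start, end) windows.

    Assumption for illustration: when the grid is shifted, windows that
    cross the boundary are truncated, so the first and last windows may
    be smaller than `window`.
    """
    if window <= 0 or length <= 0:
        raise ValueError("length and window must be positive")
    shift %= window
    bounds = []
    start = 0
    # A shifted grid starts with a shortened first window.
    end = shift if shift else window
    while start < length:
        end = min(end, length)  # truncate at the sequence boundary
        bounds.append((start, end))
        start, end = end, end + window
    return bounds


# Regular grid, then the same grid shifted by half a window.
regular = window_partition(10, 4)
shifted = window_partition(10, 4, shift=2)
```

Attention would then be computed independently inside each `(start, end)` span. Applying the same partition along the height, width and time axes gives a 3D version of the sketch.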
[CV-18] Multi-Head Explainer: A General Framework to Improve Explainability in CNNs and Transformers
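【Quick read】: This paper addresses shortcomings in both the explainability and the accuracy of convolutional neural networks (CNNs) and Transformer-based models. The authors propose the Multi-Head Explainer (MHEX), a versatile and modular framework with three core components: an Attention Gate that dynamically highlights task-relevant features; Deep Supervision, which guides early layers to capture fine-grained details relevant to the target class; and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. With these components, MHEX improves classification accuracy and also produces detailed, highly interpretable saliency scores.
Link: https://arxiv.org/abs/2501.01311
Authors: Bohang Sun,Pietro Liò
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: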
Abstract:In this study, we introduce the Multi-Head Explainer (MHEX), a versatile and modular framework that enhances both the explainability and accuracy of Convolutional Neural Networks (CNNs) and Transformer-based models. MHEX consists of three core components: an Attention Gate that dynamically highlights task-relevant features, Deep Supervision that guides early layers to capture fine-grained details pertinent to the target class, and an Equivalent Matrix that unifies refined local and global representations to generate comprehensive saliency maps. Our approach demonstrates superior compatibility, enabling effortless integration into existing residual networks like ResNet and Transformer architectures such as BERT with minimal modifications. Extensive experiments on benchmark datasets in medical imaging and text classification show that MHEX not only improves classification accuracy but also produces highly interpretable and detailed saliency scores.
[CV-19] HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking
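【Quick read】: This paper addresses the robustness and generalization of multi-object tracking algorithms for Advanced Driver Assistance Systems (ADAS). Traditional statistical model-based trackers rely on predefined motion models and assumptions about system noise distributions. They are computationally efficient, but they adapt poorly to varying traffic scenarios and require extensive manual design and parameter tuning. The paper proposes HybridTrack, a new 3D multi-object tracking method that integrates a data-driven Kalman Filter (KF) into a tracking-by-detection framework. Its key innovation is to learn the state transition residual and the Kalman gain directly from data, removing the need to model motion and stochastic parameters by hand. On the KITTI dataset, HybridTrack reaches 82.08% HOTA (Higher Order Tracking Accuracy), significantly outperforming existing methods, and its fastest configuration runs at 112 FPS, so the gain in accuracy comes with real-time performance.
Link: https://arxiv.org/abs/2501.01275
Authors: Leandro Di Bella,Yangxintong Lyu,Bruno Cornelis,Adrian Munteanu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication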
Abstract:The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.08% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code will be publicly available at the time of publishing: this https URL.
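To make the "learned transition residual and Kalman gain" idea concrete, here is a minimal scalar sketch of one predict/update cycle. It is not the HybridTrack implementation: `residual_fn` and `gain_fn` are hypothetical stand-ins for the learned components, the observation model is assumed to be the identity, and `classic_gain` is the textbook scalar gain shown only for contrast.

```python
def classic_gain(p_pred, r):
    """Textbook scalar Kalman gain for an identity observation model.

    p_pred: predicted state variance, r: measurement noise variance.
    """
    return p_pred / (p_pred + r)


def hybrid_kalman_step(x, z, motion_model, residual_fn, gain_fn):
    """One predict/update cycle for a scalar state.

    motion_model: nominal state transition (e.g. constant velocity).
    residual_fn:  hypothetical learned correction added to the transition.
    gain_fn:      hypothetical learned replacement for the Kalman gain.
    """
    # Predict: nominal motion plus the (learned) residual correction.
    x_pred = motion_model(x) + residual_fn(x)
    # Update: blend the prediction with the measurement z.
    innovation = z - x_pred
    gain = gain_fn(x_pred, innovation)
    return x_pred + gain * innovation


# Placeholder components: zero residual and a fixed classical gain.
x_new = hybrid_kalman_step(
    x=0.0,
    z=2.0,
    motion_model=lambda x: x + 1.0,
    residual_fn=lambda x: 0.0,
    gain_fn=lambda x_pred, innovation: classic_gain(1.0, 1.0),
)
```

In a data-driven variant, the two lambdas would be replaced by trained networks. The classical filter instead derives the gain from hand-specified process and measurement noise covariances, which is the manual modeling the abstract says is eliminated.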
[CV-20] Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging AAAI2025
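【Quick read】: This paper addresses the nonlinear and ill-posed nature of recovering a 3D hyperspectral image (HSI) from a single 2D measurement in coded aperture snapshot spectral imaging systems, where existing methods still struggle with accuracy and stability. The authors propose a Mamba-inspired Joint Unfolding Network (MiJUN) that combines physics-embedded deep unfolding networks (DUNs) with learning-based hyperspectral imaging. The solution has three key parts. First, it uses trapezoid discretization to expand the representation space of unfolding networks and introduces an accelerated unfolding scheme, which can be interpreted as a generalized accelerated half-quadratic splitting with a second-order differential equation; this reduces the reliance on the initial optimization stages and addresses challenges related to long-range interactions. Second, within the Mamba framework, it restructures a Mamba-inspired global-to-local attention mechanism by combining a selective state space model with an attention mechanism, effectively reinterpreting Mamba as a variant of the Transformer architecture and improving its adaptability and efficiency. Third, it refines the scanning strategy by integrating tensor mode-k unfolding into the Mamba network, which emphasizes the low-rank properties of tensors along different modes and conveniently yields 12 scanning directions. Experiments on simulated and real datasets show that MiJUN outperforms existing methods and represents fine details accurately.
Link: https://arxiv.org/abs/2501.01262
Authors: Mengjie Qin,Yuchao Feng,Zongliang Wu,Yulun Zhang,Xin Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures, AAAI 2025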
Abstract:In the coded aperture snapshot spectral imaging system, Deep Unfolding Networks (DUNs) have made impressive progress in recovering 3D hyperspectral images (HSIs) from a single 2D measurement. However, the inherent nonlinear and ill-posed characteristics of HSI reconstruction still pose challenges to existing methods in terms of accuracy and stability. To address this issue, we propose a Mamba-inspired Joint Unfolding Network (MiJUN), which integrates physics-embedded DUNs with learning-based HSI imaging. Firstly, leveraging the concept of trapezoid discretization to expand the representation space of unfolding networks, we introduce an accelerated unfolding network scheme. This approach can be interpreted as a generalized accelerated half-quadratic splitting with a second-order differential equation, which reduces the reliance on initial optimization stages and addresses challenges related to long-range interactions. Crucially, within the Mamba framework, we restructure the Mamba-inspired global-to-local attention mechanism by incorporating a selective state space model and an attention mechanism. This effectively reinterprets Mamba as a variant of the Transformer architecture, improving its adaptability and efficiency. Furthermore, we refine the scanning strategy with Mamba by integrating the tensor mode-k unfolding into the Mamba network. This approach emphasizes the low-rank properties of tensors along various modes, while conveniently facilitating 12 scanning directions. Numerical and visual comparisons on both simulation and real datasets demonstrate the superiority of our proposed MiJUN, and achieving overwhelming detail representation.
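For readers unfamiliar with tensor mode-k unfolding, the sketch below shows the operation on a small nested-list tensor. It is a generic illustration, not the paper's scanning code: the helper names are hypothetical, and the column ordering (remaining indices enumerated in row-major order) is an assumed convention, since several orderings are in common use.

```python
from itertools import product


def shape_of(tensor):
    """Return the shape of a regular nested list."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


def element(tensor, index):
    """Fetch one entry of a nested list by a full multi-index."""
    for i in index:
        tensor = tensor[i]
    return tensor


def unfold(tensor, mode):
    """Mode-k unfolding: rows follow axis `mode`, columns flatten the rest.

    Assumed convention: the remaining axes keep their original order and
    are enumerated row-major (last remaining axis varies fastest).
    """
    shape = shape_of(tensor)
    others = [axis for axis in range(len(shape)) if axis != mode]
    rows = []
    for i in range(shape[mode]):
        row = []
        for rest in product(*(range(shape[axis]) for axis in others)):
            index = list(rest)
            index.insert(mode, i)  # put the mode index back in place
            row.append(element(tensor, index))
        rows.append(row)
    return rows


# A 2 x 2 x 2 tensor whose entries are numbered 0..7 in row-major order.
cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
unfolded_along_last_axis = unfold(cube, 2)
```

Each mode produces a different linear ordering of the same entries, which is why unfoldings along different modes can serve as distinct scan orders for a sequence model. How the paper arrives at exactly 12 directions is not stated in the abstract.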
[CV-21] SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization AAAI2025
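【Quick read】: This paper addresses the challenges of Fine-grained Action Recognition (FAR), which targets detailed semantic labels within short temporal spans (for example, "salto backward tucked with 1 turn"). Because annotating fine-grained labels is costly and fine-tuning large language models (LLMs) requires substantial data, the paper adopts semi-supervised learning (SSL). Its framework, SeFAR, constructs Dual-level temporal elements to capture sufficient visual detail and, on top of them, designs a new strong augmentation strategy that adds moderate temporal perturbation to the Teacher-Student learning paradigm. To handle the high uncertainty of the teacher model's predictions in FAR, the paper also proposes Adaptive Regulation to stabilize learning. Experiments show that SeFAR achieves state-of-the-art performance on the two FAR datasets FineGym and FineDiving, and outperforms other semi-supervised methods on the two classical coarse-grained datasets UCF101 and HMDB51. Further analysis and ablation studies validate these designs and show that features extracted by SeFAR can substantially improve the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
Link: https://arxiv.org/abs/2501.01245
Authors: Yongle Huang,Haodong Chen,Zhenbang Xu,Zihan Jia,Haozhou Sun,Dian Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI 2025; Code: this https URL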
Abstract:Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., “salto backward tucked with 1 turn”). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model’s predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
[CV-22] Asymmetric Reinforcing against Multi-modal Representation Bias AAAI2025
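【Quick read】: This paper addresses the multimodal representation bias caused by dynamic modality contributions in multimodal learning. In multimodal systems, which modality dominates can change with the environment, so some modalities underperform and overall performance suffers. Existing methods mainly balance the bias by enhancing weak modalities, but this optimizes from a partial-modality perspective and easily degrades the dominant modalities. The paper proposes an Asymmetric Reinforcing method against Multimodal representation bias (ARM). ARM uses conditional mutual information to dynamically reinforce weak modalities while maintaining the ability to represent dominant ones. The paper also analyzes in depth how optimizing only certain modalities can cause information loss and prevent the full advantages of multimodal data from being used. By exploring modality dominance and narrowing the contribution gaps between modalities, ARM significantly improves multimodal learning performance and mitigates imbalanced multimodal learning.
Link: https://arxiv.org/abs/2501.01240
Authors: Xiyuan Gao,Bing Cao,Pengfei Zhu,Nannan Wang,Qinghua Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025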
Abstract:The strength of multimodal learning lies in its ability to integrate information from various sources, providing rich and comprehensive insights. However, in real-world scenarios, multi-modal systems often face the challenge of dynamic modality contributions, the dominance of different modalities may change with the environments, leading to suboptimal performance in multimodal learning. Current methods mainly enhance weak modalities to balance multimodal representation bias, which inevitably optimizes from a partial-modality perspective, easily leading to performance descending for dominant modalities. To address this problem, we propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Moreover, we provide an in-depth analysis that optimizing certain modalities could cause information loss and prevent leveraging the full advantages of multimodal data. By exploring the dominance and narrowing the contribution gaps between modalities, we have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
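Conditional mutual information, the quantity the abstract names, has a simple closed form for discrete variables: I(X;Y|Z) = Σ p(x,y,z) log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. The sketch below evaluates that textbook definition on a small joint table. It is not ARM's estimator, which the abstract does not describe, and the function name is hypothetical.

```python
from collections import defaultdict
from math import log2


def conditional_mutual_information(joint):
    """I(X; Y | Z) in bits for a discrete joint distribution.

    joint: dict mapping (x, y, z) -> probability, assumed to sum to 1.
    """
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yz = defaultdict(float)
    # Accumulate the marginals needed by the definition.
    for (x, y, z), p in joint.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p
    total = 0.0
    for (x, y, z), p in joint.items():
        if p > 0:  # zero-probability cells contribute nothing
            total += p * log2(p_z[z] * p / (p_xz[(x, z)] * p_yz[(y, z)]))
    return total


# X and Y are identical fair bits and Z is constant: Y tells us 1 bit about X.
copy_given_constant = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}
# X, Y and Z are all the same bit: once Z is known, Y adds nothing about X.
copy_given_copy = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
```

Read with X as the label and Y and Z as two modalities, the second table is the case where a modality is fully redundant given the other, so its conditional contribution is zero.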
[CV-23] EHCTNet: Enhanced Hybrid of CNN and Transformer Network for Remote Sensing Image Change Detection
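【Quick read】: This paper addresses the high cost of false negatives in remote sensing (RS) change detection. Existing frameworks concentrate on improving Precision to reduce the cost of false positives, yet they still fail to focus sufficiently on the changes of interest, which leads to missed detections and discontinuities. The key to the solution is to strengthen feature learning and integrate the frequency components of feature information, with a strategy that incrementally raises Recall. Specifically, the paper proposes an enhanced hybrid of CNN and Transformer network (EHCTNet): a dual-branch feature extraction module extracts multi-scale features from RS images, refined module I exploits the frequency components of those features, an enhanced token mining module based on the Kolmogorov Arnold Network derives semantic information, and refined module II mines the frequency components of the semantic change information that benefit the final detection. Experiments validate EHCTNet's effectiveness in understanding complex changes of interest, and visualizations show that it detects more complete and continuous changed regions and distinguishes neighboring regions more accurately.
Link: https://arxiv.org/abs/2501.01238
Authors: Junjie Yang,Haibo Wan,Zhihai Shang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: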
Abstract:Remote sensing (RS) change detection incurs a high cost because of false negatives, which are more costly than false positives. Existing frameworks, struggling to improve the Precision metric to reduce the cost of false positive, still have limitations in focusing on the change of interest, which leads to missed detections and discontinuity issues. This work tackles these issues by enhancing feature learning capabilities and integrating the frequency components of feature information, with a strategy to incrementally boost the Recall value. We propose an enhanced hybrid of CNN and Transformer network (EHCTNet) for effectively mining the change information of interest. Firstly, a dual branch feature extraction module is used to extract the multi scale features of RS images. Secondly, the frequency component of these features is exploited by a refined module I. Thirdly, an enhanced token mining module based on the Kolmogorov Arnold Network is utilized to derive semantic information. Finally, the semantic change information’s frequency component, beneficial for final detection, is mined from the refined module II. Extensive experiments validate the effectiveness of EHCTNet in comprehending complex changes of interest. The visualization outcomes show that EHCTNet detects more intact and continuous changed areas and perceives more accurate neighboring distinction than state of the art models.
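The Precision and Recall trade-off that motivates this work follows directly from the confusion counts: false positives lower Precision, while false negatives (missed changes) lower Recall. A minimal sketch with made-up pixel counts, not results from the paper:

```python
def precision_recall(tp, fp, fn):
    """Precision and Recall from true-positive, false-positive and
    false-negative counts; returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical detector: few false alarms, but half of the changed pixels missed.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
```

A detector like this one looks strong on Precision while still missing half of the real changes, which is the failure mode the abstract describes as more costly.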
[CV-24] SVFR: A Unified Framework for Generalized Video Face Restoration
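【Quick read】: This paper addresses key problems in video face restoration, in particular temporal consistency, motion artifacts, and the scarcity of high-quality video data. Traditional face restoration typically prioritizes resolution enhancement and pays less attention to related tasks such as facial colorization and inpainting. The paper therefore proposes the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization, and presents a unified framework called Stable Video Face Restoration (SVFR). The framework leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding improves task identification, and a Unified Latent Regularization (ULR) encourages shared feature representations across the subtasks. To further improve restoration quality and temporal stability, the paper introduces facial prior learning and self-referred refinement as auxiliary strategies. The framework combines the complementary strengths of these tasks, improving temporal coherence and restoration quality and advancing the state of the art in video face restoration.
Link: https://arxiv.org/abs/2501.01235
Authors: Zhiyao Wang,Xu Chen,Chengming Xu,Junwei Zhu,Xiaobin Hu,Jiangning Zhang,Chengjie Wang,Yuqi Liu,Yiyi Zhou,Rongrong Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: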
Abstract:Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video BFR, inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed as stable video face restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage the shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce the facial prior learning and the self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration.
[CV-25] Exploiting Latent Properties to Optimize Neural Codecs
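【Quick read】: This paper addresses the fact that current end-to-end image and video codecs do not fully exploit vector quantization or the entropy gradient that is available at the decoder. Although neural codecs now surpass traditional compression techniques in many respects, these two properties remain underused. The proposed solution has two key points. First, the authors show that non-uniform scalar quantization cannot improve on uniform quantization, and therefore suggest using a predefined optimal uniform vector quantization to improve performance. Second, they show that the entropy gradient, which is available at the decoder, is correlated with the reconstruction error gradient, which is not, and they use the former as a proxy for the latter to improve compression performance. Experiments show that these approaches save between 1% and 3% of the bitrate at the same quality across various pretrained methods, and that the entropy-gradient-based solution also significantly improves the performance of traditional codecs.
Link: https://arxiv.org/abs/2501.01231
Authors: Muhammet Balcilar,Bharath Bhushan Damodaran,Karam Naser,Franck Galpin,Pierre Hellier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in IEEE TRANSACTIONS ON IMAGE PROCESSING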
Abstract:End-to-end image and video codecs are becoming increasingly competitive, compared to traditional compression techniques that have been developed through decades of manual engineering efforts. These trainable codecs have many advantages over traditional techniques, such as their straightforward adaptation to perceptual distortion metrics and high performance in specific fields thanks to their learning ability. However, current state-of-the-art neural codecs do not fully exploit the benefits of vector quantization and the existence of the entropy gradient in decoding devices. In this paper, we propose to leverage these two properties (vector quantization and entropy gradient) to improve the performance of off-the-shelf codecs. Firstly, we demonstrate that using non-uniform scalar quantization cannot improve performance over uniform quantization. We thus suggest using predefined optimal uniform vector quantization to improve performance. Secondly, we show that the entropy gradient, available at the decoder, is correlated with the reconstruction error gradient, which is not available at the decoder. We therefore use the former as a proxy to enhance compression performance. Our experimental results show that these approaches save between 1 to 3% of the rate for the same quality across various pretrained methods. In addition, the entropy gradient based solution improves traditional codec performance significantly as well.
[CV-26] Conditional Consistency Guided Image Translation and Enhancement ICME
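【Quick read】: This paper addresses multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement, where the use of consistency models remains largely unexplored. The paper proposes Conditional Consistency Models (CCMs), which add conditional inputs that guide the denoising process so that the generated outputs retain structural and contextual information from the corresponding input domain. The key to the solution is the introduction of task-specific conditional inputs, which preserves output quality while keeping single-step sample generation. The paper evaluates CCMs on 10 different datasets and demonstrates their effectiveness for multi-domain image translation.
Link: https://arxiv.org/abs/2501.01223
Authors: A. V. Subramanyam,Amil Bhagat,Milind Jain
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, 4 tables, ICME conference 2025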
Abstract:Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at this https URL.
[CV-27] Real-time Cross-modal Cybersickness Prediction in Virtual Reality
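【Quick read】: This paper addresses cybersickness, a widespread problem in immersive virtual reality (VR) that seriously harms user engagement and comfort. Existing deep learning methods, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), struggle to capture the complex interactions between data modalities (head and eye tracking data, physiological data, and so on) and cannot run inference in real time, which limits their practical use. The paper proposes a lightweight model that combines a Transformer-based encoder with sparse self-attention for bio-signal features with a PP-TSN network for video feature extraction. A cross-modal fusion module then produces a video-aware bio-signal representation that supports cybersickness prediction from both visual and bio-signal inputs. Validated on a public dataset containing eye and head tracking data, physiological data, and VR video, the model reaches 93.13% prediction accuracy using only VR video inputs. The approach enables effective real-time cybersickness prediction, addresses the long-standing issue of modality interaction in VR environments, and lays a foundation for future work on multimodal data integration in VR, which may lead to more personalized, comfortable, and widely accessible VR experiences.
Link: https://arxiv.org/abs/2501.01212
Authors: Yitong Zhu,Tangyao Li,Yuyang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: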
Abstract:Cybersickness remains a significant barrier to the widespread adoption of immersive virtual reality (VR) experiences, as it can greatly disrupt user engagement and comfort. Research has shown that cybersickness can significantly be reflected in head and eye tracking data, along with other physiological data (e.g., TMP, EDA, and BMP). Despite the application of deep learning techniques such as CNNs and LSTMs, these models often struggle to capture the complex interactions between multiple data modalities and lack the capacity for real-time inference, limiting their practical application. Addressing this gap, we propose a lightweight model that leverages a transformer-based encoder with sparse self-attention to process bio-signal features and a PP-TSN network for video feature extraction. These features are then integrated via a cross-modal fusion module, creating a video-aware bio-signal representation that supports cybersickness prediction based on both visual and bio-signal inputs. Our model, trained with a lightweight framework, was validated on a public dataset containing eye and head tracking data, physiological data, and VR video, and demonstrated state-of-the-art performance in cybersickness prediction, achieving a high accuracy of 93.13% using only VR video inputs. These findings suggest that our approach not only enables effective, real-time cybersickness prediction but also addresses the longstanding issue of modality interaction in VR environments. This advancement provides a foundation for future research on multimodal data integration in VR, potentially leading to more personalized, comfortable and widely accessible VR experiences.
[CV-28] LayeringDiff: Layered Image Synthesis via Generation then Disassembly with Generative Knowledge
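【Quick read】: This paper addresses a key problem in layered image synthesis: how to generate layered images with independently controllable foreground and background without large-scale training. The proposed pipeline, LayeringDiff, first generates a composite image with an off-the-shelf image generative model and then disassembles that image into its foreground and background layers. Because the layers are extracted from a composite image rather than generated from scratch, the method bypasses the large-scale training otherwise needed to develop generative capabilities for individual layers. For the decomposition, the paper adapts a large-scale pretrained generative prior to estimate the foreground and background layers, and uses high-frequency alignment modules to refine the fine details of the estimated layers. Experiments show that the method synthesizes layered images effectively and supports a variety of practical applications.
Link: https://arxiv.org/abs/2501.01197
Authors: Kyoungkook Kang,Gyujin Sim,Geonung Kim,Donguk Kim,Seungho Nam,Sunghyun Cho
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: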
Abstract:Layers have become indispensable tools for professional artists, allowing them to build a hierarchical structure that enables independent control over individual visual elements. In this paper, we propose LayeringDiff, a novel pipeline for the synthesis of layered images, which begins by generating a composite image using an off-the-shelf image generative model, followed by disassembling the image into its constituent foreground and background layers. By extracting layers from a composite image, rather than generating them from scratch, LayeringDiff bypasses the need for large-scale training to develop generative capabilities for individual layers. Furthermore, by utilizing a pretrained off-the-shelf generative model, our method can produce diverse contents and object scales in synthesized layers. For effective layer decomposition, we adapt a large-scale pretrained generative prior to estimate foreground and background layers. We also propose high-frequency alignment modules to refine the fine-details of the estimated layers. Our comprehensive experiments demonstrate that our approach effectively synthesizes layered images and supports various practical applications.
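The abstract does not state the layer model, so the sketch below assumes the standard alpha-compositing relation, composite = alpha * foreground + (1 - alpha) * background, purely to illustrate what "disassembling" a composite has to invert. The function name and the tiny grayscale images are hypothetical.

```python
def composite(foreground, background, alpha):
    """Alpha-composite two grayscale images given as lists of rows.

    Assumed layer model (not confirmed by the abstract):
    out = alpha * foreground + (1 - alpha) * background, per pixel.
    """
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


# One-row example: the first pixel is pure foreground, the second pure background.
image = composite(
    foreground=[[1.0, 1.0]],
    background=[[0.0, 0.5]],
    alpha=[[1.0, 0.0]],
)
```

Decomposition is the ill-posed inverse: from `image` alone, the foreground, the alpha matte, and the occluded background must all be estimated, which is where a pretrained generative prior can help.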
[CV-29] Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views AAAI2025
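【Quick read】: This paper addresses the degradation of monocular priors under scale ambiguity when reconstructing indoor scene geometry from sparse views. Existing methods typically need hundreds of images for high-quality reconstruction; with only a few input views, the accuracy of monocular priors drops sharply and the reconstructed geometry collapses. The paper proposes a new method, Sparis, which introduces a novel prior based on inter-image matching information that provides more accurate depth information while ensuring cross-view matching consistency. It also uses an angular filter strategy and an epipolar matching weight function to reduce errors caused by inaccurate view matching, refining the inter-image prior and improving reconstruction accuracy. Experiments show superior performance in sparse-view scene reconstruction.
Link: https://arxiv.org/abs/2501.01196
Authors: Yulun Wu,Han Huang,Wenyuan Zhang,Chao Deng,Ge Gao,Ming Gu,Yu-Shen Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025. Project page: this https URL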
Abstract:In recent years, reconstructing indoor scene geometry from multi-view images has achieved encouraging accomplishments. Current methods incorporate monocular priors into neural implicit surface models to achieve high-quality reconstructions. However, these methods require hundreds of images for scene reconstruction. When only a limited number of views are available as input, the performance of monocular priors deteriorates due to scale ambiguity, leading to the collapse of the reconstructed scene geometry. In this paper, we propose a new method, named Sparis, for indoor surface reconstruction from sparse views. Specifically, we investigate the impact of monocular priors on sparse scene reconstruction, introducing a novel prior based on inter-image matching information. Our prior offers more accurate depth information while ensuring cross-view matching consistency. Additionally, we employ an angular filter strategy and an epipolar matching weight function, aiming to reduce errors due to view matching inaccuracies, thereby refining the inter-image prior for improved reconstruction accuracy. The experiments conducted on widely used benchmarks demonstrate superior performance in sparse-view scene reconstruction.
[CV-30] Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection
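【Quick read】: This paper addresses two main problems in deepfake video detection: existing methods struggle to focus on the important forgery artifacts, which limits their generalization, and existing models lack interpretability, which makes their predictions hard to understand. The paper proposes FakeSTormer, with two key contributions. First, it introduces a multi-task learning framework with additional spatial and temporal branches that lets the model focus on subtle spatio-temporal artifacts and provides interpretability by highlighting video regions that may contain them. Second, it proposes a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, giving the model high-quality samples and ground-truth data for the spatial and temporal branches. Experiments show that the method is competitive on several challenging benchmarks.
Link: https://arxiv.org/abs/2501.01184
Authors: Dat Nguyen,Marcella Astrid,Anis Kacem,Enjie Ghorbel,Djamila Aouada
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: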
Abstract:Detecting deepfake videos is highly challenging due to the complex intertwined spatial and temporal artifacts in forged sequences. Most recent approaches rely on binary classifiers trained on both real and fake data. However, such methods may struggle to focus on important artifacts, which can hinder their generalization capability. Additionally, these models often lack interpretability, making it difficult to understand how predictions are made. To address these issues, we propose FakeSTormer, offering two key contributions. First, we introduce a multi-task learning framework with additional spatial and temporal branches that enable the model to focus on subtle spatio-temporal artifacts. These branches also provide interpretability by highlighting video regions that may contain artifacts. Second, we propose a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, providing the model with high-quality samples and ground truth data for our spatial and temporal branches. Extensive experiments on several challenging benchmarks demonstrate the competitiveness of our approach compared to recent state-of-the-art methods. The code is available at this https URL.
[CV-31] L3D-Pose: Lifting Pose for 3D Avatars from a Single Camera in the Wild ICASSP2025
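【Quick read】: This paper addresses the difficulty of obtaining 3D pose estimation datasets for animals and primates, whose behaviour in natural settings is dynamic and unpredictable. Existing 2D pose estimation lacks depth information, which limits its range of applications; 3D pose estimation is more comprehensive, but building large-scale 3D pose datasets remains challenging. The paper proposes a hybrid approach that uses rigged avatars and a synthetic dataset generation pipeline to obtain the 3D annotations needed for training. The key elements are: 1) a simple attention-based multilayer perceptron (MLP) network that lifts 2D poses to 3D and is independent of the input image, so that it scales to poses in natural environments; and 2) a lookup table, built with a deep pose estimation method, for retargeting poses onto arbitrary avatars, which compensates for the shortcomings of existing anatomical keypoint detectors in pose retargeting. Experiments show that this lookup-table-based retargeting is both efficient and effective. Overall, the paper presents a systematic framework that lifts 2D poses to 3D using synthesized datasets and then uses the result to retarget motion from in-the-wild settings onto arbitrary avatars.
Link: https://arxiv.org/abs/2501.01174
Authors: Soumyaratna Debnath,Harish Katti,Shashikant Verma,Shanmuganathan Raman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)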
Abstract:While 2D pose estimation has advanced our ability to interpret body movements in animals and primates, it is limited by the lack of depth information, constraining its application range. 3D pose estimation provides a more comprehensive solution by incorporating spatial depth, yet creating extensive 3D pose datasets for animals is challenging due to their dynamic and unpredictable behaviours in natural settings. To address this, we propose a hybrid approach that utilizes rigged avatars and the pipeline to generate synthetic datasets to acquire the necessary 3D annotations for training. Our method introduces a simple attention-based MLP network for converting 2D poses to 3D, designed to be independent of the input image to ensure scalability for poses in natural environments. Additionally, we identify that existing anatomical keypoint detectors are insufficient for accurate pose retargeting onto arbitrary avatars. To overcome this, we present a lookup table based on a deep pose estimation method using a synthetic collection of diverse actions rigged avatars perform. Our experiments demonstrate the effectiveness and efficiency of this lookup table-based retargeting approach. Overall, we propose a comprehensive framework with systematically synthesized datasets for lifting poses from 2D to 3D and then utilize this to re-target motion from wild settings onto arbitrary avatars.
[CV-32] Deep Learning in Palmprint Recognition-A Comprehensive Survey
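【Quick read】: This paper addresses the limited representation capability of traditional handcrafted methods for palmprint recognition, which depend heavily on researchers' prior knowledge and therefore perform less well in practice. Deep learning (DL), with its notable successes across many domains, has been introduced to overcome this limitation. The paper's contribution is a systematic review of recent progress in DL-based palmprint recognition, covering key tasks such as region-of-interest segmentation, feature extraction, and security and privacy challenges. By consolidating the latest results, it gives researchers a comprehensive reference for keeping up with cutting-edge techniques and driving innovation in palmprint recognition.
Link: https://arxiv.org/abs/2501.01166
Authors: Chengrui Gao,Ziyuan Yang,Wei Jia,Lu Leng,Bob Zhang,Andrew Beng Jin Teoh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Palmprint recognition, biometrics, deep learning, feature extraction, recognition tasks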
Abstract:Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers’ prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.
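[CV-33] Towards Interactive Deepfake Analysis
[Quick read]: The paper addresses a limitation of existing deepfake analysis methods, which are mostly based on discriminative models and are therefore restricted in their application scenarios. It proposes interactive deepfake analysis built on multi-modal large language models (MLLMs). The solution has three parts: (1) a GPT-assisted data construction process that yields an instruction-following dataset called DFA-Instruct; (2) a benchmark called DFA-Bench for comprehensively evaluating MLLMs on deepfake detection, deepfake classification, and artifact description; and (3) an interactive deepfake analysis system called DFA-GPT, which uses a Low-Rank Adaptation (LoRA) module and serves as a strong baseline for the community. Together these address the lack of datasets and benchmarks and the low training efficiency, supporting further work on deepfake analysis.
Link: https://arxiv.org/abs/2501.01164
Authors: Lixiong Qin,Ning Jiang,Yang Zhang,Yuhan Qiu,Dingheng Zeng,Jiani Hu,Weihong Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: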
Abstract:Existing deepfake analysis methods are primarily based on discriminative models, which significantly limit their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This will face challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) construct an interactive deepfake analysis system called DFA-GPT, as a strong baseline for the community, with the Low-Rank Adaptation (LoRA) module. The dataset and code will be made available at this https URL to facilitate further research.
[CV-34] 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
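[Quick read]: The paper targets the shortcomings of current 3D Large Multimodal Models (3D LMMs) in fine-grained scene understanding and flexible human-agent interaction. Although existing 3D LMMs show great potential in 3D-vision-based dialogue and reasoning, enhancing them for these two goals remains challenging. The paper proposes 3D-LLaVA, a simple yet powerful 3D LMM intended to act as an intelligent assistant for comprehending, reasoning about, and interacting with the 3D world. Its key design choice is minimalism: it takes only point clouds as input and avoids complicated multi-view feature extraction and additional task-specific heads. At its core is a new Omni Superpoint Transformer (OST) that integrates three functions: a visual feature selector, a visual prompt encoder, and a referring mask decoder. The OST acquires perception priors through hybrid pretraining and serves as the visual connector that bridges 3D data and the large language model (LLM). After unified instruction tuning, 3D-LLaVA reports strong results on multiple benchmarks.
Link: https://arxiv.org/abs/2501.01163
Authors: Jiajun Deng,Tianyu He,Li Jiang,Tianyu Wang,Feras Dayoub,Ian Reid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: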
Abstract:Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D LMMs to achieve fine-grained scene understanding and facilitate flexible human-agent interaction remains a challenging problem. In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. Unlike existing top-performing methods that rely on complicated pipelines-such as offline multi-view feature extraction or additional task-specific heads-3D-LLaVA adopts a minimalist design with integrated architecture and only takes point clouds as input. At the core of 3D-LLaVA is a new Omni Superpoint Transformer (OST), which integrates three functionalities: (1) a visual feature selector that converts and selects visual tokens, (2) a visual prompt encoder that embeds interactive visual prompts into the visual token space, and (3) a referring mask decoder that produces 3D masks based on text description. This versatile OST is empowered by the hybrid pretraining to obtain perception priors and leveraged as the visual connector that bridges the 3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA reports impressive results on various benchmarks. The code and model will be released to promote future exploration.
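[CV-35] TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions
[Quick read]: The paper tackles text-to-virtual-reality (VR) video generation, which is hampered by scarce training data and by the difficulty of achieving realistic depth and motion in virtual environments. The key idea is to combine existing generative models in three stages to produce stereoscopic VR video from text. First, a base text-to-image model captures context from the input text. Second, Stable Diffusion enhances the rudimentary image to produce frames with greater realism and overall quality. Third, depth estimation algorithms process these frames to create left-eye and right-eye views, which are stitched together for an immersive viewing experience. The visual quality of the generated frames is assessed quantitatively with Fréchet Inception Distance and CLIP Score, which supports the method's effectiveness. The work illustrates the potential of natural-language-driven graphics in areas such as VR simulation.
Link: https://arxiv.org/abs/2501.01156
Authors: Vriksha Srihari,R. Bhavya,Shruti Jayaraman,V. Mary Anita Rajam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, published in 2024 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI)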
Abstract:While generative models such as text-to-image, large language models and text-to-video have seen significant progress, the extension to text-to-virtual-reality remains largely unexplored, due to a deficit in training data and the complexity of achieving realistic depth and motion in virtual environments. This paper proposes an approach to coalesce existing generative systems to form a stereoscopic virtual reality video from text. Carried out in three main stages, we start with a base text-to-image model that captures context from an input text. We then employ Stable Diffusion on the rudimentary image produced, to generate frames with enhanced realism and overall quality. These frames are processed with depth estimation algorithms to create left-eye and right-eye views, which are stitched side-by-side to create an immersive viewing experience. Such systems would be highly beneficial in virtual reality production, since filming and scene building often require extensive hours of work and post-production effort. We utilize image evaluation techniques, specifically Fréchet Inception Distance and CLIP Score, to assess the visual quality of frames produced for the video. These quantitative measures establish the proficiency of the proposed method. Our work highlights the exciting possibilities of using natural language-driven graphics in fields like virtual reality simulations. 
Journal reference: TexAVi: Generating Stereoscopic VR Video Clips from Text Descriptions, 2024 IEEE International Conference on Computer Vision and Machine Intelligence (CVMI), Prayagraj, India, 2024, pp. 1-6. Related DOI: https://doi.org/10.1109/CVMI61877.2024.10782691
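The final stage described in the abstract, turning a frame plus depth into left-eye and right-eye views stitched side by side, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: it assumes the depth map has already been converted to a non-negative integer per-pixel disparity (larger for nearer pixels), uses naive forward warping with no occlusion handling, and leaves holes at a fill value. The names `shift_row` and `stereo_side_by_side` are hypothetical.

```python
def shift_row(row, disparity_row, direction, fill=0):
    """Forward-warp one image row horizontally by a per-pixel disparity.

    Assumption: disparities are non-negative integers; pixels shifted
    outside the row are dropped and uncovered positions keep `fill`.
    """
    out = [fill] * len(row)
    for x, (pixel, disparity) in enumerate(zip(row, disparity_row)):
        new_x = x + direction * disparity
        if 0 <= new_x < len(row):
            out[new_x] = pixel
    return out


def stereo_side_by_side(image, disparity):
    """Build left/right views from a disparity map and stitch them per row."""
    frame = []
    for row, disparity_row in zip(image, disparity):
        left = shift_row(row, disparity_row, +1)   # left-eye view: shift right
        right = shift_row(row, disparity_row, -1)  # right-eye view: shift left
        frame.append(left + right)                 # side-by-side layout
    return frame
```

A practical system would additionally inpaint the holes and resolve overlaps by depth order; both are omitted here to keep the sketch short.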
[CV-36] Adaptive Hardness-driven Augmentation and Alignment Strategies for Multi-Source Domain Adaptations
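[Quick read]: The paper addresses three issues in Multi-source Domain Adaptation (MDA): (1) the potential of data augmentation is underused; (2) the importance of intra-domain alignment is overlooked; and (3) effective cluster-level constraints are lacking. Traditional methods mainly achieve inter-domain alignment through sample-level constraints such as Maximum Mean Discrepancy (MMD) and do not consider these three aspects together.
The proposed solution, "A3MDA", considers them jointly through adaptive hardness quantification and data augmentation. Specifically, A3MDA introduces three Adaptive Hardness Measurements (AHM): Basic AHM, Smooth AHM, and Comparative AHM. Basic AHM gauges the instantaneous hardness of each source and target sample. Smooth AHM adaptively adjusts the intensity of strong data augmentation to stay compatible with the model's generalization ability. Comparative AHM strengthens cluster-level constraints by turning the traditional MMD into a weighted-clustered variant, which improves the robustness and precision of inter-domain alignment. For intra-domain alignment, A3MDA selects harder samples to build a pseudo-contrastive matrix, which improves pseudo-label quality and shapes a well-clustered target feature space. Experiments show that A3MDA outperforms other methods on multiple MDA benchmarks.
Link: https://arxiv.org/abs/2501.01142
Authors: Yang Yuxiang,Zeng Xinyi,Zeng Pinxian,Zu Chen,Yan Binyu,Zhou Jiliu,Wang Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures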
Abstract:Multi-source Domain Adaptation (MDA) aims to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Nevertheless, traditional methods primarily focus on achieving inter-domain alignment through sample-level constraints, such as Maximum Mean Discrepancy (MMD), neglecting three pivotal aspects: 1) the potential of data augmentation, 2) the significance of intra-domain alignment, and 3) the design of cluster-level constraints. In this paper, we introduce a novel hardness-driven strategy for MDA tasks, named “A3MDA”, which collectively considers these three aspects through Adaptive hardness quantification and utilization in both data Augmentation and domain Alignment. To achieve this, “A3MDA” progressively proposes three Adaptive Hardness Measurements (AHM), i.e., Basic, Smooth, and Comparative AHMs, each incorporating distinct mechanisms for diverse scenarios. Specifically, Basic AHM aims to gauge the instantaneous hardness for each source/target sample. Then, hardness values measured by Smooth AHM will adaptively adjust the intensity level of strong data augmentation to maintain compatibility with the model’s generalization capacity. In contrast, Comparative AHM is designed to facilitate cluster-level constraints. By leveraging hardness values as sample-specific weights, the traditional MMD is enhanced into a weighted-clustered variant, strengthening the robustness and precision of inter-domain alignment. As for the often-neglected intra-domain alignment, we adaptively construct a pseudo-contrastive matrix by selecting harder samples based on the hardness rankings, enhancing the quality of pseudo-labels, and shaping a well-clustered target feature space. Experiments on multiple MDA benchmarks show that “A3MDA” outperforms other methods.
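The idea of using hardness values as sample-specific weights inside MMD can be illustrated with a small sketch. It assumes a linear kernel, for which the squared MMD reduces to the squared distance between the (weighted) feature means. The kernel choice, the weight normalization, and the function names are illustrative assumptions, and the sketch does not reproduce the paper's weighted-clustered formulation.

```python
def weighted_mean(features, weights):
    """Weighted average of equal-length feature vectors."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(features[0])
    return [
        sum(w * f[i] for f, w in zip(features, weights)) / total
        for i in range(dim)
    ]


def weighted_linear_mmd2(source, target, source_weights, target_weights):
    """Squared MMD with a linear kernel and per-sample weights.

    Assumption: the weights stand in for per-sample hardness values.
    """
    mu_s = weighted_mean(source, source_weights)
    mu_t = weighted_mean(target, target_weights)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))
```

With uniform weights this is the ordinary linear-kernel MMD; raising the weight of a sample pulls its domain's mean toward that sample, so harder samples contribute more to the alignment term.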
[CV-37] Missing Data as Augmentation in the Earth Observation Domain: A Multi-View Learning Approach
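[Quick read]: The paper addresses the drop in predictive performance that missing data causes for Multi-view Learning (MVL) in the Earth Observation (EO) domain. It proposes new MVL methods for the missing-view setting that aim to improve robustness and predictive performance when views are absent. The key idea is to handle missing views with dynamic merge functions, such as averaging or a more complex Transformer, instead of replacing missing data with a numerical value. By simulating all combinations of missing views as different training samples, the model learns to ignore missing views entirely, which strengthens its predictive robustness. Experiments show that the methods improve robustness under moderate missingness and improve predictive performance when all views are present.
Link: https://arxiv.org/abs/2501.01132
Authors: Francisco Mena,Diego Arenas,Andreas Dengel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: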
Abstract:Multi-view learning (MVL) leverages multiple sources or views of data to enhance machine learning model performance and robustness. This approach has been successfully used in the Earth Observation (EO) domain, where views have a heterogeneous nature and can be affected by missing data. Despite the negative effect that missing data has on model predictions, the ML literature has used it as an augmentation technique to improve model generalization, like masking the input data. Inspired by this, we introduce novel methods for EO applications tailored to MVL with missing views. Our methods integrate the combination of a set to simulate all combinations of missing views as different training samples. Instead of replacing missing data with a numerical value, we use dynamic merge functions, like average, and more complex ones like Transformer. This allows the MVL model to entirely ignore the missing views, enhancing its predictive robustness. We experiment on four EO datasets with temporal and static views, including state-of-the-art methods from the EO domain. The results indicate that our methods improve model robustness under conditions of moderate missingness, and improve the predictive performance when all views are present. The proposed methods offer a single adaptive solution to operate effectively with any combination of available views.
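The two ingredients named in the abstract, enumerating combinations of available views as training samples and merging only the views that are present, can be sketched as follows. This is a minimal illustration using the average merge; the function names and the dictionary-of-lists feature layout are assumptions made for the example, not the authors' code.

```python
from itertools import combinations


def view_subsets(view_names):
    """All non-empty subsets of views, each usable as one training variant."""
    subsets = []
    for size in range(1, len(view_names) + 1):
        subsets.extend(combinations(view_names, size))
    return subsets


def average_merge(features, available):
    """Average per-view feature vectors over the available views only.

    Missing views are skipped entirely rather than filled with a
    placeholder value, so they do not influence the merged feature.
    """
    if not available:
        raise ValueError("at least one view must be available")
    dim = len(features[available[0]])
    return [
        sum(features[view][i] for view in available) / len(available)
        for i in range(dim)
    ]
```

For three views, `view_subsets` yields seven variants of each sample (every subset except the empty one), and `average_merge` produces a fused feature of the same dimension regardless of how many views are present.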
[CV-38] InDeed: Interpretable image deep decomposition with guaranteed generalizability
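[Quick read]: The paper addresses interpretability and generalizability in image decomposition, the task of analyzing an image into elementary components, which is essential for many downstream tasks and by nature lends some interpretability to the analysis. Deep learning is powerful for such tasks, but its combination with interpretability and generalizability has rarely been studied. The paper proposes a framework for interpretable deep image decomposition that combines hierarchical Bayesian modeling with deep learning to build an architecture-modularized and model-generalizable deep neural network (DNN). The framework has three steps: (1) hierarchical Bayesian modeling of image decomposition, (2) transforming the inference problem into optimization tasks, and (3) deep inference via a modularized Bayesian DNN. The paper also establishes a theoretical connection between the loss function and the generalization error bound and, based on it, proposes a new test-time adaptation approach for out-of-distribution scenarios. Applications to two downstream tasks, image denoising and unsupervised anomaly detection, show improved generalizability and interpretability.
Link: https://arxiv.org/abs/2501.01127
Authors: Sihan Wang,Shangqi Gao,Fuping Wu,Xiahai Zhuang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: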
Abstract:Image decomposition aims to analyze an image into elementary components, which is essential for numerous downstream tasks and also by nature provides certain interpretability to the analysis. Deep learning can be powerful for such tasks, but surprisingly their combination with a focus on interpretability and generalizability is rarely explored. In this work, we introduce a novel framework for interpretable deep image decomposition, combining hierarchical Bayesian modeling and deep learning to create an architecture-modularized and model-generalizable deep neural network (DNN). The proposed framework includes three steps: (1) hierarchical Bayesian modeling of image decomposition, (2) transforming the inference problem into optimization tasks, and (3) deep inference via a modularized Bayesian DNN. We further establish a theoretical connection between the loss function and the generalization error bound, which inspires a new test-time adaptation approach for out-of-distribution scenarios. We instantiated the application using two downstream tasks, i.e., image denoising, and unsupervised anomaly detection, and the results demonstrated improved generalizability as well as interpretability of our methods. The source code will be released upon the acceptance of this paper.
[CV-39] Source-free Semantic Regularization Learning for Semi-supervised Domain Adaptation
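[Quick read]: The paper addresses the difficulty existing semi-supervised domain adaptation (SSDA) methods have in adapting to the target domain, which stems from their inability to fully learn the rich and complex semantic information and relationships of that domain. It proposes semantic regularization learning (SERL), a framework that captures target semantic information from multiple regularization perspectives to adaptively fine-tune the source pre-trained model on the target domain. SERL relies on three semantic regularization techniques: semantic probability contrastive regularization (SPCR), hard-sample mixup regularization (HMR), and target prediction regularization (TPR). SPCR helps the model learn more discriminative feature representations from a probabilistic perspective; HMR uses easy samples to mine latent target knowledge in hard samples; and TPR regularizes target predictions by maximizing the correlation between the current prediction and the past learned objective, which reduces the misleading effect of erroneous pseudo-labels. Combined, these techniques give SERL state-of-the-art performance on multiple benchmark datasets.
Link: https://arxiv.org/abs/2501.01126
Authors: Xinyang Huang,Chuang Zhu,Ruiying Ren,Shengjie Liu,Tiejun Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: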
Abstract:Semi-supervised domain adaptation (SSDA) has been extensively researched due to its ability to improve classification performance and generalization ability of models by using a small amount of labeled data on the target domain. However, existing methods cannot effectively adapt to the target domain due to difficulty in fully learning rich and complex target semantic information and relationships. In this paper, we propose a novel SSDA learning framework called semantic regularization learning (SERL), which captures the target semantic information from multiple perspectives of regularization learning to achieve adaptive fine-tuning of the source pre-trained model on the target domain. SERL includes three robust semantic regularization techniques. Firstly, semantic probability contrastive regularization (SPCR) helps the model learn more discriminative feature representations from a probabilistic perspective, using semantic information on the target domain to understand the similarities and differences between samples. Additionally, adaptive weights in SPCR can help the model learn the semantic distribution correctly through the probabilities of different samples. To further comprehensively understand the target semantic distribution, we introduce hard-sample mixup regularization (HMR), which uses easy samples as guidance to mine the latent target knowledge contained in hard samples, thereby learning more complete and complex target semantic knowledge. Finally, target prediction regularization (TPR) regularizes the target predictions of the model by maximizing the correlation between the current prediction and the past learned objective, thereby mitigating the misleading of semantic information caused by erroneous pseudo-labels. Extensive experiments on three benchmark datasets demonstrate that our SERL method achieves state-of-the-art performance.
[CV-40] DuMo: Dual Encoder Modulation Network for Precise Concept Erasure AAAI2025
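[Quick read]: The paper addresses safety concerns about text-to-image models generating inappropriate content, such as Not-Safe-For-Work (NSFW) material, and potentially infringing copyright. Existing methods protect models by erasing inappropriate concepts, but they alter the parameters of the backbone network and strongly affect the structural (low-frequency) components of the image, which weakens the model's ability to retain non-target concepts. The proposed Dual encoder Modulation network (DuMo) introduces an Eraser with PRior Knowledge (EPR) module that modifies the skip connection features of the U-NET and performs concept erasure mainly on the detail (high-frequency) components of the image. To minimize damage to non-target concepts, the backbone U-NET parameters are frozen and prior knowledge from the original skip connection features is introduced into the erasure process. The authors also observe that EPR shows different erasing preferences for image structure and details at different timesteps and layers, so they adopt a Time-Layer MOdulation process (TLMO) that automatically balances erasure effects against the model's generative ability. The method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal, and Artistic Style Erasure.
Link: https://arxiv.org/abs/2501.01125
Authors: Feng Han,Kai Chen,Chao Gong,Zhipeng Wei,Jingjing Chen,Yu-Gang Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025 accepted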
Abstract:The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these models alter the parameters of the backbone network and exert considerable influences on the structural (low-frequency) components of the image, which undermines the model’s ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on details (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, the phenomenon is observed that distinct erasing preferences for the image structure and details are demonstrated by the EPR at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module’s outputs across different layers and timesteps, automatically balancing the erasure effects and model’s generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal and Artistic Style Erasure, clearly outperforming alternative methods. Code is available at this https URL
[CV-41] PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation
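[Quick read]: The paper addresses the computational inefficiency of current high-resolution depth estimation methods, which rely on heavyweight models and multiple inference steps and therefore have long inference times. The authors propose PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders; this reduces model size and inference time but introduces noisy features. To counter the noise, the paper proposes a Coarse-to-Fine (C2F) module with a Guided Denoising Unit that refines and denoises the refiner features, together with a Noisy Pretraining strategy that pretrains the refiner branch so that the lightweight branch can reach its full potential. The authors also introduce a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to improve synthetic-to-real domain transfer. On UnrealStereo4K, PRV2 outperforms existing depth estimation methods in both accuracy and speed while using fewer parameters, and it shows its versatility across domains on real-world datasets such as CityScape, ScanNet++, and KITTI.
Link: https://arxiv.org/abs/2501.01121
Authors: Zhenyu Li,Wenqing Cui,Shariq Farooq Bhat,Peter Wonka
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: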
Abstract:While current high-resolution depth estimation methods achieve strong results, they often suffer from computational inefficiencies due to reliance on heavyweight models and multiple inference steps, increasing inference time. To address this, we introduce PatchRefiner V2 (PRV2), which replaces heavy refiner models with lightweight encoders. This reduces model size and inference time but introduces noisy features. To overcome this, we propose a Coarse-to-Fine (C2F) module with a Guided Denoising Unit for refining and denoising the refiner features and a Noisy Pretraining strategy to pretrain the refiner branch to fully exploit the potential of the lightweight refiner branch. Additionally, we introduce a Scale-and-Shift Invariant Gradient Matching (SSIGM) loss to enhance synthetic-to-real domain transfer. PRV2 outperforms state-of-the-art depth estimation methods on UnrealStereo4K in both accuracy and speed, using fewer parameters and faster inference. It also shows improved depth boundary delineation on real-world datasets like CityScape, ScanNet++, and KITTI, demonstrating its versatility across domains.
[CV-42] Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning AAAI2025
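[Quick read]: The paper addresses incomplete modalities in multimodal learning. With missing modalities, the remaining modalities offer restricted cues for task inference, dummy imputation of the missing content loses information and introduces noise, and static prompts cannot adapt to instances with different missing conditions. To address these issues, the paper proposes RAGPT (Retrieval-AuGmented dynamic Prompt Tuning), a framework with three modules: (1) a multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy; (2) a missing modality generator, which recovers missing information from the retrieved contexts; and (3) a context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts, substantially improving the robustness of MultiModal Transformers (MMTs) under missing modalities. Experiments show that RAGPT outperforms existing baselines in handling incomplete modalities.
Link: https://arxiv.org/abs/2501.01120
Authors: Jian Lang,Zhangtao Cheng,Ting Zhong,Fan Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures. Accepted by AAAI 2025. Codes are released at this https URL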
Abstract:Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT’s robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at this https URL.
[CV-43] Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
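[Quick read]: The paper addresses scene understanding through open-vocabulary panoptic reconstruction, with applications in embodied robotics and photorealistic simulation. Existing methods either separate the optimization of queries and keys or overlook spatial proximity, which limits reconstruction quality. The proposed PanopticRecon++ takes a cross-attention perspective: 3D instances act as queries, the scene's 3D embedding field acts as keys, and their relationship is modeled through the attention map. The key innovation is to use learnable 3D Gaussians as instance queries, which injects 3D spatial priors to preserve proximity while keeping the method end-to-end optimizable. The method also aligns 2D open-vocabulary instance IDs across frames through optimal linear assignment with instance masks, and it keeps semantic and instance segmentation consistent by fusing query-based instance segmentation probabilities with semantic probabilities. During training, the number of instance query tokens adapts dynamically to the number of objects. Experiments show that PanopticRecon++ performs well on 3D and 2D segmentation and on reconstruction, and they demonstrate its potential use in robot simulation.
Link: https://arxiv.org/abs/2501.01119
Authors: Xuan Yu,Yuxuan Xie,Yili Liu,Haojian Lu,Rong Xiong,Yiyi Liao,Yue Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 18 pages, 10 figures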
Abstract:Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene’s 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive performance in terms of 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a user case as a robot simulator. Our project website is at: this https URL
[CV-44] HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment
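[Quick read]: The paper addresses the inaccuracy of existing image quality assessment (IQA) methods in capturing human visual preference for image harmonization. Existing IQA methods are insensitive to minor color or lighting inconsistencies, so their scores do not align with human visual preference. To address this, the paper introduces the first IQA database for image harmony evaluation, HarmonyIQAD, which contains 1,350 harmonized images generated by 9 different image harmonization algorithms (IHAs) together with the corresponding human visual preference scores. Based on this database, the paper proposes a Harmony Image Quality Assessment method (HarmonyIQA) that predicts human visual preference for harmonized images. Experiments show that HarmonyIQA achieves state-of-the-art performance on this evaluation and is competitive on traditional IQA tasks. Cross-dataset evaluation further shows that HarmonyIQA generalizes better than IQA methods based on self-supervised learning.
Link: https://arxiv.org/abs/2501.01116
Authors: Zitong Xu,Huiyu Duan,Guangji Ma,Liu Yang,Jiarui Wang,Qingbo Wu,Xiongkuo Min,Guangtao Zhai,Patrick Le Callet
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: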
Abstract:Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor color or light inconsistency. To address the issue and facilitate the advancement of IHAs, we introduce the first Image Quality Assessment Database for image Harmony evaluation (HarmonyIQAD), which consists of 1,350 harmonized images generated by 9 different IHAs, and the corresponding human visual preference scores. Based on this database, we propose a Harmony Image Quality Assessment (HarmonyIQA), to predict human visual preference for harmonized images. Extensive experiments show that HarmonyIQA achieves state-of-the-art performance on human visual preference evaluation for harmonized images, and also achieves competing results on traditional IQA tasks. Furthermore, cross-dataset evaluation also shows that HarmonyIQA exhibits better generalization ability than self-supervised learning-based IQA methods. Both HarmonyIQAD and HarmonyIQA will be made publicly available upon paper publication.
[CV-45] Generalized Task-Driven Medical Image Quality Enhancement with Gradient Promotion
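[Quick read]: The paper addresses a limitation of existing task-driven image quality enhancement (IQE) models: vision tasks at different levels place varying and sometimes conflicting requirements on image features, which constrains model performance. The paper proposes a generalized gradient promotion (GradProm) training strategy for task-driven IQE of medical images. The strategy partitions a task-driven IQE system into two sub-models: a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, GradProm updates the parameters of the image enhancement model with the gradients of both sub-models only when those gradients point in the same direction, as measured by cosine similarity; when they do not, it uses only the gradient of the image enhancement model. In theory, GradProm ensures that the optimization direction of the image enhancement model is not biased by the auxiliary visual recognition model. Experiments on four public and challenging medical image datasets show that GradProm outperforms existing state-of-the-art methods.
Link: https://arxiv.org/abs/2501.01114
Authors: Dong Zhang,Kwang-Ting Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence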
Abstract:Thanks to the recent achievements in task-driven image quality enhancement (IQE) models like ESTR, the image enhancement model and the visual recognition model can mutually enhance each other’s quantitation while producing high-quality processed images that are perceivable by our human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact – different levels of vision tasks have varying and sometimes conflicting requirements of image features. To address this problem, this paper proposes a generalized gradient promotion (GradProm) training strategy for task-driven IQE of medical images. Specifically, we partition a task-driven IQE system into two sub-models, i.e., a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, GradProm updates only parameters of the image enhancement model using gradients of the visual recognition model and the image enhancement model, but only when gradients of these two sub-models are aligned in the same direction, which is measured by their cosine similarity. In case gradients of these two sub-models are not in the same direction, GradProm only uses the gradient of the image enhancement model to update its parameters. Theoretically, we have proved that the optimization direction of the image enhancement model will not be biased by the auxiliary visual recognition model under the implementation of GradProm. Empirically, extensive experimental results on four public yet challenging medical image datasets demonstrated the superior performance of GradProm over existing state-of-the-art methods.
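The gating rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: gradients are flat lists of floats, "aligned" is taken to mean a cosine similarity above a threshold of 0, and the two gradients are combined by plain summation. The threshold, the summation, and the names `cosine_similarity` and `gradprom_gradient` are illustrative choices, not details confirmed by the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gradprom_gradient(grad_enhance, grad_recognize, threshold=0.0):
    """Gradient used to update the enhancement model.

    If the recognition gradient is aligned with the enhancement gradient
    (cosine similarity above `threshold`, an assumed criterion), both are
    used; otherwise only the enhancement gradient is kept.
    """
    if cosine_similarity(grad_enhance, grad_recognize) > threshold:
        return [e + r for e, r in zip(grad_enhance, grad_recognize)]
    return list(grad_enhance)
```

The effect is that the auxiliary recognition task can only reinforce the enhancement update and never steer it in a conflicting direction.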
[CV-46] BatStyler: Advancing Multi-category Style Generation for Source-free Domain Generalization
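[Quick read]: The paper addresses the performance of Source-Free Domain Generalization (SFDG) in multi-category configurations. Existing methods work well in multi-domain, less-category configurations but perform poorly in multi-domain, multi-category ones, and the efficiency of style synthesis also drops markedly as the number of categories grows. The proposed BatStyler improves style synthesis in multi-category scenarios through two modules: Coarse Semantic Generation and Uniform Style Generation. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the space for style diversity learning from being compressed in multi-category configurations. The Uniform Style Generation module provides a template of uniformly distributed styles and enables parallel training. Experiments show that the method is comparable to existing methods on less-category datasets and clearly surpasses them on multi-category datasets.
Link: https://arxiv.org/abs/2501.01109
Authors: Xiusheng Xu,Lei Qi,Jingyang Zhou,Xin Geng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE TCSVT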
Abstract:Source-Free Domain Generalization (SFDG) aims to develop a model that performs on unseen domains without relying on any source domains. However, the implementation remains constrained due to the unavailability of training data. Research on SFDG focus on knowledge transfer of multi-modal models and style synthesis based on joint space of multiple modalities, thus eliminating the dependency on source domain images. However, existing works primarily work for multi-domain and less-category configuration, but performance on multi-domain and multi-category configuration is relatively poor. In addition, the efficiency of style synthesis also deteriorates in multi-category scenarios. How to efficiently synthesize sufficiently diverse data and apply it to multi-category configuration is a direction with greater practical value. In this paper, we propose a method called BatStyler, which is utilized to improve the capability of style synthesis in multi-category scenarios. BatStyler consists of two modules: Coarse Semantic Generation and Uniform Style Generation modules. The Coarse Semantic Generation module extracts coarse-grained semantics to prevent the compression of space for style diversity learning in multi-category configuration, while the Uniform Style Generation module provides a template of styles that are uniformly distributed in space and implements parallel training. Extensive experiments demonstrate that our method exhibits comparable performance on less-category datasets, while surpassing state-of-the-art methods on multi-category datasets.
[CV-47] AIM: Additional Image Guided Generation of Transferable Adversarial Attacks
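[Quick Read]: This paper addresses the vulnerability of deep neural networks (DNNs) to targeted transferable attacks. Untargeted transferable attacks have advanced notably, but targeted ones remain a significant challenge. The paper proposes a generative approach that adds a plug-and-play Semantic Injection Module (SIM) to the generator to improve the transferability of adversarial examples. The module uses the semantics of an additional guiding image to inject target-class semantics into adversarial example generation, which strengthens targeted transferable attacks. The paper also proposes new loss formulations that integrate the module more effectively in both targeted and untargeted settings. Experiments show that the method performs well in both settings.
Link: https://arxiv.org/abs/2501.01106
Authors: Teng Li, Xingjun Ma, Yu-Gang Jiang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: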
Abstract: Transferable adversarial examples highlight the vulnerability of deep neural networks (DNNs) to imperceptible perturbations across various real-world applications. While there have been notable advancements in untargeted transferable attacks, targeted transferable attacks remain a significant challenge. In this work, we focus on generative approaches for targeted transferable attacks. Current generative attacks focus on reducing overfitting to surrogate models and the source data domain, but they often overlook the importance of enhancing transferability through additional semantics. To address this issue, we introduce a novel plug-and-play module into the general generator architecture to enhance adversarial transferability. Specifically, we propose a Semantic Injection Module (SIM) that utilizes the semantics contained in an additional guiding image to improve transferability. The guiding image provides a simple yet effective method to incorporate target semantics from the target class to create targeted and highly transferable attacks. Additionally, we propose new loss formulations that can integrate the semantic injection module more effectively for both targeted and untargeted attacks. We conduct comprehensive experiments under both targeted and untargeted attack settings to demonstrate the efficacy of our proposed approach.
[CV-48] Deformable Gaussian Splatting for Efficient and High-Fidelity Reconstruction of Surgical Scenes ICRA2025
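[Quick Read]: This paper targets efficient, high-fidelity reconstruction of deformable surgical scenes and focuses on two challenges: irreversible dynamic changes such as tissue shearing are hard to handle, and the lack of hierarchical modeling of scene deformation slows rendering. The proposed algorithm, EH-SurGS, makes two contributions: (1) a deformation modeling approach based on the life cycle of 3D Gaussians, which captures both regular and irreversible deformations and improves reconstruction quality; (2) an adaptive motion hierarchy strategy that separates static from deformable regions in the surgical scene, reducing the number of 3D Gaussians that pass through the deformation field and thereby speeding up rendering. Experiments show that the method outperforms existing techniques in both reconstruction quality and rendering speed.
Link: https://arxiv.org/abs/2501.01101
Authors: Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Shing Shin Cheng, Hesheng Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures, submitted to ICRA 2025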
Abstract:Efficient and high-fidelity reconstruction of deformable surgical scenes is a critical yet challenging task. Building on recent advancements in 3D Gaussian splatting, current methods have seen significant improvements in both reconstruction quality and rendering speed. However, two major limitations remain: (1) difficulty in handling irreversible dynamic changes, such as tissue shearing, which are common in surgical scenes; and (2) the lack of hierarchical modeling for surgical scene deformation, which reduces rendering speed. To address these challenges, we introduce EH-SurGS, an efficient and high-fidelity reconstruction algorithm for deformable surgical scenes. We propose a deformation modeling approach that incorporates the life cycle of 3D Gaussians, effectively capturing both regular and irreversible deformations, thus enhancing reconstruction quality. Additionally, we present an adaptive motion hierarchy strategy that distinguishes between static and deformable regions within the surgical scene. This strategy reduces the number of 3D Gaussians passing through the deformation field, thereby improving rendering speed. Extensive experiments demonstrate that our method surpasses existing state-of-the-art approaches in both reconstruction quality and rendering speed. Ablation studies further validate the effectiveness and necessity of our proposed components. We will open-source our code upon acceptance of the paper.
[CV-49] EliGen: Entity-Level Controlled Image Generation with Regional Attention
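[Quick Read]: This paper addresses a limitation of current diffusion models for text-to-image generation: global text prompts alone cannot give fine-grained control over individual entities in an image. The authors propose EliGen, a framework that introduces regional attention to integrate entity prompts and arbitrary-shaped spatial masks into diffusion transformers without adding parameters. EliGen is trained on a high-quality dataset with fine-grained spatial and semantic entity-level annotations, which enables precise entity-level manipulation. The authors also propose an inpainting fusion pipeline that extends EliGen to multi-entity image inpainting, and they show that it integrates flexibly with community models such as IP-Adapter and MLLM, opening further creative possibilities.
Link: https://arxiv.org/abs/2501.01097
Authors: Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: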
Abstract:Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.
[CV-50] HoneypotNet: Backdoor Attacks Against Model Extraction AAAI2025
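[Quick Read]: This paper addresses the security threat that model extraction attacks pose to production models and Machine-Learning-as-a-Service (MLaaS) platforms. Such attacks send many queries to a black-box model and use its predictions to train a substitute model that approximates the original's functionality and performance, which can cause significant monetary losses for the model owner. The paper proposes a new defense paradigm, "attack as defense": the model's outputs are modified to be poisonous, so that a malicious user who trains a substitute model on them ends up with a backdoored model. Concretely, the lightweight backdoor attack method HoneypotNet replaces the victim model's classification layer with a honeypot layer and fine-tunes that layer via bi-level optimization so that its outputs are poisonous while the original performance is retained. Experiments show that HoneypotNet injects backdoors into substitute models with a high success rate. The backdoor supports ownership verification and also disrupts the substitute model's functionality, which deters model extraction.
Link: https://arxiv.org/abs/2501.01090
Authors: Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the AAAI 2025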
Abstract:Model extraction attacks are one type of inference-time attacks that approximate the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model’s predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model’s outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called attack as defense which modifies the model’s output to be poisonous such that any malicious users that attempt to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while remaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.
[CV-51] Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction
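[Quick Read]: This paper addresses two issues in Time Series Forecasting (TSF): complex models such as Transformers may struggle to preserve temporal relationships, and prediction accuracy needs to improve when data is limited. The paper proposes GLinear, a data-efficient architecture that exploits periodic patterns in time series to improve accuracy. With less historical data, it still outperforms existing linear predictors (NLinear, DLinear, and RLinear) and a Transformer-based model (Autoformer). GLinear is parameter-efficient and significantly outperforms existing architectures in most multivariate TSF cases, which opens new research directions for data-efficient and computationally efficient time series analysis.
Link: https://arxiv.org/abs/2501.01087
Authors: Syed Tahir Hussain Rizvi, Neel Kanwal, Muddasar Naeem, Alfredo Cuzzocrea, Antonio Coronato
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: Submitted to IEEE Transactions on Emerging Topics in Computational Intelligence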
Abstract:Time Series Forecasting (TSF) is an important application across many fields. There is a debate about whether Transformers, despite being good at understanding long sequences, struggle with preserving temporal relationships in time series data. Recent research suggests that simpler linear models might outperform or at least provide competitive performance compared to complex Transformer-based models for TSF tasks. In this paper, we propose a novel data-efficient architecture, GLinear, for multivariate TSF that exploits periodic patterns to provide better accuracy. It also provides better prediction accuracy by using a smaller amount of historical data compared to other state-of-the-art linear predictors. Four different datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the performance of the proposed predictor. A performance comparison with state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear) and transformer-based time series predictor (Autoformer) shows that the GLinear, despite being parametrically efficient, significantly outperforms the existing architectures in most cases of multivariate TSF. We hope that the proposed GLinear opens new fronts of research and development of simpler and more sophisticated architectures for data and computationally efficient time-series analysis. The source code is publicly available on GitHub.
[CV-52] Evidential Calibrated Uncertainty-Guided Interactive Segmentation paradigm for Ultrasound Images
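[Quick Read]: This paper addresses accuracy and robustness in ultrasound image segmentation, where traditional methods struggle with blurry boundaries and speckle noise. Existing interactive segmentation methods have improved but remain inefficient and unspecialized: they usually need many accurate manual prompts or randomly sampled prompts to reach satisfactory performance. The paper proposes Evidential Uncertainty-Guided Interactive Segmentation (EUGIS), an end-to-end, efficient, tiered interactive segmentation method based on evidential uncertainty estimation. It uses evidence-based uncertainty estimation, grounded in Dempster-Shafer theory and Subjective Logic, to gauge how uncertain the model's predictions are in different regions. By sampling high-uncertainty regions first, EUGIS simulates the interactive behavior of well-trained radiologists, which makes sampling more targeted and reduces the number of prompts and iterations. The paper also proposes a trainable calibration mechanism for uncertainty estimation that further optimizes the boundary between certainty and uncertainty and thereby raises confidence in the estimates.
Link: https://arxiv.org/abs/2501.01072
Authors: Jiang Shang, Yuanmeng Wu, Xiaoxiang Han, Xi Chen, Qi Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: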
Abstract: Accurate and robust ultrasound image segmentation is critical for computer-aided diagnostic systems. Nevertheless, the inherent challenges of ultrasound imaging, such as blurry boundaries and speckle noise, often cause traditional segmentation methods to struggle with performance. Despite recent advancements in universal image segmentation, such as the Segment Anything Model, existing interactive segmentation methods still suffer from inefficiency and lack of specialization. These methods rely heavily on extensive accurate manual or random sampling prompts for interaction, necessitating numerous prompts and iterations to reach satisfactory performance. In response to this challenge, we propose the Evidential Uncertainty-Guided Interactive Segmentation (EUGIS), an end-to-end, efficient tiered interactive segmentation paradigm based on evidential uncertainty estimation for ultrasound image segmentation. Specifically, EUGIS harnesses evidence-based uncertainty estimation, grounded in Dempster-Shafer theory and Subjective Logic, to gauge the level of uncertainty in the model's predictions for different regions. By prioritizing sampling the high-uncertainty region, our method can effectively simulate the interactive behavior of well-trained radiologists, making sampling more targeted while reducing the number of prompts and iterations [...], we propose a trainable calibration mechanism for uncertainty estimation, which can further optimize the boundary between certainty and uncertainty, thereby enhancing the confidence of uncertainty estimation.
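The uncertainty-guided sampling idea can be illustrated with the standard Subjective Logic formula used in evidential deep learning, where a pixel with non-negative evidence e_1..e_K over K classes has uncertainty u = K / sum(e_k + 1). The sketch below is only an illustration under that assumption; the abstract does not give EUGIS's exact formulation, and the function names and the dictionary-based evidence map are hypothetical.

```python
def evidential_uncertainty(evidence):
    """Subjective-Logic uncertainty u = K / S for one pixel.

    `evidence` holds one non-negative evidence value per class;
    S is the Dirichlet strength sum(e_k + 1).
    """
    num_classes = len(evidence)
    if num_classes == 0 or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def pick_prompt_pixels(evidence_map, num_prompts):
    """Return the coordinates of the `num_prompts` most uncertain pixels.

    `evidence_map` maps (row, col) -> per-class evidence list (hypothetical layout).
    """
    ranked = sorted(
        evidence_map.items(),
        key=lambda item: evidential_uncertainty(item[1]),
        reverse=True,
    )
    return [coord for coord, _ in ranked[:num_prompts]]
```

A pixel with no evidence at all gets u = 1, and uncertainty shrinks as evidence accumulates, so the prompts land where the model knows least.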
[CV-53] TS-SatMVSNet: Slope Aware Height Estimation for Large-Scale Earth Terrain Multi-view Stereo
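[Quick Read]: This paper addresses 3D terrain reconstruction from remote sensing imagery, where existing learning-based methods ignore terrain characteristics during height estimation and therefore lack accuracy. The key idea is to integrate terrain slope information into the multi-view stereo (MVS) framework to improve reconstruction accuracy. The authors propose TS-SatMVSNet, an end-to-end slope-aware height estimation network that uses a height-map-based slope calculation strategy to generate a slope map measuring terrain undulation. Two slope-guided modules are designed: at the micro level, a slope-guided interval partition module refines height estimation; at the macro level, a learnable Gaussian smoothing operator corrects inaccurate height values. A slope direction loss is also introduced to optimize the height estimates implicitly. Experiments show state-of-the-art performance on the WHU-TLC and MVS3D datasets and strong generalization ability.
Link: https://arxiv.org/abs/2501.01049
Authors: Song Zhang, Zhiwei Wei, Wenjia Xu, Lili Zhang, Yang Wang, Jinming Zhang, Junyi Liu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: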
Abstract: 3D terrain reconstruction with remote sensing imagery achieves cost-effective and large-scale earth observation and is crucial for safeguarding natural disasters, monitoring ecological changes, and preserving the [...], learning-based multi-view stereo (MVS) methods have shown promise in this task. However, these methods simply modify the general learning-based MVS framework for height estimation, which overlooks the terrain characteristics and results in insufficient accuracy. Considering that the Earth's surface generally undulates with no drastic changes and can be measured by slope, integrating slope considerations into MVS frameworks could enhance the accuracy of terrain reconstructions. To this end, we propose an end-to-end slope-aware height estimation network named TS-SatMVSNet for large-scale remote sensing terrain [...] effectively obtain the slope representation, drawing from mathematical gradient concepts, we innovatively proposed a height-based slope calculation strategy to first calculate a slope map from a height map to measure the terrain undulation. To fully integrate slope information into the MVS pipeline, we separately design two slope-guided modules to enhance reconstruction outcomes at both micro and macro levels. Specifically, at the micro level, we designed a slope-guided interval partition module for refined height estimation using slope values. At the macro level, a height correction module is proposed, using a learnable Gaussian smoothing operator to amend the inaccurate height values. Additionally, to enhance the efficacy of height estimation, we proposed a slope direction loss for implicitly optimizing height estimation results. Extensive experiments on the WHU-TLC dataset and MVS3D dataset show that our proposed method achieves state-of-the-art performance and demonstrates competitive generalization ability.
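Computing a slope map from a height map can be sketched with the textbook gradient-magnitude definition: slope = atan(sqrt((dz/dx)^2 + (dz/dy)^2)). This is a minimal illustration assuming a regular grid and finite differences; the paper's actual slope calculation strategy is not specified in the abstract, and `slope_map` and `cell_size` are hypothetical names.

```python
import math


def slope_map(height, cell_size=1.0):
    """Return per-cell slope in degrees for a 2D height grid (list of rows).

    Uses central differences in the interior and one-sided differences
    at the borders; `cell_size` is the ground distance between cells.
    """
    rows, cols = len(height), len(height[0])

    def derivative(low_value, high_value, span):
        # span is the index distance between the two samples (0 on a 1-cell axis)
        return (high_value - low_value) / (span * cell_size) if span else 0.0

    slopes = []
    for r in range(rows):
        row = []
        for c in range(cols):
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            dz_dy = derivative(height[r0][c], height[r1][c], r1 - r0)
            dz_dx = derivative(height[r][c0], height[r][c1], c1 - c0)
            row.append(math.degrees(math.atan(math.hypot(dz_dx, dz_dy))))
        slopes.append(row)
    return slopes
```

On a ramp that rises one height unit per cell, every cell reports a slope of about 45 degrees; on flat ground the map is all zeros.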
[CV-54] ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think
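[Quick Read]: This paper addresses catastrophic forgetting in continual learning and continual pre-training when gradient information is unavailable, for example with black-box APIs, hardware limitations, or non-differentiable systems. It introduces ZeroFlow, the first benchmark for evaluating gradient-free optimization algorithms for overcoming forgetting. The central finding is that forward passes alone are enough to overcome forgetting. The results indicate that forward-pass methods can mitigate forgetting, manage task conflicts, and reduce memory demands, and that enhancements using a single forward pass mitigate forgetting further. The work provides insights and tools for advancing forward-pass methods for overcoming forgetting.
Link: https://arxiv.org/abs/2501.01045
Authors: Tao Feng, Wei Li, DiDi Zhu, Hangjie Yuan, Wendi Zheng, Dan Zhang, Jie Tang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: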
Abstract: Backpropagation provides a generalized configuration for overcoming catastrophic forgetting. For example, SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. In practice, permission to access gradient information is not always granted (the gradient ban), such as black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce the first benchmark ZeroFlow to evaluate gradient-free optimization algorithms for overcoming forgetting. This benchmark examines a suite of forward pass methods across multiple methods, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward-pass in mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward pass methods to overcome forgetting.
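To make "forward passes alone" concrete, here is a minimal sketch of one common gradient-free scheme: a two-point zeroth-order gradient estimate along a random direction, plugged into plain SGD. It is an illustration only and is not taken from the ZeroFlow benchmark; the function names, the step size `lr`, and the smoothing radius `mu` are assumptions.

```python
import random


def zo_gradient(loss_fn, params, mu=1e-3, rng=random):
    """Estimate the gradient from two forward passes (no backpropagation).

    Perturbs the parameters along a random Gaussian direction d and uses
    (L(p + mu*d) - L(p - mu*d)) / (2*mu) * d as the gradient estimate.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + mu * d for p, d in zip(params, direction)]
    minus = [p - mu * d for p, d in zip(params, direction)]
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
    return [scale * d for d in direction]


def zo_sgd(loss_fn, params, lr=0.05, steps=500, seed=0):
    """Run SGD where every update uses only forward evaluations of loss_fn."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Each step costs two loss evaluations and stores no activations, which is why methods of this kind need less memory than backpropagation.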
[CV-55] Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
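[Quick Read]: This paper addresses the transferability of adversarial examples against video-based multimodal large language models (V-MLLMs) in black-box settings. Existing adversarial attack methods show significant limitations in this setting: (1) they generalize poorly when perturbing video features, (2) they focus only on sparse key frames, and (3) they fail to integrate multimodal information. To address these problems, the paper proposes the Image-to-Video MLLM (I2V-MLLM) attack. It uses an image-based multimodal model (IMM) as a surrogate to craft adversarial videos, integrating multimodal interactions and temporal information to disrupt video representations in the latent space and thereby improve transferability. A perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experiments show that the generated adversarial examples transfer well across multiple video-text multimodal tasks, with black-box attack success rates competitive with white-box attacks.
Link: https://arxiv.org/abs/2501.01042
Authors: Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Bo Han, Yongjie Yin, Feng Zheng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: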
Abstract:Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models–a common and practical real world scenario–remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks, respectively. Our code will be released upon acceptance.
[CV-56] Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras ICASSP2025
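[Quick Read]: This paper addresses two main problems in action recognition with Dynamic Vision Sensors (DVS): (1) temporal information is lost during data transformation, and (2) sensor imperfections or environmental factors introduce noise and outliers. The paper proposes a framework that preserves and exploits the spatiotemporal structure of event data to improve action recognition. It has two main components: (1) a point-wise event masked autoencoder (MAE) that learns compact and discriminative representations of event patches by reconstructing them from masked raw event camera data; (2) an improved event points patch generation algorithm that uses an inlier model of the event data and point-wise data augmentation to raise the quality and diversity of event points patches. The work is also the first to bring pre-training to raw event camera point data, and it proposes a new event points patch embedding so that Transformer-based models can be applied to event cameras.
Link: https://arxiv.org/abs/2501.01040
Authors: Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICASSP 2025 Camera Ready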
Abstract:Dynamic vision sensors (DVS) are bio-inspired devices that capture visual information in the form of asynchronous events, which encode changes in pixel intensity with high temporal resolution and low latency. These events provide rich motion cues that can be exploited for various computer vision tasks, such as action recognition. However, most existing DVS-based action recognition methods lose temporal information during data transformation or suffer from noise and outliers caused by sensor imperfections or environmental factors. To address these challenges, we propose a novel framework that preserves and exploits the spatiotemporal structure of event data for action recognition. Our framework consists of two main components: 1) a point-wise event masked autoencoder (MAE) that learns a compact and discriminative representation of event patches by reconstructing them from masked raw event camera points data; 2) an improved event points patch generation algorithm that leverages an event data inlier model and point-wise data augmentation techniques to enhance the quality and diversity of event points patches. To the best of our knowledge, our approach introduces the pre-train method into event camera raw points data for the first time, and we propose a novel event points patch embedding to utilize transformer-based models on event cameras.
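The masking step of a masked autoencoder can be sketched as a random split of patch indices into visible and masked sets; the encoder sees only the visible patches and the decoder is trained to reconstruct the masked ones. This is a generic MAE-style illustration, not the paper's implementation; `split_masked_patches`, the default `mask_ratio`, and the seeding scheme are hypothetical.

```python
import random


def split_masked_patches(num_patches, mask_ratio=0.6, seed=0):
    """Randomly split patch indices into (visible_ids, masked_ids).

    `mask_ratio` is the fraction of patches hidden from the encoder.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = round(num_patches * mask_ratio)
    masked_ids = sorted(order[:num_masked])
    visible_ids = sorted(order[num_masked:])
    return visible_ids, masked_ids
```

For event data, each index would refer to one patch of raw event points (for example a local group of (x, y, t, polarity) tuples), though how the patches are formed is left to the paper's generation algorithm.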
[CV-57] MSC-Bench: Benchmarking and Analyzing Multi-Sensor Corruption for Driving Perception
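[Quick Read]: This paper addresses the robustness of multi-sensor fusion models for autonomous driving perception, particularly when sensor data is corrupted or missing. Existing camera-LiDAR fusion methods integrate both modalities but depend on complete sensor inputs, so their robustness is low when inputs are corrupted or missing, which raises safety concerns. The paper introduces the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark for evaluating the robustness of multi-sensor autonomous driving perception models under various sensor corruptions. It includes 16 combinations of corruption types that affect the camera and LiDAR inputs individually or concurrently. Extensive evaluation of six 3D object detection models and four HD map construction models reveals substantial performance degradation under adverse weather and sensor failures, highlighting critical safety issues.
Link: https://arxiv.org/abs/2501.01037
Authors: Xiaoshuai Hao, Guanqun Liu, Yuting Zhao, Yuheng Ji, Mengchuan Wei, Haimei Zhao, Lingdong Kong, Rong Yin, Yu Liu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: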
Abstract:Multi-sensor fusion models play a crucial role in autonomous driving perception, particularly in tasks like 3D object detection and HD map construction. These models provide essential and comprehensive static environmental information for autonomous driving systems. While camera-LiDAR fusion methods have shown promising results by integrating data from both modalities, they often depend on complete sensor inputs. This reliance can lead to low robustness and potential failures when sensors are corrupted or missing, raising significant safety concerns. To tackle this challenge, we introduce the Multi-Sensor Corruption Benchmark (MSC-Bench), the first comprehensive benchmark aimed at evaluating the robustness of multi-sensor autonomous driving perception models against various sensor corruptions. Our benchmark includes 16 combinations of corruption types that disrupt both camera and LiDAR inputs, either individually or concurrently. Extensive evaluations of six 3D object detection models and four HD map construction models reveal substantial performance degradation under adverse weather conditions and sensor failures, underscoring critical safety issues. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
[CV-58] DynamicLip: Shape-Independent Continuous Authentication via Lip Articulator Dynamics
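[Quick Read]: This paper addresses the limitations of traditional biometric authentication, such as face authentication, with respect to privacy and dynamic scenarios. Traditional methods rely on static facial data, which creates privacy risks, and existing lip-based methods work poorly when the user is speaking or the lips are moving. The paper proposes a shape-independent continuous authentication system based on lip articulator dynamics. The key is to extract dynamic features that do not depend on lip shape and to authenticate using the motion characteristics of the articulators, so that accuracy and robustness hold while the user speaks or the lips move. Experiments across different environments and attack scenarios show an overall accuracy of 99.06% and robustness against advanced mimic attacks and AI deepfake attacks, which makes the system suitable for applications with high security and privacy requirements.
Link: https://arxiv.org/abs/2501.01032
Authors: Huashan Chen, Yifan Xu, Yue Feng, Ming Jian, Feng Liu, Pengfei Hu, Kebin Peng, Sen He, Zi Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: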
Abstract:Biometrics authentication has become increasingly popular due to its security and convenience; however, traditional biometrics are becoming less desirable in scenarios such as new mobile devices, Virtual Reality, and Smart Vehicles. For example, while face authentication is widely used, it suffers from significant privacy concerns. The collection of complete facial data makes it less desirable for privacy-sensitive applications. Lip authentication, on the other hand, has emerged as a promising biometrics method. However, existing lip-based authentication methods heavily depend on static lip shape when the mouth is closed, which can be less robust due to lip shape dynamic motion and can barely work when the user is speaking. In this paper, we revisit the nature of lip biometrics and extract shape-independent features from the lips. We study the dynamic characteristics of lip biometrics based on articulator motion. Building on the knowledge, we propose a system for shape-independent continuous authentication via lip articulator dynamics. This system enables robust, shape-independent and continuous authentication, making it particularly suitable for scenarios with high security and privacy requirements. We conducted comprehensive experiments in different environments and attack scenarios and collected a dataset of 50 subjects. The results indicate that our system achieves an overall accuracy of 99.06% and demonstrates robustness under advanced mimic attacks and AI deepfake attacks, making it a viable solution for continuous biometric authentication in various applications.
[CV-59] Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
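[Quick Read]: This paper addresses Transformer-based binocular stereo matching, in particular insufficient nonlinear expressiveness and long inference time. Because of the low-rank bottleneck and quadratic complexity of attention, existing methods cannot show enough nonlinear expressiveness within a reasonable time, and they perform poorly under challenging conditions such as reflections and weak textures. The paper proposes the Hadamard Attention Recurrent Stereo Transformer (HART), with three key components: (1) a Hadamard product paradigm for attention that achieves linear computational complexity and speeds up inference; (2) a Dense Attention Kernel (DAK) that amplifies the difference between relevant and irrelevant feature responses, sharpening the focus on important details and mitigating the loss of expressiveness caused by the low-rank bottleneck; (3) the MKOI module, which interleaves large-kernel and small-kernel convolutions to capture global and local information and compensates for the spatial and channel interaction that the Hadamard product lacks. Experiments show that HART performs strongly in reflective regions of the KITTI 2012 benchmark, ranking first among all published methods at the time of submission.
Link: https://arxiv.org/abs/2501.01023
Authors: Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yabo Wu, Yong Zhao, C.L. Philip Chen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: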
Abstract: In light of the advancements in transformer technology, extant research posits the construction of stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points renders the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, a slow computing speed is not conducive to the application. To overcome these difficulties, we present the Hadamard Attention Recurrent Stereo Transformer (HART) that incorporates the following components: 1) For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. This allows HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing in the Hadamard product, we propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions. Experimental results demonstrate the effectiveness of our HART. In reflective area, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at this https URL.
[CV-60] Efficient Connectivity-Preserving Instance Segmentation with Supervoxel-Based Loss Function
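[Quick Read]: This paper addresses the reconstruction of the intricate local morphology of neurons and their long-range projecting axons in neuroscience, and in particular the bottleneck of correcting topological errors in connectomics pipelines. Separating multiple entangled neuronal arbors is a challenging instance segmentation problem, so the paper proposes a topology-aware neural network segmentation method grounded in digital topology. The method extends the notion of simple points to connected sets of voxels (supervoxels) and segments curvilinear, filamentous structures with minimal computational overhead. Its effectiveness is demonstrated on a new public dataset of 3D light microscopy images of mouse brains and on the benchmark datasets DRIVE, ISBI12, and CrackTree.
Link: https://arxiv.org/abs/2501.01022
Authors: Anna Grim, Jayaram Chandrashekar, Uygar Sumbul
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: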
Abstract:Reconstructing the intricate local morphology of neurons and their long-range projecting axons can address many connectivity related questions in neuroscience. The main bottleneck in connectomics pipelines is correcting topological errors, as multiple entangled neuronal arbors is a challenging instance segmentation problem. More broadly, segmentation of curvilinear, filamentous structures continues to pose significant challenges. To address this problem, we extend the notion of simple points from digital topology to connected sets of voxels (i.e. supervoxels) and propose a topology-aware neural network segmentation method with minimal computational overhead. We demonstrate its effectiveness on a new public dataset of 3-d light microscopy images of mouse brains, along with the benchmark datasets DRIVE, ISBI12, and CrackTree.
[CV-61] Boosting Adversarial Transferability with Spatial Adversarial Alignment
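[Quick Read]: This paper addresses the limited cross-model transferability of adversarial examples against deep neural networks (DNNs), especially in cross-architecture scenarios such as from convolutional neural networks (CNNs) to vision Transformers (ViTs). Existing approaches, including advanced optimization, data augmentation, and model modification, have improved transferability but still transfer poorly in cross-architecture attacks.
The proposed solution, Spatial Adversarial Alignment (SAA), has two core parts: spatial-aware alignment and adversarial-aware alignment. First, SAA minimizes the feature divergence between the surrogate model and a witness model in both global and local regions, which promotes spatial alignment. Second, SAA introduces a self-adversarial strategy that uses adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the shared features extracted by the witness model and therefore yields more transferable adversarial perturbations. Experiments show that SAA-aligned surrogate models generate more transferable adversarial examples in cross-architecture attacks.
Link: https://arxiv.org/abs/2501.01015
Authors: Zhaoyu Chen, Haijing Guo, Kaixun Jiang, Jiyuan Fu, Xinyu Zhou, Dingkang Yang, Hao Tang, Bo Li, Wenqiang Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: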
Abstract:Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models. Numerous approaches are proposed to enhance the transferability of adversarial examples, including advanced optimization, data augmentation, and model modifications. However, these methods still show limited transferability, particularly in cross-architecture scenarios, such as from CNN to ViT. To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model. Specifically, SAA consists of two key parts: spatial-aware alignment and adversarial-aware alignment. First, we minimize the divergences of features between the two models in both global and local regions, facilitating spatial alignment. Second, we introduce a self-adversarial strategy that leverages adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the common features extracted by the witness model. This facilitates adversarial attacks on these shared features, thereby yielding perturbations that exhibit enhanced transferability. Extensive experiments on various architectures on ImageNet show that aligned surrogate models based on SAA can provide higher transferable adversarial examples, especially in cross-architecture attacks.
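The spatial-aware part, minimizing feature divergence in both global and local regions, can be illustrated with a toy loss that adds a global term (distance between average-pooled features) to a local term (mean per-location distance). This is a sketch under stated assumptions: cosine distance, mean pooling, the `local_weight` parameter, and all function names are illustrative choices, not the divergence or weighting used in the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def mean_pool(feature_map):
    """Average a list of per-location feature vectors into one global vector."""
    count = len(feature_map)
    channels = len(feature_map[0])
    return [sum(loc[c] for loc in feature_map) / count for c in range(channels)]


def spatial_alignment_loss(surrogate, witness, local_weight=1.0):
    """Global + local feature divergence between two models' feature maps.

    Each feature map is a list of spatial locations, each a list of channels.
    """
    global_term = cosine_distance(mean_pool(surrogate), mean_pool(witness))
    local_term = sum(
        cosine_distance(s, w) for s, w in zip(surrogate, witness)
    ) / len(surrogate)
    return global_term + local_weight * local_term
```

The local term matters because two feature maps can agree after pooling while disagreeing at every location, as when the same activations appear in swapped positions.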
[CV-62] EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy
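[Quick read]: This paper addresses two main challenges of 3D Gaussian Splatting (3DGS) for scene representation: the limitations of Structure-from-Motion (SfM) methods in obtaining an accurate scene initialization, and the inefficiency of existing densification strategies. It proposes a new framework named EasySplat. First, it replaces SfM-based initialization with an approach built on large-scale pointmap methods, using an efficient grouping strategy based on view similarity together with robust pointmap priors to obtain high-quality point clouds and camera poses. Second, it proposes an adaptive densification method that uses a KNN (K-Nearest Neighbors) scheme to split Gaussian primitives according to the average shape of neighboring Gaussian ellipsoids. With these two changes, EasySplat overcomes the limitations of existing methods in initialization and optimization and achieves efficient, accurate 3DGS modeling. Experiments show that EasySplat outperforms the current state of the art in novel view synthesis.
Link: https://arxiv.org/abs/2501.01003
Authors: Ao Gao,Luosong Guo,Tao Chen,Zhao Wang,Ying Tai,Jian Yang,Zhenyu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures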
Abstract:3D Gaussian Splatting (3DGS) techniques have achieved satisfactory 3D scene representation. Despite their impressive performance, they confront challenges due to the limitation of structure-from-motion (SfM) methods on acquiring accurate scene initialization, or the inefficiency of densification strategy. In this paper, we introduce a novel framework EasySplat to achieve high-quality 3DGS modeling. Instead of using SfM for scene initialization, we employ a novel method to release the power of large-scale pointmap approaches. Specifically, we propose an efficient grouping strategy based on view similarity, and use robust pointmap priors to obtain high-quality point clouds and camera poses for 3D scene initialization. After obtaining a reliable scene structure, we propose a novel densification approach that adaptively splits Gaussian primitives based on the average shape of neighboring Gaussian ellipsoids, utilizing KNN scheme. In this way, the proposed method tackles the limitation on initialization and optimization, leading to an efficient and accurate 3DGS modeling. Extensive experiments demonstrate that EasySplat outperforms the current state-of-the-art (SOTA) in handling novel view synthesis.
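As a rough illustration of the KNN-based densification idea, the sketch below flags a Gaussian for splitting when it is much larger than its nearest neighbors. The abstract does not give the actual criterion, so everything here is an assumption: the function name `knn_split_flags`, the use of the largest axis scale as a stand-in for "shape", and the `ratio` threshold are all hypothetical.

```python
import math

def knn_split_flags(centers, scales, k=2, ratio=1.5):
    """Flag Gaussians that are much larger than their k nearest neighbors.

    centers: list of (x, y, z) positions; scales: list of per-axis scale tuples.
    Hypothetical rule (not from the paper): split when a Gaussian's largest
    axis exceeds `ratio` times the mean largest axis of its k neighbors.
    """
    flags = []
    for i, center in enumerate(centers):
        others = [j for j in range(len(centers)) if j != i]
        # k nearest neighbors by Euclidean distance between centers
        neighbors = sorted(others, key=lambda j: math.dist(center, centers[j]))[:k]
        if not neighbors:
            flags.append(False)
            continue
        avg_size = sum(max(scales[j]) for j in neighbors) / len(neighbors)
        flags.append(max(scales[i]) > ratio * avg_size)
    return flags
```

A brute-force neighbor search is used for clarity; a real pipeline with millions of Gaussians would likely need a spatial index instead.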
[CV-63] CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation
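[Quick read]: This paper targets the long-standing goal in video compression of achieving better quality at lower bit rates. Traditional transform-based methods have limitations, and Implicit Neural Representation (INR) has emerged as a promising alternative. The key to the proposed solution is CoordFlow, a pixel-wise INR method. It separates the visual information into visually consistent layers, each represented by a dedicated network that compensates for that layer's motion. This layered treatment reduces visual-temporal redundancy and implicitly exploits object motion trajectories. CoordFlow also provides inherent video upsampling, stabilization, inpainting, and denoising, and yields an unsupervised segmentation of the video sequence as a byproduct. It achieves state-of-the-art results among pixel-wise INR methods and performs on par with leading frame-wise methods.
Link: https://arxiv.org/abs/2501.00975
Authors: Daniel Silver,Ron Kimmel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: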
Abstract:In the field of video compression, the pursuit for better quality at lower bit rates remains a long-lasting goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure the network outputs. While the pixel-based methods are better for upsampling and parallelization, frame-wise methods demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer’s motion. When integrated, a byproduct is an unsupervised segmentation of video sequence. Objects motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
[CV-64] OASIS Uncovers: High-Quality T2I Models Same Old Stereotypes
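[Quick read]: This paper addresses the visual biases and stereotypes present in images generated by text-to-image (T2I) models. Existing quantitative measures are based on statistical parity, which does not match the sociological definition of stereotypes and therefore misclassifies biases as stereotypes. The paper proposes a quantitative measure of stereotypes that is consistent with the sociological definition, and develops OASIS to measure stereotypes in a generated dataset and trace their origins within the T2I model. OASIS includes two scores: M1 (Stereotype Score), which measures the distributional violation of stereotypical attributes, and M2 (WALS), which measures spectral variance in the images along a stereotypical attribute. It also includes two methods for understanding where stereotypes originate: U1 (StOP), which discovers attributes that the T2I model internally associates with a given concept, and U2 (SPI), which quantifies the emergence of stereotypical attributes in the model's latent space during image generation. Using OASIS, the paper concludes that, despite considerable gains in image fidelity, newer T2I models such as FLUX.1 and SDv3 still carry strong stereotypical predispositions and generate images with widespread stereotypical attributes, and that stereotypes are more prevalent for nationalities with smaller Internet footprints.
Link: https://arxiv.org/abs/2501.00962
Authors: Sepehr Dehdashtian,Gautam Sreekumar,Vishnu Naresh Boddeti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: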
Abstract:Images generated by text-to-image (T2I) models often exhibit visual biases and stereotypes of concepts such as culture and profession. Existing quantitative measures of stereotypes are based on statistical parity that does not align with the sociological definition of stereotypes and, therefore, incorrectly categorizes biases as stereotypes. Instead of oversimplifying stereotypes as biases, we propose a quantitative measure of stereotypes that aligns with its sociological definition. We then propose OASIS to measure the stereotypes in a generated dataset and understand their origins within the T2I model. OASIS includes two scores to measure stereotypes from a generated image dataset: (M1) Stereotype Score to measure the distributional violation of stereotypical attributes, and (M2) WALS to measure spectral variance in the images along a stereotypical attribute. OASIS also includes two methods to understand the origins of stereotypes in T2I models: (U1) StOP to discover attributes that the T2I model internally associates with a given concept, and (U2) SPI to quantify the emergence of stereotypical attributes in the latent space of the T2I model during image generation. Despite the considerable progress in image fidelity, using OASIS, we conclude that newer T2I models such as FLUX.1 and SDv3 contain strong stereotypical predispositions about concepts and still generate images with widespread stereotypical attributes. Additionally, the quantity of stereotypes worsens for nationalities with lower Internet footprints.
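[CV-65] The Silent Majority: Demystifying Memorization Effect in the Presence of Spurious Correlations
[Quick read]: This paper addresses the imbalanced test performance between minority and majority groups that arises when machine learning models rely on simple but spurious features in the training data. Such features correlate with the target but are not causally related to it, for example image backgrounds in foreground classification. Through the lens of memorization, the paper examines why a model can predict atypical examples (minority groups) accurately on the training set yet fail to reach the same accuracy on the test set. The key finding is that spurious memorization is concentrated in a small set of neurons or channels in the network. Building on this evidence, the paper proposes a new training framework that eliminates these unnecessary spurious memorization patterns, and shows that doing so significantly affects model performance on minority groups. The findings lay groundwork for future research on robustness to spurious correlations.
Link: https://arxiv.org/abs/2501.00961
Authors: Chenyu You,Haocheng Dai,Yifei Min,Jasjeet S. Sekhon,Sarang Joshi,James S. Duncan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: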
Abstract:Machine learning models often rely on simple spurious features – patterns in training data that correlate with targets but are not causally related to them, like image backgrounds in foreground classification. This reliance typically leads to imbalanced test performance across minority and majority groups. In this work, we take a closer look at the fundamental cause of such imbalanced performance through the lens of memorization, which refers to the ability to predict accurately on atypical examples (minority groups) in the training set but failing in achieving the same accuracy in the testing set. This paper systematically shows the ubiquitous existence of spurious features in a small set of neurons within the network, providing the first-ever evidence that memorization may contribute to imbalanced group performance. Through three experimental sources of converging empirical evidence, we find the property of a small subset of neurons or channels in memorizing minority group information. Inspired by these findings, we articulate the hypothesis: the imbalanced group performance is a byproduct of "noisy" spurious memorization confined to a small set of neurons. To further substantiate this hypothesis, we show that eliminating these unnecessary spurious memorization patterns via a novel framework during training can significantly affect the model performance on minority groups. Our experimental results across various architectures and benchmarks offer new insights on how neural networks encode core and spurious knowledge, laying the groundwork for future research in demystifying robustness to spurious correlation.
[CV-66] Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model
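[Quick read]: This paper addresses the high computational cost and slow inference of diffusion models when generating high-quality, high-dimensional images, a bottleneck caused in part by the quadratic complexity of self-attention. It proposes Cached Adaptive Token Merging (CA-ToMe). The method computes the similarity between tokens and merges the most similar ones, reducing the number of tokens fed into self-attention. It further adds an adaptive threshold and a caching mechanism: the adaptive threshold adjusts the merging decision as similarity varies, while the cache stores similar token pairs across adjacent steps to avoid recomputing them. Experiments show that CA-ToMe, a training-free acceleration method, achieves a 1.24x speedup in the denoising process while maintaining the same FID (Fréchet Inception Distance) scores as existing approaches.
Link: https://arxiv.org/abs/2501.00946
Authors: Omid Saghatchian,Atiyeh Gh. Moghadam,Ahmad Nickabadi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: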
Abstract:Diffusion models have emerged as a promising approach for generating high-quality, high-dimensional images. Nevertheless, these models are hindered by their high computational cost and slow inference, partly due to the quadratic computational complexity of the self-attention mechanisms with respect to input size. Various approaches have been proposed to address this drawback. One such approach focuses on reducing the number of tokens fed into the self-attention, known as token merging (ToMe). In our method, which is called cached adaptive token merging(CA-ToMe), we calculate the similarity between tokens and then merge the r proportion of the most similar tokens. However, due to the repetitive patterns observed in adjacent steps and the variation in the frequency of similarities, we aim to enhance this approach by implementing an adaptive threshold for merging tokens and adding a caching mechanism that stores similar pairs across several adjacent steps. Empirical results demonstrate that our method operates as a training-free acceleration method, achieving a speedup factor of 1.24 in the denoising process while maintaining the same FID scores compared to existing approaches.
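The merge-and-cache idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are split into odd-indexed sources and even-indexed destinations (in the spirit of ToMe's bipartite matching), a fixed similarity `threshold` stands in for the paper's adaptive one, and the names `find_pairs`, `merge`, and `CachedMerger` are hypothetical. It also assumes the token count stays constant between cache refreshes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_pairs(tokens, threshold):
    """Match each odd-indexed token to its most similar even-indexed token."""
    dst = range(0, len(tokens), 2)
    pairs = []
    for s in range(1, len(tokens), 2):
        best = max(dst, key=lambda d: cosine(tokens[s], tokens[d]))
        if cosine(tokens[s], tokens[best]) >= threshold:
            pairs.append((s, best))
    return pairs

def merge(tokens, pairs):
    """Average every matched source token into its destination token."""
    groups, merged_src = {}, set()
    for s, d in pairs:
        groups.setdefault(d, [tokens[d]]).append(tokens[s])
        merged_src.add(s)
    out = []
    for i, tok in enumerate(tokens):
        if i in merged_src:
            continue  # this token was absorbed by its destination
        members = groups.get(i, [tok])
        out.append([sum(col) / len(members) for col in zip(*members)])
    return out

class CachedMerger:
    """Reuse the matched pairs for several steps instead of recomputing them."""
    def __init__(self, threshold=0.9, refresh_every=3):
        self.threshold = threshold
        self.refresh_every = refresh_every
        self.pairs = None
        self.step = 0
        self.recomputed = 0  # how many times similarities were recomputed

    def __call__(self, tokens):
        if self.pairs is None or self.step % self.refresh_every == 0:
            self.pairs = find_pairs(tokens, self.threshold)
            self.recomputed += 1
        self.step += 1
        return merge(tokens, self.pairs)
```

Calling a `CachedMerger` once per denoising step recomputes similarities only every `refresh_every` steps, which is where the saving over plain token merging would come from.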
[CV-67] Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion
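[Quick read]: This paper addresses the limited diversity of images produced by diffusion models in image-to-image synthesis when the input images are sparse and have low entropy, a constraint that significantly interferes with data augmentation. It proposes Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. The key observation is that introducing a small amount of artificial noise significantly assists the image-denoising process and so improves the diversity of the generated images. The paper demonstrates the method on nano-dendritic patterns and extends it to other biological patterns, highlighting potential applications across various fields.
Link: https://arxiv.org/abs/2501.00944
Authors: Hao Wang,Xiwen Chen,Ashish Bastola,Jiayou Qin,Abolfazl Razi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: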
Abstract:The emergence of generative AI and controllable diffusion has made image-to-image synthesis increasingly practical and efficient. However, when input images exhibit low entropy and sparse, the inherent characteristics of diffusion models often result in limited diversity. This constraint significantly interferes with data augmentation. To address this, we propose Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. We explored that a small amount of artificial noise will significantly assist the image-denoising process. To prove this novel mask-to-image concept, we use nano-dendritic patterns as an example to demonstrate the merit of our method compared to existing controllable diffusion models. Furthermore, we extend the proposed framework to other biological patterns, highlighting its potential applications across various fields.
[CV-68] Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers
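[Quick read]: This paper addresses shortcut learning in machine learning models, that is, a model's over-reliance on features that are not relevant to the task, which undermines real-world performance and is especially problematic for sensitive decisions such as medical diagnostics. The authors propose an unsupervised framework that both detects and mitigates shortcut learning in Transformer models. Drawing on recent advances in machine learning, the framework identifies and removes these irrelevant features automatically, significantly improving both worst-group accuracy and average accuracy while minimizing human annotation effort. The framework is also computationally efficient enough to run on consumer hardware, and the shortcuts it detects are meaningful to human experts.
Link: https://arxiv.org/abs/2501.00942
Authors: Lukas Kuhn,Sari Sadiya,Jorg Schlotterer,Christin Seifert,Gemma Roig
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: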
Abstract:Shortcut learning, i.e., a model’s reliance on undesired features not directly relevant to the task, is a major challenge that severely limits the applications of machine learning algorithms, particularly when deploying them to assist in making sensitive decisions, such as in medical diagnostics. In this work, we leverage recent advancements in machine learning to create an unsupervised framework that is capable of both detecting and mitigating shortcut learning in transformers. We validate our method on multiple datasets. Results demonstrate that our framework significantly improves both worst-group accuracy (samples misclassified due to shortcuts) and average accuracy, while minimizing human annotation effort. Moreover, we demonstrate that the detected shortcuts are meaningful and informative to human experts, and that our framework is computationally efficient, allowing it to be run on consumer hardware.
[CV-69] A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset AAAI2025
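[Quick read]: This paper addresses multi-modal pairwise data generation in scientific computing, particularly when data are scarce and the modalities are unbalanced. Many scientific computing tasks combine several data modalities to describe a physical phenomenon, such as spatial and waveform data in seismic imaging, time and frequency data in signal processing, and temporal and spectral data in climate modeling. Existing generative AI techniques, however, usually focus on single-modal generation (such as natural images) and struggle with multi-modal generation, especially when the available data are unbalanced across modalities (for example, spatial data in seismic imaging are easy to simulate, but real-world waveform data are scarce).
The proposed solution is "UB-Diff", a diffusion-model-based method for multi-modal pairwise data generation. Its key innovation is a one-in-two-out encoder-decoder network structure that produces paired multi-modal data from a co-latent representation. The diffusion process then operates on this co-latent representation, allowing UB-Diff to generate reliable and useful multi-modal paired data. Experiments show that UB-Diff significantly outperforms existing techniques in Fréchet Inception Distance (FID) score and in pairwise evaluation.
Link: https://arxiv.org/abs/2501.00941
Authors: Junhuan Yang,Yuzhou Zhang,Yi Sheng,Youzuo Lin,Lei Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)
Comments: Accepted at AAAI 2025. This is the preprint version. Keywords: Multi-modal generation, diffusion models, scientific data generation, unbalanced modalities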
Abstract:Recently, the advent of generative AI technologies has made transformational impacts on our daily lives, yet its application in scientific applications remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is highly required instead of single-modal data generation, which is usually used in natural images (e.g., faces, scenery). Moreover, in real-world applications, the unbalance of available data in terms of modalities commonly exists; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveform is largely lacking. While the most recent efforts enable the powerful diffusion model to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present "UB-Diff", a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which can ensure pairwise data is obtained from a co-latent representation. Then, the co-latent representation will be used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.
[CV-70] Multiscaled Multi-Head Attention-based Video Transformer Network for Hand Gesture Recognition
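[Quick read]: This paper addresses dynamic gesture recognition, which is challenging because of variations in the pose, size, and shape of the signer's hand. It proposes the Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN). The model uses the transformer's multiscaled multi-head attention to extract a pyramidal hierarchy of multiscale features, assigning a different attention dimension to each head so that attention is provided at multiple scales. The paper also examines recognition performance with multiple modalities in addition to a single modality. Experiments show that MsMHA-VTN reaches overall accuracies of 88.22% on NVGesture and 99.10% on Briareo, outperforming existing methods.
Link: https://arxiv.org/abs/2501.00935
Authors: Mallika Garg,Debashis Ghosh,Pyari Mohan Pradhan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: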
Abstract:Dynamic gesture recognition is one of the challenging research areas due to variations in pose, size, and shape of the signer’s hand. In this letter, Multiscaled Multi-Head Attention Video Transformer Network (MsMHA-VTN) for dynamic hand gesture recognition is proposed. A pyramidal hierarchy of multiscale features is extracted using the transformer multiscaled head attention model. The proposed model employs different attention dimensions for each head of the transformer which enables it to provide attention at the multiscale level. Further, in addition to single modality, recognition performance using multiple modalities is examined. Extensive experiments demonstrate the superior performance of the proposed MsMHA-VTN with an overall accuracy of 88.22% and 99.10% on NVGesture and Briareo datasets, respectively.
[CV-71] Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models
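[Quick read]: This paper addresses the problem of aligning complex textual descriptions with high-quality, visually coherent images in text-to-image generation. Existing generative AI methods often fail to capture the fine semantic details of the text when generating complex scenes, so the resulting images fall short in visual quality and semantic consistency. The paper proposes the Vision-Language Aligned Diffusion (VLAD) model, whose solution has two key parts: a Contextual Composition Module (CCM) that decomposes text prompts into global and local representations to ensure precise alignment with visual features, and a multi-stage diffusion process with hierarchical guidance that generates high-fidelity images. Experiments show that VLAD significantly outperforms existing methods in image quality, semantic alignment, and text rendering accuracy.
Link: https://arxiv.org/abs/2501.00917
Authors: Emily Johnson,Noah Wilson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: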
Abstract:Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments conducted on MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in terms of image quality, semantic alignment, and text rendering accuracy. Human evaluations further validate the superior performance of VLAD, making it a promising approach for text-to-image generation in complex scenarios.
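[CV-72] Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model
[Quick read]: This paper addresses the lack of large-scale text-to-image generation technology in remote sensing. Existing remote sensing image-text datasets are small and confined to specific geographic areas and scene types, and existing methods struggle to achieve global-scale, multi-resolution controllable, and unbounded image generation. The paper makes two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale dataset of 10 million image-text pairs that covers a wide range of geographic scenes and includes resolution information, significantly surpassing existing datasets in size and diversity. Building on Git-10M, the paper proposes Text2Earth, a 1.3-billion-parameter generative foundation model based on the diffusion framework for modeling global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism that lets users specify the image resolution, and uses a dynamic condition adaptation strategy to improve image quality. The model performs well on zero-shot text-to-image generation, unbounded scene construction, image editing, and cross-modal image generation, surpassing earlier models that were restricted to a fixed size and limited scene types.
Link: https://arxiv.org/abs/2501.00895
Authors: Chenyang Liu,Keyan Chen,Rui Zhao,Zhengxia Zou,Zhenwei Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: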
Abstract:Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types. On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA. The project page is at this https URL
[CV-73] FullTransNet: Full Transformer with Local-Global Attention for Video Summarization
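[Quick read]: This paper addresses video summarization, that is, producing a compact, short, and representative synopsis of a raw video to make its content easier to browse, analyze, and understand. Dominant approaches are generally based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs), and more recently on encoder-only Transformers. The paper instead proposes using a full Transformer, whose encoder-decoder structure is naturally suited to sequence-to-sequence learning problems. The key ideas are to apply the full Transformer directly to video summarization and to replace full attention with a combination of local and global sparse attention, which models long-range dependencies while reducing computational cost. The resulting architecture, named FullTransNet, uses local-global sparse attention on the encoder side only. Extensive experiments on two public multimedia benchmark datasets, SumMe and TVSum, show that the model outperforms other video summarization approaches with relatively low compute and memory requirements, verifying its effectiveness and efficiency.
Link: https://arxiv.org/abs/2501.00882
Authors: Libin Lan,Lu Jiang,Tianshu Yu,Xiaojuan Liu,Zhongshi He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures, 4 tables; The code is at this https URL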
Abstract:Video summarization mainly aims to produce a compact, short, informative, and representative synopsis of raw videos, which is of great importance for browsing, analyzing, and understanding video content. Dominant video summarization approaches are generally based on recurrent or convolutional neural networks, even recent encoder-only transformers. We propose using full transformer as an alternative architecture to perform video summarization. The full transformer with an encoder-decoder structure, specifically designed for handling sequence transduction problems, is naturally suitable for video summarization tasks. This work considers supervised video summarization and casts it as a sequence-to-sequence learning problem. Our key idea is to directly apply the full transformer to the video summarization task, which is intuitively sound and effective. Also, considering the efficiency problem, we replace full attention with the combination of local and global sparse attention, which enables modeling long-range dependencies while reducing computational costs. Based on this, we propose a transformer-like architecture, named FullTransNet, which has a full encoder-decoder structure with local-global sparse attention for video summarization. Specifically, both the encoder and decoder in FullTransNet are stacked the same way as ones in the vanilla transformer, and the local-global sparse attention is used only at the encoder side. Extensive experiments on two public multimedia benchmark datasets SumMe and TVSum demonstrate that our proposed model can outperform other video summarization approaches, achieving F-Measures of 54.4% on SumMe and 63.9% on TVSum with relatively lower compute and memory requirements, verifying its effectiveness and efficiency. The code and models are publicly available on GitHub.
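To make "local-global sparse attention" concrete, here is a minimal sketch of one common way to build such a mask: every position attends to a sliding window around itself, and a few designated global positions attend to, and are attended by, everything. The abstract does not specify the exact pattern FullTransNet uses, so the function name `local_global_mask`, the `window` radius, and the choice of global positions are assumptions for illustration.

```python
def local_global_mask(n, window, global_idx):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    Local part: |i - j| <= window (a sliding band around the diagonal).
    Global part: any row or column whose index is in global_idx is fully open.
    """
    g = set(global_idx)
    return [
        [abs(i - j) <= window or i in g or j in g for j in range(n)]
        for i in range(n)
    ]

# Example: 6 frames, window radius 1, frame 0 acts as a global token.
mask = local_global_mask(6, window=1, global_idx=[0])
allowed = sum(sum(row) for row in mask)  # number of query-key pairs kept
```

With a fixed window and a fixed number of global positions, the count of allowed pairs grows linearly with sequence length rather than quadratically, which is the source of the compute and memory savings.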
[CV-74] Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
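[Quick read]: This paper addresses the fact that existing LLM-based visual generation methods rarely examine the fundamental differences between language and vision, which leads to suboptimal use of visual generation capabilities within the LLM framework. It proposes an improved autoregressive visual generation method (IAR), built on the observation that correlation within the visual embedding space can make generation more stable and robust. The solution has two parts: 1) a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters so that visual features within each cluster are highly similar; and 2) a Cluster-oriented Cross-entropy Loss that guides the model to predict the cluster in which the token lies, so that even when the predicted token index is wrong, the predicted token is likely to fall in the correct cluster, which significantly improves generation quality and robustness. Experiments show that the method consistently improves training efficiency and performance for models from 100M to 1.4B parameters, halving training time while reaching the same FID (Fréchet Inception Distance).
Link: https://arxiv.org/abs/2501.00880
Authors: Teng Hu,Jiangning Zhang,Ran Yi,Jieyu Weng,Yabiao Wang,Xianfang Zeng,Zhucun Xue,Lizhuang Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: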
Abstract:Employing LLMs for visual generation has recently become a research focus. However, the existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation.
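One plausible reading of a cluster-oriented cross-entropy is the usual token-level loss plus a second term on the total probability assigned to the target's cluster. The sketch below follows that reading. It is an assumption, not the paper's formula: it supposes the rearranged codebook places each cluster in a contiguous block of `cluster_size` indices, and the `weight` parameter and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cluster_oriented_loss(logits, target, cluster_size, weight=1.0):
    """Token cross-entropy plus a cross-entropy on the target's cluster.

    Assumes token indices are ordered so that cluster c owns indices
    [c * cluster_size, (c + 1) * cluster_size).
    """
    probs = softmax(logits)
    token_ce = -math.log(probs[target])
    c = target // cluster_size
    # Probability mass the model places anywhere inside the target's cluster
    cluster_prob = sum(probs[c * cluster_size:(c + 1) * cluster_size])
    cluster_ce = -math.log(cluster_prob)
    return token_ce + weight * cluster_ce

# A wrong guess inside the right cluster is penalized less than one outside it.
near_miss = cluster_oriented_loss([0.0, 5.0, 0.0, 0.0], target=0, cluster_size=2)
far_miss = cluster_oriented_loss([0.0, 0.0, 5.0, 0.0], target=0, cluster_size=2)
```

Both mispredictions give the target token the same probability, so the difference between `near_miss` and `far_miss` comes entirely from the cluster term.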
[CV-75] FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation
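[Quick read]: This paper addresses two key problems in open-vocabulary segmentation: existing vision-language models (VLMs) such as CLIP perform well at image-level vision-text alignment but cannot provide fine-grained pixel-level alignment or detailed category boundary information, and so information extracted directly from VLMs does not meet the needs of segmentation. The paper proposes FGAseg, which is built around two modules. The Pixel-Level Alignment module uses a cross-modal attention mechanism and a text-pixel alignment loss to refine CLIP's coarse-grained alignment into finer-grained pixel-text semantic alignment. The Category Information Supplementation module introduces optimizable pseudo-masks that provide global and local boundary information between categories. By combining the two, FGAseg improves both pixel-level alignment and category boundary information and outperforms existing methods on open-vocabulary semantic segmentation benchmarks.
Link: https://arxiv.org/abs/2501.00877
Authors: Bingyu Li,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: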
Abstract:Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs can’t meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo-masks during forward propagation and propose Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.
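The cosine-similarity half of the pseudo-mask idea can be illustrated as follows: compare each pixel embedding with every class text embedding and keep the best-matching class. This is only a sketch of that one ingredient. The paper's pseudo-masks are optimizable and also use convolutional similarity, neither of which is modeled here, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pseudo_mask(pixel_feats, text_embeds):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_feats: H x W x D nested lists; text_embeds: C x D nested lists.
    Returns an H x W grid of class indices (a hard pseudo-mask).
    """
    classes = range(len(text_embeds))
    return [
        [max(classes, key=lambda c: cosine(px, text_embeds[c])) for px in row]
        for row in pixel_feats
    ]
```

In practice the per-class similarity scores would be kept as soft maps so that they can be refined during training; the hard argmax is used here only to keep the example short.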
[CV-76] Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation NEURIPS2024
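[Quick read]: This paper explores how to combine the complementary strengths of generative and discriminative models in machine learning, specifically how the semantic structure of generative models can improve discriminative models. The proposed method, DUSA, reveals the hidden semantic structure within score-based (diffusion) generative models and uses it as an effective discriminative prior to facilitate the test-time adaptation of image classifiers and dense predictors. Its key innovation is that it extracts knowledge from a single timestep of denoising diffusion, avoiding the computational burden of Monte Carlo-based likelihood estimation over many timesteps. Extensive experiments show that DUSA adapts a wide variety of pre-trained discriminative models across diverse test-time scenarios, and an ablation study analyzes its pivotal elements.
Link: https://arxiv.org/abs/2501.00873
Authors: Mingjia Li,Shuang Li,Tongrui Su,Longhui Yuan,Jian Liang,Wei Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024. Project page: this https URL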
Abstract:Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at this https URL.
[CV-77] Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation
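[Quick read]: This paper addresses two main problems in referring remote sensing image segmentation (RRSIS). First, existing methods mainly refine visual features with language-aware guidance during cross-modal fusion and neglect the complementary vision-to-language flow, which can produce irrelevant or suboptimal representations. Second, the diverse spatial scales of ground objects in remote sensing images challenge the visual perception of existing models, especially when conditioned on textual input. The paper proposes a framework called Scale-wise Bidirectional Alignment Network (SBANet). Its key components are: 1) a Bidirectional Alignment Module (BAM) that uses learnable query tokens to selectively represent visual and linguistic features and to emphasize regions associated with key tokens; 2) a dynamic feature selection block that provides macro- and micro-level visual features, preserving global context and local details for more effective cross-modal interaction; and 3) a text-conditioned channel and spatial aggregator that strengthens cross-scale information exchange between the encoder and decoder. Experiments show that SBANet outperforms previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
Link: https://arxiv.org/abs/2501.00851
Authors: Kun Li,George Vosselman,Michael Ying Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review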
Abstract:The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. Recent advancements, particularly Transformer-based fusion designs, have demonstrated remarkable progress in this domain. However, existing methods primarily focus on refining visual features using language-aware guidance during the cross-modal fusion stage, neglecting the complementary vision-to-language flow. This limitation often leads to irrelevant or suboptimal representations. In addition, the diverse spatial scales of ground objects in aerial images pose significant challenges to the visual perception capabilities of existing models when conditioned on textual inputs. In this paper, we propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges for RRSIS. Specifically, we design a Bidirectional Alignment Module (BAM) with learnable query tokens to selectively and effectively represent visual and linguistic features, emphasizing regions associated with key tokens. BAM is further enhanced with a dynamic feature selection block, designed to provide both macro- and micro-level visual features, preserving global context and local details to facilitate more effective cross-modal interaction. Furthermore, SBANet incorporates a text-conditioned channel and spatial aggregator to bridge the gap between the encoder and decoder, enhancing cross-scale information exchange in complex aerial scenarios. Extensive experiments demonstrate that our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both quantitatively and qualitatively. The code will be released after publication.
[CV-78] IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models
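【Quick read】: This paper addresses the poor performance of current vision-language models (VLMs) on visual illusions, especially in real-world scenes. Existing benchmarks focus on classical cognitive illusions, which state-of-the-art (SOTA) VLMs have already learned, and they expose problems such as hallucinations and limited perceptual ability. To close this gap, the authors introduce IllusionBench, a comprehensive visual illusion dataset covering both classical cognitive illusions and real-world scene illusions. It contains 1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions addressing the presence, causes, and content of the illusions. Ten SOTA VLMs are evaluated on true-or-false, multiple-choice, and open-ended tasks. Even the best model, GPT-4o, reaches 80.59% accuracy on true-or-false tasks and 76.75% on multiple-choice questions and still lags behind human performance. The authors also design trap illusions, which resemble classical patterns but differ in reality, further exposing hallucination in SOTA models. To the authors' knowledge, IllusionBench is the largest and most comprehensive benchmark for visual illusions in VLMs to date.
Link: https://arxiv.org/abs/2501.00848
Authors: Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: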
Abstract:Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions, especially in real-world scenarios. Existing benchmarks focus on classical cognitive illusions, which have been learned by state-of-the-art (SOTA) VLMs, revealing issues such as hallucinations and limited perceptual abilities. To address this gap, we introduce IllusionBench, a comprehensive visual illusion dataset that encompasses not only classic cognitive illusions but also real-world scene illusions. This dataset features 1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions that address the presence, causes, and content of the illusions. We evaluate ten SOTA VLMs on this dataset using true-or-false, multiple-choice, and open-ended tasks. In addition to real-world illusions, we design trap illusions that resemble classical patterns but differ in reality, highlighting hallucination issues in SOTA models. The top-performing model, GPT-4o, achieves 80.59% accuracy on true-or-false tasks and 76.75% on multiple-choice questions, but still lags behind human performance. In the semantic description task, GPT-4o’s hallucinations on classical illusions result in low scores for trap illusions, even falling behind some open-source models. IllusionBench is, to the best of our knowledge, the largest and most comprehensive benchmark for visual illusions in VLMs to date.
[CV-79] FusionSORT: Fusion Methods for Online Multi-object Visual Tracking
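【Quick read】: This paper studies how to associate detections with tracklets in multi-object visual tracking. It investigates four fusion methods: minimum, IoU-based weighted sum, Kalman filter (KF) gating, and the Hadamard product of the costs from different cues. Besides strong cues such as motion and appearance, the methods also use weak cues such as height intersection-over-union (height-IoU) and tracklet confidence. Extensive evaluation on the validation sets of MOT17, MOT20, and DanceTrack shows that the choice of fusion method is key to data association. The paper's main contribution is guidance for the computer vision community on choosing an appropriate fusion method for data association in multi-object tracking.
Link: https://arxiv.org/abs/2501.00843
Authors: Nathanael L. Baisa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: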
Abstract:In this work, we investigate four different fusion methods for associating detections to tracklets in multi-object visual tracking. In addition to considering strong cues such as motion and appearance information, we also consider weak cues such as height intersection-over-union (height-IoU) and tracklet confidence information in the data association using different fusion methods. These fusion methods include minimum, weighted sum based on IoU, Kalman filter (KF) gating, and hadamard product of costs due to the different cues. We conduct extensive evaluations on validation sets of MOT17, MOT20 and DanceTrack datasets, and find out that the choice of a fusion method is key for data association in multi-object visual tracking. We hope that this investigative work helps the computer vision research community to use the right fusion method for data association in multi-object visual tracking.
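The four fusion strategies named in the abstract can be illustrated on small cost matrices. The sketch below is an illustration only, not the paper's implementation: the function names are hypothetical, the exact form of the IoU-based weighting (IoU as the weight on the motion cost, 1 - IoU on the appearance cost) is an assumption, and the gating threshold is simply the chi-square 0.95 quantile for four degrees of freedom that SORT-style trackers commonly use.

```python
from typing import List

Matrix = List[List[float]]  # rows: tracklets, columns: detections


def fuse_min(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise minimum of two cost matrices."""
    return [[min(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise (Hadamard) product of two cost matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_iou_weighted(motion: Matrix, appearance: Matrix, iou: Matrix) -> Matrix:
    """Weighted sum with the IoU as weight (assumed form): high overlap
    trusts the motion cost, low overlap trusts the appearance cost."""
    return [
        [w * m + (1.0 - w) * a for m, a, w in zip(rm, ra, rw)]
        for rm, ra, rw in zip(motion, appearance, iou)
    ]


def gate(cost: Matrix, gating_dist: Matrix, threshold: float, big: float = 1e5) -> Matrix:
    """KF-style gating: pairs whose gating distance exceeds the threshold
    are made infeasible by assigning a large constant cost."""
    return [
        [big if d > threshold else c for c, d in zip(rc, rd)]
        for rc, rd in zip(cost, gating_dist)
    ]


# Toy inputs for two tracklets and two detections (illustrative values).
motion_cost = [[0.2, 0.9], [0.8, 0.1]]
appearance_cost = [[0.4, 0.5], [0.6, 0.3]]
iou = [[0.5, 0.0], [0.0, 1.0]]

fused_min = fuse_min(motion_cost, appearance_cost)
fused_had = fuse_hadamard(motion_cost, appearance_cost)
fused_w = fuse_iou_weighted(motion_cost, appearance_cost, iou)
gated = gate(fused_min, [[1.0, 12.0], [12.0, 1.0]], threshold=9.4877)
```

Any of the fused matrices could then be passed to an assignment solver. Which fusion works best is the empirical question the paper investigates.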
[CV-80] Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation
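【Quick read】: This paper addresses how to fuse frame data and event data effectively for optical flow estimation. Most current cross-modal methods simply stack the two inputs and fail to exploit their complementary strengths. The key idea of the proposed method is to let the spatially dense frame modality guide the aggregation of the temporally dense event modality. Specifically, the authors build an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events, and use it as the guiding modality. Events capture temporally dense motion information, while the robust motion features derived from the guiding modality direct how that information is aggregated. A transformer-based module complements the sparse event motion features with spatially rich frame information and improves global information propagation, and a mix-fusion encoder extracts comprehensive spatiotemporal context features from both modalities. Experiments on MVSEC and DSEC-Flow show the method's effectiveness, with leading performance on DSEC-Flow: frame guidance improves accuracy by 10% over the event-only model, and the method beats the state-of-the-art fusion-based method with a 4% accuracy gain and a 45% reduction in inference time.
Link: https://arxiv.org/abs/2501.00838
Authors: Qianang Zhou, Junhui Hou, Meiyi Yang, Yongjian Deng, Youfu Li, Junlin Xiong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, under review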
Abstract:Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4% accuracy gain and a 45% reduction in inference time.
[CV-81] Recognizing Artistic Style of Archaeological Image Fragments Using Deep Style Extrapolation
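【Quick read】: This paper addresses the difficulty of classifying fragments of ancient artworks recovered in archaeological excavations. Fragments from different periods or artistic styles are often found mixed on the same site, and each carries only partial information about its source, so classification from visual cues is challenging even for professionals. The paper presents a generalized deep-learning framework that predicts the artistic style of image fragments. It relies on the classification power of modern deep architectures and reports state-of-the-art results for fragments with varying styles and geometries.
Link: https://arxiv.org/abs/2501.00836
Authors: Gur Elkin, Ofir Itzhak Shahar, Yaniv Ohayon, Nadav Alali, Ohad Ben-Shahar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in the 27th International Conference on Human-Computer Interaction (HCII 2025)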
Abstract:Ancient artworks obtained in archaeological excavations usually suffer from a certain degree of fragmentation and physical degradation. Often, fragments of multiple artifacts from different periods or artistic styles could be found on the same site. With each fragment containing only partial information about its source, and pieces from different objects being mixed, categorizing broken artifacts based on their visual cues could be a challenging task, even for professionals. As classification is a common function of many machine learning models, the power of modern architectures can be harnessed for efficient and accurate fragment classification. In this work, we present a generalized deep-learning framework for predicting the artistic style of image fragments, achieving state-of-the-art results for pieces with varying styles and geometries.
[CV-82] SPARNet: Continual Test-Time Adaptation via Sample Partitioning Strategy and Anti-Forgetting Regularization
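【Quick read】: This paper addresses continual test-time adaptation (continual TTA), where a deployed model must keep improving while facing a sequence of domain changes. The main challenges are that the model has to adapt over the long term without knowing when the domain changes, and that noisy pseudo-labels from simple self-training can cause error accumulation and catastrophic forgetting. The proposed framework, SPARNet, has two parts: a sample partitioning strategy and an anti-forgetting regularization. The partitioning strategy splits samples into reliable and unreliable groups and handles each group differently, so that reliable samples contribute more to the model while the negative impact of unreliable samples is removed through mean-teacher consistency learning. A regularization term then limits excessive changes to important parameters, which alleviates catastrophic forgetting and enables long-term adaptation. Extensive experiments on CIFAR10-C, CIFAR100-C, and ImageNet-C demonstrate the method's effectiveness.
Link: https://arxiv.org/abs/2501.00818
Authors: Xinru Meng, Han Sun, Jiamei Liu, Ningzhong Liu, Huiyu Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures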
Abstract:Test-time Adaptation (TTA) aims to improve model performance when the model encounters domain changes after deployment. The standard TTA mainly considers the case where the target domain is static, while the continual TTA needs to undergo a sequence of domain changes. This encounters a significant challenge as the model needs to adapt for the long-term and is unaware of when the domain changes occur. The quality of pseudo-labels is hard to guarantee. Noisy pseudo-labels produced by simple self-training methods can cause error accumulation and catastrophic forgetting. In this work, we propose a new framework named SPARNet which consists of two parts, sample partitioning strategy and anti-forgetting regularization. The sample partition strategy divides samples into two groups, namely reliable samples and unreliable samples. According to the characteristics of each group of samples, we choose different strategies to deal with different groups of samples. This ensures that reliable samples contribute more to the model. At the same time, the negative impacts of unreliable samples are eliminated by the mean teacher’s consistency learning. Finally, we introduce a regularization term to alleviate the catastrophic forgetting problem, which can limit important parameters from excessive changes. This term enables long-term adaptation of parameters in the network. The effectiveness of our method is demonstrated in continual TTA scenario by conducting a large number of experiments on CIFAR10-C, CIFAR100-C and ImageNet-C.
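The three ingredients mentioned in the abstract (sample partitioning, a mean teacher, and a regularizer on important parameters) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SPARNet itself: the abstract does not say how reliability is measured, so the max-softmax confidence test and the threshold `tau` are assumptions, and the importance-weighted quadratic penalty is an EWC-style stand-in for the paper's regularization term. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def partition(batch_logits: List[List[float]], tau: float) -> Tuple[List[int], List[int]]:
    """Split sample indices into (reliable, unreliable) by max softmax
    probability. The criterion and tau are assumptions."""
    reliable: List[int] = []
    unreliable: List[int] = []
    for i, logits in enumerate(batch_logits):
        (reliable if max(softmax(logits)) >= tau else unreliable).append(i)
    return reliable, unreliable


def ema_update(teacher: List[float], student: List[float], alpha: float) -> List[float]:
    """Mean-teacher update: teacher <- alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def anti_forgetting_penalty(
    params: List[float], source_params: List[float], importance: List[float]
) -> float:
    """Importance-weighted squared drift from the source parameters
    (assumed EWC-style form, not taken from the paper)."""
    return sum(w * (p - p0) ** 2 for p, p0, w in zip(params, source_params, importance))


# Toy batch: one confident prediction, one near-uniform prediction.
logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
reliable, unreliable = partition(logits, tau=0.9)
teacher = ema_update([1.0, 1.0], [0.0, 2.0], alpha=0.5)
penalty = anti_forgetting_penalty([1.0, 2.0], [0.0, 2.0], [3.0, 10.0])
```

In a full pipeline the reliable indices would feed a self-training loss, the unreliable ones a teacher-student consistency loss, and the penalty would be added to the total objective.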
[CV-83] MixSA: Training-free Reference-based Sketch Extraction via Mixture-of-Self-Attention
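【Quick read】: This paper addresses the limits of current sketch extraction methods, which either require extensive training or fail to capture a wide range of artistic styles, restricting their practical use. It introduces Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for better sketch perception. At its core, MixSA replaces the keys and values in self-attention layers with those from reference sketches, which integrates brushstroke elements into the initial outline image. This gives precise control over texture density and allows interpolation between styles to create novel, unseen ones. By aligning brushstroke styles with the texture and contours of the colored image, particularly in the late decoder layers that handle local texture, MixSA adjusts the initial outline and avoids the common color-averaging problem. Evaluated with various perceptual metrics, MixSA shows superior sketch quality, flexibility, and applicability, and produces diverse, high-fidelity sketches that reflect a wide range of artistic expression.
Link: https://arxiv.org/abs/2501.00816
Authors: Rui Yang, Xiaojun Wu, Shengfeng He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 25 figures; Accepted by IEEE Transactions on Visualization and Computer Graphics, 2024 (TVCG)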
Abstract:Current sketch extraction methods either require extensive training or fail to capture a wide range of artistic styles, limiting their practical applicability and versatility. We introduce Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for enhanced sketch perception. At its core, MixSA employs a mixture-of-self-attention technique, which manipulates self-attention layers by substituting the keys and values with those from reference sketches. This allows for the seamless integration of brushstroke elements into initial outline images, offering precise control over texture density and enabling interpolation between styles to create novel, unseen styles. By aligning brushstroke styles with the texture and contours of colored images, particularly in late decoder layers handling local textures, MixSA addresses the common issue of color averaging by adjusting initial outlines. Evaluated with various perceptual metrics, MixSA demonstrates superior performance in sketch quality, flexibility, and applicability. This approach not only overcomes the limitations of existing methods but also empowers users to generate diverse, high-fidelity sketches that more accurately reflect a wide range of artistic expressions.
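The key/value substitution described in the abstract can be shown with plain scaled dot-product attention. The sketch below assumes a single attention head on tiny hand-written matrices, and it assumes that style interpolation is a linear blend of the attention outputs obtained from two reference sketches. The blending scheme, the weight `w`, and the function names are illustrative assumptions, not details confirmed by the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    out: Matrix = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


def mixed_attention(
    q: Matrix, k_a: Matrix, v_a: Matrix, k_b: Matrix, v_b: Matrix, w: float
) -> Matrix:
    """Queries from the outline image attend to keys/values taken from
    reference sketches; w blends two reference styles (assumed scheme)."""
    out_a = attention(q, k_a, v_a)
    out_b = attention(q, k_b, v_b)
    return [
        [(1.0 - w) * a + w * b for a, b in zip(ra, rb)]
        for ra, rb in zip(out_a, out_b)
    ]


q = [[1.0, 0.0], [0.0, 1.0]]    # queries from the outline image
k_a = [[1.0, 0.0], [0.0, 1.0]]  # keys from reference sketch A
v_a = [[1.0, 1.0], [1.0, 1.0]]  # values from reference sketch A
k_b = [[0.0, 1.0], [1.0, 0.0]]  # keys from reference sketch B
v_b = [[3.0, 3.0], [3.0, 3.0]]  # values from reference sketch B

only_a = mixed_attention(q, k_a, v_a, k_b, v_b, w=0.0)
blend = mixed_attention(q, k_a, v_a, k_b, v_b, w=0.5)
```

Because the queries stay tied to the outline image while keys and values come from the references, the output keeps the outline's spatial layout and borrows the references' stroke content, which is the intuition the abstract describes.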
[CV-84] Regression Guided Strategy to Automated Facial Beauty Optimization through Image Synthesis
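【Quick read】: This paper addresses the limitations of the rule-based beauty filters traditionally used on social media images. Such methods depend on domain knowledge of the facial features associated with attractiveness and apply very specific transformations to maximize them. As an alternative, the paper projects facial images to points in the latent space of a pre-trained generative adversarial network (GAN) and optimizes those points to produce more attractive faces. The movement of the latent points is guided by a newly developed facial beauty evaluation regression network, which learns to distinguish attractive facial features and outperforms many existing facial beauty evaluation models. Because the approach is data-driven, it captures holistic beauty patterns directly from data rather than from predefined rules, enabling more dynamic and potentially broader facial beauty editing. The work points to a possible new direction for automated aesthetic enhancement that complements existing methods.
Link: https://arxiv.org/abs/2501.00811
Authors: Erik Nguyen, Spencer Htin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Short paper, 5 pages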
Abstract:The use of beauty filters on social media, which enhance the appearance of individuals in images, is a well-researched area, with existing methods proving to be highly effective. Traditionally, such enhancements are performed using rule-based approaches that leverage domain knowledge of facial features associated with attractiveness, applying very specific transformations to maximize these attributes. In this work, we present an alternative approach that projects facial images as points on the latent space of a pre-trained GAN, which are then optimized to produce beautiful faces. The movement of the latent points is guided by a newly developed facial beauty evaluation regression network, which learns to distinguish attractive facial features, outperforming many existing facial beauty evaluation models in this domain. By using this data-driven approach, our method can automatically capture holistic patterns in beauty directly from data rather than relying on predefined rules, enabling more dynamic and potentially broader applications of facial beauty editing. This work demonstrates a potential new direction for automated aesthetic enhancement, offering a complementary alternative to existing methods.
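Regression-guided latent optimization amounts to gradient ascent on a score with respect to the latent code. The sketch below is a toy illustration under heavy assumptions: `toy_score` is a hypothetical quadratic that stands in for the composition of generator and regression network, and the gradient is taken by finite differences rather than backpropagation. None of the names or values come from the paper.

```python
from typing import Callable, List


def numerical_grad(
    f: Callable[[List[float]], float], z: List[float], eps: float = 1e-5
) -> List[float]:
    """Central-difference gradient of f at z."""
    grad = []
    for i in range(len(z)):
        hi = z[:i] + [z[i] + eps] + z[i + 1:]
        lo = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def optimize_latent(
    score: Callable[[List[float]], float],
    z0: List[float],
    steps: int = 200,
    lr: float = 0.1,
) -> List[float]:
    """Gradient ascent on the latent code to raise the predicted score."""
    z = list(z0)
    for _ in range(steps):
        g = numerical_grad(score, z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


def toy_score(z: List[float]) -> float:
    """Hypothetical stand-in for regressor(generator(z)); peaks at (1, -2)."""
    return -((z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2)


z_start = [0.0, 0.0]
z_opt = optimize_latent(toy_score, z_start)
```

With a real generator the optimized latent would be decoded back into an image. A practical system would probably also constrain how far the latent may drift so that identity is preserved, though the abstract does not describe such a term.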
[CV-85] Multimodal Large Models Are Effective Action Anticipators
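【Quick read】: This paper addresses long-term action anticipation, which requires modeling temporal dynamics over extended periods while understanding the semantics of actions. Traditional approaches rely mainly on recurrent units or Transformer layers to capture long-term dependencies and often fall short. The paper introduces the ActionLLM framework, which treats video sequences as successive tokens and uses the sequential modeling ability and commonsense knowledge of large language models (LLMs) to anticipate future actions. Its key points are: a simplified LLM architecture that sets future tokens, adds an action tuning module, and reduces the textual decoder layer to a linear layer, so actions are predicted directly without complex instructions or redundant descriptions; use of LLM commonsense reasoning by predicting action categories for observed frames and guiding semantic understanding with sequential textual clues; and a Cross-Modality Interaction Block that explores the specificity of each modality and captures interactions between the visual and textual modalities to improve multimodal tuning. Experiments on benchmark datasets show the superiority of ActionLLM and suggest a promising direction for applying LLMs to action anticipation.
Link: https://arxiv.org/abs/2501.00795
Authors: Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: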
Abstract:The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at this https URL.
[CV-86] Beyond Words: AuralLLM and SignMST-C for Precise Sign Language Production and Bidirectional Accessibility
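【Quick read】: This paper addresses two main problems in sign language production (SLP) and sign language translation (SLT): existing models fall short in production accuracy and pose control, and high-quality datasets are scarce. First, it introduces two comprehensive datasets, CNText2Sign and CNSign, to benchmark SLP and SLT respectively. CNText2Sign provides gloss and landmark mappings for SLP, and CNSign provides extensive video-to-text data for SLT. Second, it proposes two models, AuraLLM and SignMST-C. AuraLLM combines LoRA (Low-Rank Adaptation) and RAG (Retrieval-Augmented Generation), enabling precise control over gesture semantics and motion, and achieves a BLEU-4 score of 50.41 on CNText2Sign. SignMST-C uses self-supervised rapid-motion video pretraining and achieves BLEU-4 scores of 31.03/32.08 on the PHOENIX2014-T benchmark, a new state of the art. These models establish robust baselines for the datasets released for their respective tasks.
Link: https://arxiv.org/abs/2501.00765
Authors: Yulong Li, Yuxuan Zhang, Feilong Tang, Mian Zhou, Zhixiang Lu, Haochen Xue, Yifang Wang, Kang Dang, Jionglong Su
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: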
Abstract:Although sign language recognition aids non-hearing-impaired understanding, many hearing-impaired individuals still rely on sign language alone due to limited literacy, underscoring the need for advanced sign language production and translation (SLP and SLT) systems. In the field of sign language production, the lack of adequate models and datasets restricts practical applications. Existing models face challenges in production accuracy and pose control, making it difficult to provide fluent sign language expressions across diverse scenarios. Additionally, data resources are scarce, particularly high-quality datasets with complete sign vocabulary and pose annotations. To address these issues, we introduce CNText2Sign and CNSign, comprehensive datasets to benchmark SLP and SLT, respectively, with CNText2Sign covering gloss and landmark mappings for SLP, and CNSign providing extensive video-to-text data for SLT. To improve the accuracy and applicability of sign language systems, we propose the AuraLLM and SignMST-C models. AuraLLM, incorporating LoRA and RAG techniques, achieves a BLEU-4 score of 50.41 on the CNText2Sign dataset, enabling precise control over gesture semantics and motion. SignMST-C employs self-supervised rapid motion video pretraining, achieving a BLEU-4 score of 31.03/32.08 on the PHOENIX2014-T benchmark, setting a new state-of-the-art. These models establish robust baselines for the datasets released for their respective tasks.
[CV-87] Less is More: Token Context-aware Learning for Object Tracking AAAI2025
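【Quick read】: This paper addresses a weakness of existing object tracking methods: they capture context by incorporating multiple video frames but ignore the importance of each patch within a reference frame, which makes them susceptible to noise and redundant tokens and degrades tracking performance. The paper proposes LMTrack, a token context-aware tracking pipeline with two key parts. First, a Token Context Memory module dynamically collects high-quality spatio-temporal information about the target in an autoregressive manner and discards redundant background tokens from the reference frames. Second, a Unidirectional Token Attention mechanism establishes dependencies between the reference tokens and the search frame, enabling robust cross-frame association and target localization. LMTrack achieves state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
Link: https://arxiv.org/abs/2501.00758
Authors: Chenlong Xu, Bineng Zhong, Qihua Liang, Yaozong Zheng, Guorong Li, Shuxiang Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025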
Abstract:Recently, several studies have shown that utilizing contextual information to perceive target states is crucial for object tracking. They typically capture context by incorporating multiple video frames. However, these naive frame-context methods fail to consider the importance of each patch within a reference frame, making them susceptible to noise and redundant tokens, which deteriorates tracking performance. To address this challenge, we propose a new token context-aware tracking pipeline named LMTrack, designed to automatically learn high-quality reference tokens for efficient visual tracking. Embracing the principle of Less is More, the core idea of LMTrack is to analyze the importance distribution of all reference tokens, where important tokens are collected, continually attended to, and updated. Specifically, a novel Token Context Memory module is designed to dynamically collect high-quality spatio-temporal information of a target in an autoregressive manner, eliminating redundant background tokens from the reference frames. Furthermore, an effective Unidirectional Token Attention mechanism is designed to establish dependencies between reference tokens and search frame, enabling robust cross-frame association and target localization. Extensive experiments demonstrate the superiority of our tracker, achieving state-of-the-art results on tracking benchmarks such as GOT-10K, TrackingNet, and LaSOT.
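The "collect important tokens, drop redundant background tokens" idea can be pictured as a bounded memory that keeps only the highest-scoring reference tokens as new frames arrive. The sketch below is a simplified illustration, not the paper's Token Context Memory module: the importance scores are given by hand rather than learned, and the top-k pruning rule and all names are assumptions.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, float]  # (token id, importance score); both illustrative


def select_tokens(tokens: Sequence[Token], keep: int) -> List[Token]:
    """Keep the `keep` highest-scoring tokens, preserving original order."""
    top = sorted(range(len(tokens)), key=lambda i: tokens[i][1], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def update_memory(
    memory: List[Token], new_tokens: Sequence[Token], capacity: int
) -> List[Token]:
    """Append the tokens of a new frame, then prune back to `capacity`."""
    return select_tokens(list(memory) + list(new_tokens), capacity)


# Target tokens ("t*") carry high scores, background tokens ("bg*") low ones.
frame1 = [("t0", 0.9), ("bg0", 0.1), ("t1", 0.7), ("bg1", 0.2)]
frame2 = [("t2", 0.8), ("bg2", 0.05)]

memory = select_tokens(frame1, keep=2)
memory = update_memory(memory, frame2, capacity=3)
kept_ids = [token_id for token_id, _ in memory]
```

A tracker built this way would attend from the search frame to the retained tokens only, so the cost of attention depends on the memory capacity rather than on the number of reference frames.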
[CV-88] Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation AAAI
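【Quick read】: This paper addresses few-shot segmentation (FSS), which segments target regions in unlabeled query images based on a few labeled support images. Unlike previous work, which typically estimates target regions in the query from support prototypes and query pixels, the paper exploits the relationship between support prototypes and query prototypes.
The key to the solution is the proposed Foreground-Covering Prototype Generation and Matching method. It combines two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Support and query prototypes are built from SAM features, and ResNet features are used to identify which query prototypes belong to target regions. To build the query prototypes, a conventional pseudo-mask first roughly guides the foreground regions within the SAM features, and iterative cross-attention then aggregates foreground features into learnable tokens. The authors find that the cross-attention weights can effectively replace the conventional pseudo-mask, so an attention-based pseudo-mask guides the ResNet features toward the foreground, and the guided ResNet features are infused into the learnable tokens to produce class-consistent query prototypes. Support prototypes are generated symmetrically, with the pseudo-mask replaced by the ground-truth mask. Finally, query prototypes are compared with support prototypes to generate prompts, which the SAM Mask Decoder turns into object masks. State-of-the-art results on multiple datasets validate the method.
Link: https://arxiv.org/abs/2501.00752
Authors: Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Association for the Advancement of Artificial Intelligence (AAAI) 2025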
Abstract:We propose Foreground-Covering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively alternate the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performances on various datasets validate the effectiveness of the proposed method for FSS. Our official code is available at this https URL
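The final matching step, comparing query prototypes with support prototypes to decide which ones become prompts, can be sketched with cosine similarity. This is only a plausible reading of that step: the abstract does not state the similarity measure or selection rule, so the cosine metric, the threshold, and the function names below are assumptions, and the prototype vectors are toy values.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_prototypes(
    query_protos: List[Vector], support_protos: List[Vector], threshold: float
) -> List[int]:
    """Indices of query prototypes whose best similarity to any support
    prototype reaches the threshold (assumed selection rule)."""
    selected = []
    for i, q in enumerate(query_protos):
        best = max(cosine(q, s) for s in support_protos)
        if best >= threshold:
            selected.append(i)
    return selected


support = [[1.0, 0.0], [0.8, 0.6]]             # foreground prototypes from the support image
query = [[2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # candidate prototypes from the query image
prompt_ids = match_prototypes(query, support, threshold=0.9)
```

Only the first query prototype points in nearly the same direction as a support prototype, so it alone would be forwarded as a prompt in this toy setting.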
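[CV-89] Towards End-to-End Neuromorphic Voxel-based 3D Object Reconstruction Without Physical Priors ICME2025
【Quick read】: This paper addresses the limitations of 3D reconstruction with monocular neuromorphic cameras. Existing methods usually rely on estimating physical priors and use complex multi-step pipelines, which limits efficiency and accuracy. The paper proposes an end-to-end method for dense voxel 3D reconstruction that removes the need to estimate physical priors. Its key element is a novel event representation that enhances edge features so that the proposed feature-enhancement model learns more effectively. The paper also introduces the Optimal Binarization Threshold Selection Principle as a guideline for future related work, using the best reconstruction results obtained through threshold optimization as the benchmark. The method improves reconstruction accuracy by 54.6% over the baseline method.
Link: https://arxiv.org/abs/2501.00741
Authors: Chuanzhi Xu, Langyi Chen, Vincent Qu, Haodong Chen, Vera Chung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 15 figures, 5 tables, submitted to ICME 2025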
Abstract:Neuromorphic cameras, also known as event cameras, are asynchronous brightness-change sensors that can capture extremely fast motion without suffering from motion blur, making them particularly promising for 3D reconstruction in extreme environments. However, existing research on 3D reconstruction using monocular neuromorphic cameras is limited, and most of the methods rely on estimating physical priors and employ complex multi-step pipelines. In this work, we propose an end-to-end method for dense voxel 3D reconstruction using neuromorphic cameras that eliminates the need to estimate physical priors. Our method incorporates a novel event representation to enhance edge features, enabling the proposed feature-enhancement model to learn more effectively. Additionally, we introduced Optimal Binarization Threshold Selection Principle as a guideline for future related work, using the optimal reconstruction results achieved with threshold optimization as the benchmark. Our method achieves a 54.6% improvement in reconstruction accuracy compared to the baseline method.
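Choosing a binarization threshold for a predicted occupancy grid is often done by sweeping candidate values and keeping the one with the best overlap against ground truth. The sketch below shows that generic sweep on a flattened toy grid. It is not the paper's Optimal Binarization Threshold Selection Principle, whose details the abstract does not give. The IoU criterion, the candidate list, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection over union of two binary voxel grids (flattened)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0


def best_threshold(
    probs: Sequence[float], target: Sequence[bool], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, IoU) for the candidate with the highest IoU;
    ties keep the earliest candidate."""
    best_t, best_score = candidates[0], -1.0
    for t in candidates:
        score = iou([p >= t for p in probs], target)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy occupancy probabilities for six voxels and their ground truth.
probs: List[float] = [0.05, 0.2, 0.45, 0.6, 0.9, 0.95]
target: List[bool] = [False, False, False, True, True, True]
thresholds = [0.1, 0.3, 0.5, 0.7]

t_star, iou_star = best_threshold(probs, target, thresholds)
```

In practice the sweep would be run on a validation set so that the selected threshold is not tuned on the data used for the reported score.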
[CV-90] RORem: Training a Robust Object Remover with Human-in-the-Loop
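【Quick read】: This paper addresses the low success rate of existing object removal methods, which suffer from incomplete removal, incorrect content synthesis, and blurry synthesized regions. These problems stem mainly from the lack of high-quality paired training data and from the self-supervised training paradigm, which forces the model to in-paint the masked regions and creates ambiguity between synthesizing the masked object and restoring the background. The paper proposes a semi-supervised learning strategy with human-in-the-loop feedback to create high-quality paired training data and train a Robust Object Remover (RORem). The authors first collect 60K training pairs from open-source datasets to train an initial removal model that generates removal samples, then use human feedback to select high-quality removal pairs and train a discriminator on them to automate subsequent training data generation. After several rounds of this process, they obtain an object removal dataset with over 200K pairs. Fine-tuning a pre-trained Stable Diffusion model on this dataset yields RORem, which shows state-of-the-art object removal performance in both reliability and image quality and raises the removal success rate over previous methods by more than 18%.
Link: https://arxiv.org/abs/2501.00740
Authors: Ruibin Li, Tao Yang, Song Guo, Lei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: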
Abstract:Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code and trained model are available at this https URL.
[CV-91] DDD: Discriminative Difficulty Distance for plant disease diagnosis AAAI
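【Quick read】: This paper addresses the overestimation of diagnostic performance in plant disease diagnosis that arises when training and test datasets come from the same source (domain). Plant disease diagnosis is a hard classification task because it is fine-grained, symptoms are vague, and image features vary widely within each domain. The paper proposes Discriminative Difficulty Distance (DDD), a metric that quantifies the domain gap between training and test datasets while assessing the classification difficulty of the test data. DDD helps identify insufficient diversity in training data and so supports building more diverse and robust datasets. The authors examine several image encoders trained on different datasets and test whether distances between datasets, measured on the encoders' low-dimensional representations, are suitable as a DDD metric, using 244,063 plant disease images covering four crops, 34 disease classes, and 27 domains. They find that even when the test images come from crops or diseases different from those used to train the encoder, incorporating them yields a dataset distance that correlates strongly with the diagnostic difficulty indicated by an independently developed disease classifier.
Link: https://arxiv.org/abs/2501.00734
Authors: Yuji Arima, Satoshi Kagiwada, Hitoshi Iyatomi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 3 tables. Accepted at 4th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)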
Abstract:Recent studies on plant disease diagnosis using machine learning (ML) have highlighted concerns about the overestimated diagnostic performance due to inappropriate data partitioning, where training and test datasets are derived from the same source (domain). Plant disease diagnosis presents a challenging classification task, characterized by its fine-grained nature, vague symptoms, and the extensive variability of image features within each domain. In this study, we propose the concept of Discriminative Difficulty Distance (DDD), a novel metric designed to quantify the domain gap between training and test datasets while assessing the classification difficulty of test data. DDD provides a valuable tool for identifying insufficient diversity in training data, thus supporting the development of more diverse and robust datasets. We investigated multiple image encoders trained on different datasets and examined whether the distances between datasets, measured using low-dimensional representations generated by the encoders, are suitable as a DDD metric. The study utilized 244,063 plant disease images spanning four crops and 34 disease classes collected from 27 domains. As a result, we demonstrated that even if the test images are from different crops or diseases than those used to train the encoder, incorporating them allows the construction of a distance measure for a dataset that strongly correlates with the difficulty of diagnosis indicated by the disease classifier developed independently. Compared to the base encoder, pre-trained only on ImageNet21K, the correlation higher by 0.106 to 0.485, reaching a maximum of 0.909.
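The evaluation idea, measuring a distance between training and test datasets in an encoder's embedding space and checking how well it tracks diagnostic difficulty, can be sketched as follows. Everything concrete here is an assumption for illustration: the abstract does not specify the distance, so a Euclidean distance between embedding centroids is used, and the embeddings and error rates are made-up toy numbers, not results from the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a list of embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def centroid_distance(train_emb: List[Vector], test_emb: List[Vector]) -> float:
    """Euclidean distance between dataset centroids (assumed DDD stand-in)."""
    a, b = centroid(train_emb), centroid(test_emb)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy 2-D embeddings; real encoders would produce higher-dimensional vectors.
train = [[0.0, 0.0], [2.0, 0.0]]
test_domains: Dict[str, List[Vector]] = {
    "near": [[1.0, 1.0], [1.0, 1.0]],
    "mid": [[1.0, 2.0], [1.0, 2.0]],
    "far": [[4.0, 4.0], [4.0, 4.0]],
}
# Made-up classifier error rates per test domain (illustrative only).
error_rate = {"near": 0.1, "mid": 0.2, "far": 0.5}

distances = {name: centroid_distance(train, emb) for name, emb in test_domains.items()}
names = list(test_domains)
r = pearson([distances[n] for n in names], [error_rate[n] for n in names])
```

A high correlation between distance and error rate is what would make such a distance useful as a difficulty indicator. The abstract reports that correlation for its own metric, whereas the value computed here only reflects the toy numbers.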
[CV-92] Automatic Construction of Pattern Classifiers Capable of Continuous Incremental Learning and Unlearning Tasks Based on Compact-Sized Probabilistic Neural Network
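[Quick read]: This paper addresses the structural complexity and difficult parameter tuning of the conventional Probabilistic Neural Network (PNN) in pattern classification. It proposes an approach based on a compact-sized PNN capable of continuous incremental learning and unlearning. The key is a simple one-pass network-growing algorithm that needs no hyperparameter tuning: the network structure and parameters are determined automatically from the training dataset and can be adjusted dynamically in continual incremental and decremental learning settings. The algorithm avoids iterative or matrix-based parameter approximation and relies on a simple data-driven updating scheme instead. Experiments show that the resulting compact PNN achieves classification performance similar to a Multilayer Perceptron (MLP) on standard classification tasks while performing well in continuous class-incremental learning and unlearning tasks.
Link: https://arxiv.org/abs/2501.00725
Authors: Tetsuya Hoya,Shunpei Morita
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures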
Abstract:This paper proposes a novel approach to pattern classification using a probabilistic neural network model. The strategy is based on a compact-sized probabilistic neural network capable of continuous incremental learning and unlearning tasks. The network is constructed/reconstructed using a simple, one-pass network-growing algorithm with no hyperparameter tuning. Then, given the training dataset, its structure and parameters are automatically determined and can be dynamically varied in continual incremental and decremental learning situations. The algorithm proposed in this work involves no iterative or arduous matrix-based parameter approximations but a simple data-driven updating scheme. Simulation results using nine publicly available databases demonstrate the effectiveness of this approach, showing that compact-sized probabilistic neural networks constructed have a much smaller number of hidden units compared to the original probabilistic neural network model and yet can achieve a similar classification performance to that of multilayer perceptron neural networks in standard classification tasks, while also exhibiting sufficient capability in continuous class incremental learning and unlearning tasks.
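To make the idea of one-pass growth with class-level unlearning concrete, here is a minimal sketch. It is not the authors' algorithm: the growth rule (store a sample only when it is misclassified), the fixed kernel width `sigma`, and the class name `TinyPNN` are all assumptions chosen for illustration.

```python
import math

class TinyPNN:
    """Toy probabilistic neural network: one list of RBF centres per class."""

    def __init__(self, sigma=1.0):
        self.sigma = sigma   # assumed fixed kernel width (illustrative only)
        self.centres = {}    # class label -> list of centre vectors

    def _score(self, x, label):
        # Sum of Gaussian kernel activations over the centres of one class.
        return sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * self.sigma ** 2))
            for c in self.centres[label]
        )

    def predict(self, x):
        if not self.centres:
            return None
        return max(self.centres, key=lambda label: self._score(x, label))

    def learn(self, x, label):
        """One pass: store x only if its class is new or x is misclassified."""
        if label not in self.centres:
            self.centres[label] = [list(x)]
        elif self.predict(x) != label:
            self.centres[label].append(list(x))

    def unlearn(self, label):
        """Class-level unlearning: drop every centre of that class."""
        self.centres.pop(label, None)

net = TinyPNN(sigma=0.5)
for x, y in [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((3.0, 3.0), "b")]:
    net.learn(x, y)
before = net.predict((2.9, 3.1))   # class "b" is known at this point
net.learn((6.0, 0.0), "c")         # class-incremental step
net.unlearn("b")                   # class-decremental step
after = net.predict((2.9, 3.1))    # falls back to the nearest remaining class
```

Because correctly classified samples are not stored, the number of hidden units stays small, which is the property the abstract emphasizes.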
[CV-93] Everywhere Attack: Attacking Locally and Globally to Boost Targeted Transferability AAAI
[Quick read]: This paper addresses the transferability of adversarial examples (AEs) in targeted attacks. Despite remarkable progress on untargeted transferability, targeted transferability remains challenging. The paper proposes a scheme that attacks the image both globally and locally to boost targeted transferability. The key idea is to split the victim image into non-overlapping blocks and mount a targeted attack on every block jointly, rather than optimizing a single high-confidence target in the image as in previous work. This strategy mitigates transfer failures caused by attention inconsistency between the surrogate model and the victim model, which strengthens transferability. The approach is method-agnostic and can easily be combined with existing transferable attacks for further gains. Experiments show that it significantly improves the transferability of existing targeted attacks, and its superiority is further confirmed on a real-world platform, Google Cloud Vision.
Link: https://arxiv.org/abs/2501.00707
Authors: Hui Zeng,Sanshuai Cui,Biwei Chen,Anjie Peng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 11 pages, 6 figures, 8 tables, accepted by AAAI 2025
Abstract:Adversarial examples’ (AE) transferability refers to the phenomenon that AEs crafted with one surrogate model can also fool other models. Notwithstanding remarkable progress in untargeted transferability, its targeted counterpart remains challenging. This paper proposes an everywhere scheme to boost targeted transferability. Our idea is to attack a victim image both globally and locally. We aim to optimize ‘an army of targets’ in every local image region instead of the previous works that optimize a high-confidence target in the image. Specifically, we split a victim image into non-overlapping blocks and jointly mount a targeted attack on each block. Such a strategy mitigates transfer failures caused by attention inconsistency between surrogate and victim models and thus results in stronger transferability. Our approach is method-agnostic, which means it can be easily combined with existing transferable attacks for even higher transferability. Extensive experiments on ImageNet demonstrate that the proposed approach universally improves the state-of-the-art targeted attacks by a clear margin, e.g., the transferability of the widely adopted Logit attack can be improved by 28.8%-300%. We also evaluate the crafted AEs on a real-world platform: Google Cloud Vision. Results further support the superiority of the proposed method.
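The block-splitting step can be sketched as follows. This only illustrates the partitioning and the averaging of a per-block loss; `target_loss` is a hypothetical callable standing in for a surrogate model's targeted loss, and the grid size is an assumed parameter.

```python
def split_into_blocks(image, rows, cols):
    """Split a 2D image (list of pixel rows) into rows x cols non-overlapping blocks."""
    h, w = len(image), len(image[0])
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the grid size")
    bh, bw = h // rows, w // cols
    return [
        [row[c * bw:(c + 1) * bw] for row in image[r * bh:(r + 1) * bh]]
        for r in range(rows)
        for c in range(cols)
    ]

def joint_block_loss(image, target_loss, rows=2, cols=2):
    """Average a (hypothetical) targeted loss over every local block."""
    blocks = split_into_blocks(image, rows, cols)
    return sum(target_loss(block) for block in blocks) / len(blocks)

# Toy 4x4 "image" with pixel values 0..15 and a stand-in loss (sum of pixels).
image = [[4 * r + c for c in range(4)] for r in range(4)]
blocks = split_into_blocks(image, 2, 2)
loss = joint_block_loss(image, lambda b: sum(sum(row) for row in b))
```

In an actual attack, the averaged loss would be differentiated with respect to the perturbation so that every block is pushed toward the target class at once.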
[CV-94] Knowledge-Guided Prompt Learning for Deepfake Facial Image Detection ICASSP2025
[Quick read]: This paper tackles two main problems in deepfake facial image detection: existing methods rarely explore prior knowledge, and there is a domain shift between training categories (e.g., natural and indoor objects) and testing categories (e.g., fine-grained human facial images). To address them, the paper proposes a novel knowledge-guided prompt learning method. The key is to retrieve forgery-related prompts from large language models and use them as expert knowledge to guide the optimization of learnable prompts. The paper also develops a test-time prompt tuning strategy to alleviate the domain shift, which significantly improves detection performance and makes the method more applicable to real-world scenarios. Experiments show that the method notably outperforms state-of-the-art methods on the DeepFakeFaceForensics dataset.
Link: https://arxiv.org/abs/2501.00700
Authors: Hao Wang,Cheng Deng,Zhidong Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025
Abstract:Recent generative models demonstrate impressive performance on synthesizing photographic images, which makes it hard for humans to distinguish them from pristine ones, especially for realistic-looking synthetic facial images. Previous works mostly focus on mining discriminative artifacts from vast amounts of visual data. However, they usually lack the exploration of prior knowledge and rarely pay attention to the domain shift between training categories (e.g., natural and indoor objects) and testing ones (e.g., fine-grained human facial images), resulting in unsatisfactory detection performance. To address these issues, we propose a novel knowledge-guided prompt learning method for deepfake facial image detection. Specifically, we retrieve forgery-related prompts from large language models as expert knowledge to guide the optimization of learnable prompts. Besides, we elaborate test-time prompt tuning to alleviate the domain shift, achieving significant performance improvement and boosting the application in real-world scenarios. Extensive experiments on the DeepFakeFaceForensics dataset show that our proposed approach notably outperforms state-of-the-art methods.
[CV-95] Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery
[Quick read]: This paper addresses large-scale Robust Matrix Completion (RMC), i.e., handling missing data entries and extreme outliers simultaneously in low-rank data analysis. It proposes a novel scalable and learnable non-convex approach called Learned Robust Matrix Completion (LRMC), which has low computational complexity and linear convergence. The key is that the free parameters of LRMC are learned effectively via deep unfolding to achieve optimal performance. The paper also proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinitely many iterations. Extensive experiments show that LRMC outperforms the state of the art on synthetic datasets and in real applications such as video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.
Link: https://arxiv.org/abs/2501.00677
Authors: HanQin Cai,Chandra Kundu,Jialin Liu,Wotao Yin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: arXiv admin note: substantial text overlap with arXiv:2110.05649
Abstract:Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from a fixed number of iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against the state of the art on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.
[CV-96] Leaf diseases detection using deep learning methods
[Quick read]: This paper addresses the challenges of identifying and detecting plant leaf diseases and proposes a new deep learning approach. The main problems of current leaf disease detection methods are insufficient accuracy and low efficiency. The paper develops a new convolutional neural network (CNN) model, combined with hyperparameter optimization, and proposes an efficient network architecture that improves both the accuracy and the speed of disease detection. The key is that several network architectures are designed and evaluated, leading to a new model that outperforms existing pre-trained models, with its effectiveness validated experimentally.
Link: https://arxiv.org/abs/2501.00669
Authors: El Houcine El Fatimi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 252 pages, 42 images
Abstract:In this study, our main goal is to develop a new deep-learning approach for plant leaf disease identification and detection using leaf image datasets. We also discuss the challenges facing current methods of leaf disease detection and how deep learning may be used to overcome these challenges and enhance the accuracy of disease detection. We therefore propose a novel method for the detection of various leaf diseases in crops, along with the identification and description of an efficient network architecture that encompasses hyperparameters and optimization methods. The effectiveness of different architectures was compared and evaluated to find the best architecture configuration and to create an effective model that can quickly detect leaf disease. In addition to the work done on pre-trained models, we propose a new model based on CNN, which provides an efficient method for identifying and detecting plant leaf disease. Furthermore, we evaluated the efficacy of our model and compared the results to those of some pre-trained state-of-the-art architectures.
[CV-97] Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models
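[Quick read]: This paper addresses the challenges of training existing 3D generative models, in particular how to use pre-trained feed-forward image-to-3D reconstruction models as latent encoders to improve 3D generative models. The key components are: 1) reusing pre-trained reconstruction models, which avoids computationally expensive encoder training and yields rich 3D latent features for free; 2) post-processing pipelines that standardize the features and apply spatial weighting to focus on important regions, improving the structure of the latent space; 3) a perceptual rendering loss in 2D image space to handle the high-dimensional latent space; 4) a multi-stream transformer-based rectified flow architecture that achieves linear scaling and high-quality text-conditioned 3D generation. Together, these combine the strengths of feed-forward reconstruction models with 3D generative models and significantly improve the scalability and performance of text-to-3D generation.
Link: https://arxiv.org/abs/2501.00651
Authors: Suttisak Wizadwongsa,Jinfan Zhou,Edward Li,Jeong Joon Park
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: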
Abstract:Recent AI-based 3D content creation has largely evolved along two paths: feed-forward image-to-3D reconstruction approaches and 3D generative models trained with 2D or 3D supervision. In this work, we show that existing feed-forward reconstruction methods can serve as effective latent encoders for training 3D generative models, thereby bridging these two paradigms. By reusing powerful pre-trained reconstruction models, we avoid computationally expensive encoder network training and obtain rich 3D latent features for generative modeling for free. However, the latent spaces of reconstruction models are not well-suited for generative modeling due to their unstructured nature. To enable flow-based model training on these latent features, we develop post-processing pipelines, including protocols to standardize the features and spatial weighting to concentrate on important regions. We further incorporate a 2D image space perceptual rendering loss to handle the high-dimensional latent spaces. Finally, we propose a multi-stream transformer-based rectified flow architecture to achieve linear scaling and high-quality text-conditioned 3D generation. Our framework leverages the advancements of feed-forward reconstruction models to enhance the scalability of 3D generative modeling, achieving both high computational efficiency and state-of-the-art performance in text-to-3D generation.
[CV-98] SoundBrush: Sound as a Brush for Visual Scene Editing AAAI2025
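[Quick read]: This paper addresses how to use sound as an editing tool for manipulating visual scenes. The key is to extend the generative capabilities of the Latent Diffusion Model (LDM) so that it can incorporate audio information when editing visual scenes. By constructing a sound-paired visual scene dataset and using existing image-editing models for supervised learning, SoundBrush learns to map audio features into the textual space of the LDM, enabling visual scene editing guided by diverse sounds. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery and can also insert sounding objects that best match the audio input, all while preserving the original content. By integrating with novel view synthesis techniques, the framework can further be extended to 3D scenes for sound-driven 3D scene manipulation.
Link: https://arxiv.org/abs/2501.00645
Authors: Kim Sung-Bin,Kim Jun-Seong,Junseok Ko,Yewon Kim,Tae-Hyun Oh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: AAAI 2025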
Abstract:We propose SoundBrush, a model that uses sound as a brush to edit and manipulate visual scenes. We extend the generative capabilities of the Latent Diffusion Model (LDM) to incorporate audio information for editing visual scenes. Inspired by existing image-editing works, we frame this task as a supervised learning problem and leverage various off-the-shelf models to construct a sound-paired visual scene dataset for training. This richly generated dataset enables SoundBrush to learn to map audio features into the textual space of the LDM, allowing for visual scene editing guided by diverse in-the-wild sound. Unlike existing methods, SoundBrush can accurately manipulate the overall scenery or even insert sounding objects to best match the audio inputs while preserving the original content. Furthermore, by integrating with novel view synthesis techniques, our framework can be extended to edit 3D scenes, facilitating sound-driven 3D scene manipulation. Demos are available at this https URL.
[CV-99] Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation
[Quick read]: This paper addresses the complex reflections produced by transparent surfaces such as glass, which obscure images and challenge downstream computer vision applications. It proposes Flash-Split, a robust framework that separates transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. The key is to perform reflection separation in latent space while leveraging flash cues. Flash-Split has two stages: Stage 1 separates the reflection latent from the transmission latent with a dual-branch diffusion model conditioned on an encoded flash/no-flash latent pair, which effectively mitigates flash/no-flash misalignment; Stage 2 restores high-resolution, faithful details through a cross-latent decoding process conditioned on the original images before separation. Experiments show that Flash-Split performs well on real-world scenes, significantly outperforms baseline methods, and achieves state-of-the-art reflection separation.
Link: https://arxiv.org/abs/2501.00637
Authors: Tianfu Wang,Mingyang Xie,Haoming Cai,Sachin Shah,Christopher A. Metzler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Transparent surfaces, such as glass, create complex reflections that obscure images and challenge downstream computer vision applications. We introduce Flash-Split, a robust framework for separating transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. Our core idea is to perform latent-space reflection separation while leveraging the flash cues. Specifically, Flash-Split consists of two stages. Stage 1 separates the reflection latent and transmission latent via a dual-branch diffusion model conditioned on an encoded flash/no-flash latent pair, effectively mitigating the flash/no-flash misalignment issue. Stage 2 restores high-resolution, faithful details to the separated latents, via a cross-latent decoding process conditioned on the original images before separation. By validating Flash-Split on challenging real-world scenes, we demonstrate state-of-the-art reflection separation performance and significantly outperform the baseline methods.
[CV-100] Applying Graph Explanation to Operator Fusion
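[Quick read]: This paper addresses layer fusion for optimizing the inference efficiency of deep neural networks (DNNs). Layer fusion combines multiple operations (such as convolutions and activation functions) into a single execution unit (a fusion group) to reduce data transfers between an accelerator's on-chip buffer and DRAM, lowering inference cost. However, on-chip buffer capacity limits the size of a fusion group, and optimizing fusion over a whole DNN requires partitioning the network into multiple fusion groups. Finding the optimal groups is a complex problem: invalid solutions hamper traditional search algorithms, so more robust approaches are needed.
The key to the solution is to bring Graph Explanation Techniques (GET) from Explainable AI (XAI) into the layer fusion process. Given an invalid fusion group, GET identifies the operations most responsible for the invalidity, and this information is used to recursively split the original fusion group with a greedy tree-based algorithm so as to minimize DRAM access. The scheme is paired with common optimization algorithms and applied to two layer fusion strategies: Line-Buffer Depth First (LBDF) and Branch Requirement Reduction (BRR). Experiments show that the scheme is effective on classical convolutional neural networks such as ResNets and MobileNets, and achieves over 20% DRAM access reduction on EfficientNet-B3.
Link: https://arxiv.org/abs/2501.00636
Authors: Keith G. Mills,Muhammad Fetrat Qharabagh,Weichen Qiu,Fred X. Han,Mohammad Salameh,Wei Lu,Shangling Jui,Di Niu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: DAC’23 WIP Poster; 8 pages, 5 Figures, 5 Tables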
Abstract:Layer fusion techniques are critical to improving the inference efficiency of deep neural networks (DNN) for deployment. Fusion aims to lower inference costs by reducing data transactions between an accelerator’s on-chip buffer and DRAM. This is accomplished by grouped execution of multiple operations like convolution and activations together into single execution units - fusion groups. However, on-chip buffer capacity limits fusion group size and optimizing fusion on whole DNNs requires partitioning into multiple fusion groups. Finding the optimal groups is a complex problem where the presence of invalid solutions hampers traditional search algorithms and demands robust approaches. In this paper we incorporate Explainable AI, specifically Graph Explanation Techniques (GET), into layer fusion. Given an invalid fusion group, we identify the operations most responsible for group invalidity, then use this knowledge to recursively split the original fusion group via a greedy tree-based algorithm to minimize DRAM access. We pair our scheme with common algorithms and optimize DNNs on two types of layer fusion: Line-Buffer Depth First (LBDF) and Branch Requirement Reduction (BRR). Experiments demonstrate the efficacy of our scheme on several popular and classical convolutional neural networks like ResNets and MobileNets. Our scheme achieves over 20% DRAM Access reduction on EfficientNet-B3.
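The recursive splitting of an invalid fusion group can be sketched on a toy chain of operations. Everything below is an assumption for illustration: validity is modelled as a simple buffer-capacity check, and `blame` (here, the op with the largest buffer cost) is a crude stand-in for the graph explanation step, whose actual form the abstract does not describe.

```python
def split_fusion_group(ops, is_valid, blame):
    """Recursively split a chain of ops until every fusion group is valid."""
    if len(ops) <= 1 or is_valid(ops):
        return [ops]
    # Cut just before the blamed op; clamp so the left half is never empty.
    cut = max(1, blame(ops))
    return (split_fusion_group(ops[:cut], is_valid, blame)
            + split_fusion_group(ops[cut:], is_valid, blame))

CAPACITY_KB = 100  # hypothetical on-chip buffer capacity

def fits_buffer(ops):
    """Toy validity rule: summed buffer cost must fit the on-chip buffer."""
    return sum(cost for _, cost in ops) <= CAPACITY_KB

def largest_cost_index(ops):
    """Toy 'explanation': blame the op with the largest buffer cost."""
    return max(range(len(ops)), key=lambda i: ops[i][1])

# Hypothetical chain of (op name, buffer cost in KB).
chain = [("conv1", 40), ("relu1", 10), ("conv2", 70), ("relu2", 10), ("conv3", 30)]
groups = split_fusion_group(chain, fits_buffer, largest_cost_index)
group_names = [[name for name, _ in g] for g in groups]
```

A single op that is invalid on its own is returned as a one-element group, since it cannot be split further.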
[CV-101] Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting
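[Quick read]: This paper addresses extracting a building's 3D mesh from multi-view 2D images. The key is to combine several recent techniques. First, open-source pre-trained foundation models for image segmentation and object detection (SAM2+GroundingDINO) provide geometrically consistent segmentation of objects of interest from text or click prompts, without labeled training datasets. Second, 2D Gaussian Splatting learns a 3D representation of the scene's geometry and radiance from 2D images. Finally, these are combined with Google Earth Studio and with mask refinement based on morphological operations and contour simplification to build a complete pipeline that extracts the 3D mesh of a building from its name, address, or geographic coordinates.
Link: https://arxiv.org/abs/2501.00625
Authors: Kyle Gao,Liangzhi Li,Hongjie He,Dening Lu,Linlin Xu,Jonathan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: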
Abstract:Recently released open-source pre-trained foundational image segmentation and object detection models (SAM2+GroundingDINO) allow for geometrically consistent segmentation of objects of interest in multi-view 2D images. Users can use text-based or click-based prompts to segment objects of interest without requiring labeled training datasets. Gaussian Splatting allows for the learning of the 3D representation of a scene’s geometry and radiance based on 2D images. Combining Google Earth Studio, SAM2+GroundingDINO, 2D Gaussian Splatting, and our improvements in mask refinement based on morphological operations and contour simplification, we created a pipeline to extract the 3D mesh of any building based on its name, address, or geographic coordinates.
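As an illustration of what mask refinement with morphological operations can look like, the sketch below applies a binary opening (erosion followed by dilation) with a 3x3 neighbourhood to remove small specks from a segmentation mask. The choice of opening and of the 3x3 window is an assumption; the abstract does not specify which operations the authors use.

```python
def _morph(mask, keep):
    """Apply a 3x3 neighbourhood rule to a binary mask (windows are clipped at borders)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = 1 if keep(window) else 0
    return out

def erode(mask):
    """A pixel survives only if its whole neighbourhood is foreground."""
    return _morph(mask, all)

def dilate(mask):
    """A pixel is set if any neighbour is foreground."""
    return _morph(mask, any)

def opening(mask):
    """Erosion then dilation: removes specks smaller than the window."""
    return dilate(erode(mask))

# Toy mask: a solid 3x3 building footprint plus one stray pixel (top right).
mask = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
refined = opening(mask)
```

In practice an image library would be used for this step; the pure-Python version only shows the logic.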
[CV-102] A Study on Context Length and Efficient Transformers for Biomedical Image Analysis ML4H2024
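[Quick read]: This paper addresses the computational challenges posed by high-resolution, multi-dimensional images in biomedical image analysis, in particular the quadratic growth of self-attention cost with context length when using Transformers. The key is to evaluate and apply recently proposed long-context models to make Transformers more efficient on large biomedical images. By varying patch size and attention window size in the Vision Transformer and Swin Transformer, the study systematically analyzes how context length affects network performance, especially on pixel-level prediction tasks. The results show that long-context models significantly improve computational efficiency while maintaining comparable performance, although gaps remain in some areas.
Link: https://arxiv.org/abs/2501.00619
Authors: Sarah M. Hooper,Hui Xue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at ML4H 2024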
Abstract:Biomedical imaging modalities often produce high-resolution, multi-dimensional images that pose computational challenges for deep neural networks. These computational challenges are compounded when training transformers due to the self-attention operator, which scales quadratically with context length. Recent developments in long-context models have potential to alleviate these difficulties and enable more efficient application of transformers to large biomedical images, although a systematic evaluation on this topic is lacking. In this study, we investigate the impact of context length on biomedical image analysis and we evaluate the performance of recently proposed long-context models. We first curate a suite of biomedical imaging datasets, including 2D and 3D data for segmentation, denoising, and classification tasks. We then analyze the impact of context length on network performance using the Vision Transformer and Swin Transformer by varying patch size and attention window size. Our findings reveal a strong relationship between context length and performance, particularly for pixel-level prediction tasks. Finally, we show that recent long-context models demonstrate significant improvements in efficiency while maintaining comparable performance, though we highlight where gaps remain. This work underscores the potential and challenges of using long-context models in biomedical imaging.
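The relationship between patch size, context length, and attention cost can be worked through with simple arithmetic. The sketch below counts tokens for a given image and patch size and the number of token pairs that self-attention scores; the image sizes and the 49-token window are illustrative values, not figures from the paper.

```python
def context_length(image_shape, patch_size):
    """Number of tokens when an image is cut into non-overlapping patches."""
    n = 1
    for dim, p in zip(image_shape, patch_size):
        if dim % p:
            raise ValueError("each image dimension must be divisible by the patch size")
        n *= dim // p
    return n

def attention_pairs(n_tokens, window=None):
    """Token pairs scored by self-attention: n^2 globally, n*w with windows of w tokens."""
    if window is None:
        return n_tokens ** 2
    return n_tokens * min(window, n_tokens)

coarse = context_length((256, 256), (16, 16))   # 16x16 patches on a 256x256 image
fine = context_length((256, 256), (8, 8))       # halving the patch size
volume = context_length((64, 64, 64), (8, 8, 8))  # a small 3D volume

global_cost_ratio = attention_pairs(fine) / attention_pairs(coarse)
windowed = attention_pairs(fine, window=49)     # hypothetical 7x7 local windows
```

Halving the patch size in 2D quadruples the token count and multiplies the global attention cost by 16, which is why patch size and window size are the natural knobs to vary in such a study.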
[CV-103] DiC: Rethinking Conv3x3 Designs in Diffusion Models
[Quick read]: This paper addresses the slow inference of diffusion models in visual generation. Current diffusion models mostly rely on Transformer-based isotropic architectures; these scale and perform well, but their complicated self-attention operation makes inference slow. The paper therefore revisits one of the simplest and fastest modules in deep learning, the 3x3 convolution, and proposes a purely convolutional diffusion model, Diffusion CNN (DiC). The key components are: 1) an Encoder-Decoder Hourglass design, which is found to outperform scalable isotropic architectures for convolutional models; 2) sparse skip connections that reduce redundancy and improve scalability; 3) conditioning improvements, including stage-specific embeddings, mid-block condition injection, and conditional gating. Experiments show that DiC surpasses existing diffusion transformers by considerable margins in performance while keeping a good speed advantage.
Link: https://arxiv.org/abs/2501.00603
Authors: Yuchuan Tian,Jing Han,Chengcheng Wang,Yuchen Liang,Chao Xu,Hanting Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures
Abstract:Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operation results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still falls short of our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: this https URL
[CV-104] STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
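[Quick read]: This paper addresses reconstructing dynamic outdoor scenes from sparse observations. Existing methods typically rely on per-scene optimization, dense observations across space and time, and strong motion supervision, which leads to lengthy optimization, limited generalization to novel views or scenes, and degraded quality caused by noisy pseudo-labels for dynamics. STORM instead uses a data-driven Transformer architecture that directly infers a dynamic 3D scene representation (parameterized by 3D Gaussians and their velocities) in a single forward pass. The key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows and transform them to the target timestep, enabling complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Experiments show that STORM reconstructs dynamic regions more accurately than existing methods and supports real-time rendering and fast reconstruction of large-scale outdoor scenes.
Link: https://arxiv.org/abs/2501.00602
Authors: Jiawei Yang,Jiahui Huang,Yuxiao Chen,Yan Wang,Boyi Li,Yurong You,Apoorva Sharma,Maximilian Igl,Peter Karkus,Danfei Xu,Boris Ivanovic,Yue Wang,Marco Pavone
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page at: this https URL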
Abstract:We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations–parameterized by 3D Gaussians and their velocities–in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., “amodal”) reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.
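The aggregation step, moving Gaussians from every frame to a common target timestep using their velocities, can be sketched as below. The constant-velocity motion model, the data layout, and the function name are assumptions for illustration; only the Gaussian means are moved, and other Gaussian attributes are omitted.

```python
def warp_to_timestep(frames, t_target):
    """Move Gaussian means from every source frame to the target time and merge them.

    frames: list of (t_src, gaussians), where each gaussian is (mean, velocity)
    and both mean and velocity are 3-tuples. Assumes constant velocity.
    """
    merged = []
    for t_src, gaussians in frames:
        dt = t_target - t_src
        for mean, velocity in gaussians:
            merged.append(tuple(m + v * dt for m, v in zip(mean, velocity)))
    return merged

# Toy scene: a moving Gaussian seen in two frames and a static one seen once.
frames = [
    (0.0, [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),    # moving along x at 1 unit/s
           ((5.0, 5.0, 0.0), (0.0, 0.0, 0.0))]),  # static background
    (1.0, [((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))]),  # same mover, one second later
]
merged = warp_to_timestep(frames, t_target=2.0)
```

Both observations of the moving Gaussian land on the same position at the target time, which is what lets observations from different frames fill in parts of the scene that any single frame misses.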
[CV-105] DreamDrive: Generative 4D Scene Modeling from Street View Images
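[Quick read]: This paper addresses synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory, in support of scalable training of self-driving models. Reconstruction-based methods can produce geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits generalization to in-the-wild driving scenarios. Generative models can synthesize action-conditioned driving videos in a more generalizable way, but struggle to maintain 3D visual consistency. The proposed DreamDrive combines the merits of generation and reconstruction in a 4D spatial-temporal scene generation approach: it uses the generative power of video diffusion models to synthesize a sequence of visual references, lifts them to a 4D scene with a hybrid Gaussian representation, and then renders 3D-consistent driving videos via Gaussian splatting. Generative priors let the method produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from those scenes. Experiments show that DreamDrive generates controllable and generalizable 4D driving scenes, synthesizes novel views of driving videos with high fidelity and 3D consistency, decomposes static and dynamic elements in a self-supervised manner, and enhances perception and planning tasks for autonomous driving.
Link: https://arxiv.org/abs/2501.00601
Authors: Jiageng Mao,Boyi Li,Boris Ivanovic,Yuxiao Chen,Yan Wang,Yurong You,Chaowei Xiao,Danfei Xu,Marco Pavone,Yue Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: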
Abstract:Synthesizing photo-realistic visual observations from an ego vehicle’s driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
[CV-106] VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
[Quick read]: This paper addresses the weakness of Video Large Language Models (Video LLMs) in fine-grained spatial-temporal video understanding, in particular their difficulty capturing fine spatial and temporal details. The lack of high-quality object-level video instruction data and of a comprehensive benchmark further hinders progress. The paper proposes the VideoRefer Suite, which strengthens fine-grained spatial-temporal understanding across three aspects: dataset, model, and benchmark. First, a multi-agent data engine is used to curate VideoRefer-700K, a large-scale, high-quality object-level video instruction dataset. Second, the VideoRefer model is equipped with a versatile spatial-temporal object encoder that captures precise regional and sequential representations. Finally, VideoRefer-Bench comprehensively assesses the spatial-temporal understanding capability of Video LLMs. Experiments and analyses show that the VideoRefer model not only performs well on video referring benchmarks but also improves general video understanding.
Link: https://arxiv.org/abs/2501.00599
Authors: Yuqian Yuan,Hang Zhang,Wentong Li,Zesen Cheng,Boqiang Zhang,Long Li,Xin Li,Deli Zhao,Wenqiao Zhang,Yueting Zhuang,Jianke Zhu,Lidong Bing
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures, technical report
Abstract:Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.
[CV-107] Sidewalk Hazard Detection Using Variational Autoencoder and One-Class SVM
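[Quick read]: This paper addresses the detection of sidewalk hazards in outdoor environments, specifically how to identify anomalies that may threaten pedestrians in uncertain outdoor settings. The key is a hybrid approach that combines a Variational Autoencoder (VAE) with a One-Class Support Vector Machine (OCSVM). The VAE detects anomalies in a frame through its reconstruction mechanism, with poor reconstruction indicating an anomaly; the OCSVM then confirms whether the anomaly is a genuine hazard. The VAE reaches an AUC of 0.94 in distinguishing potentially hazardous anomalies, and the overall approach achieves an accuracy of 91.4%, with the OCSVM reducing false alarms from non-hazardous anomalies such as manhole or water valve covers. The system offers a reliable solution for hazard detection in uncertain environments.
Link: https://arxiv.org/abs/2501.00585
Authors: Edgar Guzman,Robert D. Howe
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages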
Abstract:The unpredictable nature of outdoor settings introduces numerous safety concerns, making hazard detection crucial for safe navigation. This paper introduces a novel system for sidewalk safety navigation utilizing a hybrid approach that combines a Variational Autoencoder (VAE) with a One-Class Support Vector Machine (OCSVM). The system is designed to detect anomalies on sidewalks that could potentially pose walking hazards. A dataset comprising over 15,000 training frames and 5,000 testing frames was collected using video recordings, capturing various sidewalk scenarios, including normal and hazardous conditions. During deployment, the VAE utilizes its reconstruction mechanism to detect anomalies within a frame. Poor reconstruction by the VAE implies the presence of an anomaly, after which the OCSVM is used to confirm whether the anomaly is hazardous or non-hazardous. The proposed VAE model demonstrated strong performance, with a high Area Under the Curve (AUC) of 0.94, effectively distinguishing anomalies that could be potential hazards. The OCSVM is employed to reduce the detection of false hazard anomalies, such as manhole or water valve covers. This approach achieves an accuracy of 91.4%, providing a highly reliable system for distinguishing between hazardous and non-hazardous scenarios. These results suggest that the proposed system offers a robust solution for hazard detection in uncertain environments.
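The two-stage decision described in the abstract can be sketched as plain control flow. The `reconstruct` and `is_hazard` callables are hypothetical stand-ins for a trained VAE and a trained one-class SVM, and the threshold and toy models below are invented for illustration.

```python
def classify_frame(frame, reconstruct, is_hazard, threshold):
    """Stage 1: flag an anomaly by reconstruction error. Stage 2: confirm the hazard."""
    recon = reconstruct(frame)
    error = sum((a - b) ** 2 for a, b in zip(frame, recon)) / len(frame)
    if error <= threshold:
        return "normal"          # well reconstructed: no anomaly
    return "hazard" if is_hazard(frame) else "benign anomaly"

# Toy stand-ins (assumptions): the "VAE" can only reproduce values up to 1.0,
# and the "OCSVM" treats very large values as hazardous.
def toy_reconstruct(frame):
    return [min(v, 1.0) for v in frame]

def toy_is_hazard(frame):
    return max(frame) > 3.0

labels = [
    classify_frame(f, toy_reconstruct, toy_is_hazard, threshold=0.1)
    for f in ([0.2, 0.5], [0.2, 2.0], [0.2, 5.0])
]
```

Running the second stage only on frames that fail reconstruction is what filters out benign anomalies, such as the manhole covers mentioned in the abstract, from the final hazard alerts.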
[CV-108] Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method
[Quick read]: This paper addresses the challenges of applying Multimodal Large Language Models (MLLMs) to real-time online video streams, as required in practical scenarios such as autonomous driving and human-computer interaction. It offers a systematic solution from three perspectives: evaluation benchmark, model architecture, and training strategy. First, it introduces OVBench, a question-answering benchmark designed to evaluate a model's ability to perceive, memorize, and reason in online video contexts. Second, it proposes a Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Finally, it designs an offline-to-online learning paradigm and constructs an instruction-tuning dataset suited to online video training, leading to the VideoChat-Online model. Despite lower computational cost and higher efficiency, the model outperforms existing offline and online models, validating the proposed architecture and training strategy.
Link: https://arxiv.org/abs/2501.00584
Authors: Zhenpeng Huang,Xinhao Li,Jiaqi Li,Jing Wang,Xiangyu Zeng,Cheng Liang,Tao Wu,Xi Chen,Liang Li,Limin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models’ ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts (past, present, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite the lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.
[CV-109] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
[Quick read]: This paper addresses the challenges that multimodal large language models (MLLMs) face with extremely long videos, particularly preserving key features while reducing the computational burden. It proposes two key components. The first is Hierarchical visual token Compression (HiCo), which exploits the redundancy of visual information in long videos to compress the video context from the clip level to the video level, significantly reducing computation while preserving key details. The second is VideoChat-Flash, a practical context modeling system designed for multimodal long-sequence processing. It uses a multi-stage short-to-long learning scheme and introduces a real-world long-video dataset named LongVid, along with an upgraded "Needle-In-A-video-Haystack" (NIAH) evaluation for testing context capacity. At the 7B model scale, VideoChat-Flash shows leading performance on mainstream long- and short-video benchmarks, and it is the first open-source model to reach 99.1% accuracy over 10,000 frames in NIAH.
Link: https://arxiv.org/abs/2501.00574
Authors: Xinhao Li,Yi Wang,Jiashuo Yu,Xiangyu Zeng,Yuhan Zhu,Haian Huang,Jianfei Gao,Kunchang Li,Yinan He,Chenting Wang,Yu Qiao,Yali Wang,Limin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Long-context modeling is a critical capability for multimodal large language models (MLLMs), enabling them to process long-form content with implicit memorization. Despite its advances, handling extremely long videos remains challenging due to the difficulty in maintaining crucial features over extended sequences. This paper introduces a Hierarchical visual token Compression (HiCo) method designed for high-fidelity representation and a practical context modeling system VideoChat-Flash tailored for multimodal long-sequence processing. HiCo capitalizes on the redundancy of visual information in long videos to compress long video context from the clip-level to the video-level, reducing the compute significantly while preserving essential details. VideoChat-Flash features a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos named LongVid, and an upgraded “Needle-In-A-video-Haystack” (NIAH) for evaluating context capacities. In extensive experiments, VideoChat-Flash shows leading performance on both mainstream long and short video benchmarks at the 7B model scale. It is the first open-source model to reach 99.1% accuracy over 10,000 frames in NIAH.
[CV-110] Probing Visual Language Priors in VLMs
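[Quick read]: This paper addresses the problem that current Vision-Language Models (VLMs) over-rely on visual language priors in their training data instead of performing true visual reasoning. The authors propose the ViLP benchmark, which uses generated images with significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts so that the images are distinctly out-of-distribution, and thereby evaluates models in scenarios that demand visual reasoning. Experiments show that modern VLMs (e.g., GPT-4) perform well below human level on ViLP. To address this, the authors propose a self-improving framework in which the model generates new visual question answering (VQA) pairs and images, then applies pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. This training objective forces VLMs to focus more on the actual visual input, which improves performance. The framework's effectiveness is demonstrated on open-source VLMs such as LLaVA-v1.5 and Cambrian.
Link: https://arxiv.org/abs/2501.00569
Authors: Tiange Luo,Ang Cao,Gunhee Lee,Justin Johnson,Honglak Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: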
Abstract:Despite recent advances in Vision-Language Models (VLMs), many still over-rely on visual language priors present in their training data rather than true visual reasoning. To examine the situation, we introduce ViLP, a visual question answering (VQA) benchmark that pairs each question with three potential answers and three corresponding images: one image whose answer can be inferred from text alone, and two images that demand visual reasoning. By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts, making our benchmark images distinctly out-of-distribution. While humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form “good-bad” image pairs for self-training. Our training objectives compel VLMs to focus more on actual visual inputs and have demonstrated their effectiveness in enhancing the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
[CV-111] Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves ECCV
[Quick read]: This paper addresses crop monitoring and intervention in precision agriculture, in particular how precise application of herbicide and fertilizer can maximize yield while reducing resource waste and environmental impact. The authors propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an indicator of plant growth) and locates weeds in an image. The key to the solution is incorporating focal loss and boundary loss to improve the segmentation of smaller instances such as leaves and weeds. The method achieves a PQ+ (Panoptic Quality) of 81.89 on the standard training set and improves leaf-counting accuracy.
Link: https://arxiv.org/abs/2501.00527
Authors: Madeleine Darbyshire,Elizabeth Sklar,Simon Parsons
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Presented at the 9th Workshop for Computer Vision in Plant Phenotyping and Agriculture (CVPPA) 2024 at the European Conference of Computer Vision (ECCV) 2024. arXiv admin note: text overlap with arXiv:2310.06582
Abstract:Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth) and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at this https URL.
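As background for the focal loss mentioned in the abstract above, here is a minimal sketch of the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). It is a generic illustration, not the paper's training code: the defaults `gamma=2.0` and `alpha=0.25` are the commonly used values from the original focal loss formulation, the values used in this paper are not stated in the abstract, and the boundary loss term is not shown.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one prediction.

    p is the predicted probability of the positive class, y is the 0/1 label.
    The (1 - p_t) ** gamma factor shrinks the loss of well-classified examples,
    so hard examples (often small instances) dominate the gradient.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a quick way to sanity-check an implementation.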
[CV-112] Is Segment Anything Model 2 All You Need for Surgery Video Segmentation? A Systematic Evaluation
[Quick read]: This paper addresses the limited performance of surgery video segmentation models caused by the lack of annotated data. The key is to use the Segment Anything Model 2 (SAM2), a large foundation model trained on natural videos, for zero-shot surgical video segmentation. The paper systematically evaluates SAM2 under different configurations, including different prompting strategies and robustness, and conducts an empirical evaluation on 9 datasets covering 17 types of surgery to explore its potential for zero-shot surgery video segmentation.
Link: https://arxiv.org/abs/2501.00525
Authors: Cheng Yuan,Jian Jiang,Kunyi Yang,Lv Wu,Rui Wang,Zi Meng,Haonan Ping,Ziyu Xu,Yifan Zhou,Wanli Song,Hesheng Wang,Qi Dou,Yutong Ban
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Surgery video segmentation is an important topic in the surgical AI field. It allows the AI model to understand the spatial information of a surgical scene. Meanwhile, due to the lack of annotated surgical data, surgery segmentation models suffer from limited performance. With the emergence of the SAM2 model, a large foundation model for video segmentation trained on natural videos, zero-shot surgical video segmentation has become more realistic but remains to be explored. In this paper, we systematically evaluate the performance of the SAM2 model on the zero-shot surgery video segmentation task. We conducted experiments under different configurations, including different prompting strategies, robustness, etc. Moreover, we conducted an empirical evaluation of the performance on 9 datasets with 17 different types of surgeries.
[CV-113] Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques
[Quick read]: This paper addresses the classification and detection of silicosis-related lung inflammation. Silicosis is an occupational lung disease caused by inhaling silica dust, and its inflammatory presentation resembles that of other lung diseases such as pneumonia, which makes the conditions hard to distinguish. The key contributions are: 1) a new chest X-ray (CXR) image dataset named SVBCX, tailored to the subtle differences in lung inflammation caused by different agents, which provides a valuable resource for silicosis and pneumonia research; 2) a novel deep-learning architecture that combines Graph Transformer Networks with a traditional deep neural network module to classify silicosis and pneumonia; and 3) the use of Balanced Cross-Entropy (BalCE) as the loss function to ensure balanced learning across classes and strengthen the model's ability to discern subtle differences in lung conditions. The paper also explores an ensemble approach that combines the strengths of several model architectures to improve classification accuracy and robustness. On the constructed dataset, the ensemble achieves a macro-F1 score of 0.9749 and AUC ROC scores above 0.99 for every class.
Link: https://arxiv.org/abs/2501.00520
Authors: Bao Q. Bui,Tien T.T. Nguyen,Duy M. Le,Cong Tran,Cuong Pham
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX that is tailored to the nuances of lung inflammation caused by distinct agents, providing a valuable resource for the silicosis and pneumonia research community; and 2) a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model’s ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.
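The macro-F1 score reported in the abstract above is the unweighted mean of per-class F1 scores, so every class counts equally regardless of how many samples it has. A minimal sketch of that computation follows; the class names in the usage example are hypothetical placeholders, since the abstract does not list the dataset's label set.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only.
truth = ["normal", "normal", "silicosis", "pneumonia"]
preds = ["normal", "silicosis", "silicosis", "pneumonia"]
score = macro_f1(truth, preds)  # per-class F1: 2/3, 2/3, 1
```

Because the average is unweighted, a rare class that is classified poorly pulls the score down as much as a common one, which is why macro-F1 is a natural companion to a class-balanced loss.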
[CV-114] Fine-grained Video-Text Retrieval: A New Benchmark and Method
[Quick read]: This paper addresses the shortcomings of existing video retrieval benchmarks (such as MSRVTT and MSVD) in evaluating the fine-grained retrieval ability of video-language models (VLMs). These benchmarks lack detailed annotations and therefore cannot effectively evaluate fine-grained retrieval along the spatial and temporal dimensions. The authors propose FIBER (FIne-grained BEnchmark for text to video Retrieval), a fine-grained benchmark of 1,000 videos sourced from the FineAction dataset. FIBER's key feature is its detailed human-annotated spatial and temporal annotations, which make it possible to evaluate the spatial and temporal bias of VLMs on video retrieval independently. In addition, the authors employ a text embedding method to unlock the fine-grained video-language understanding capability of Multimodal Large Language Models (MLLMs). Experiments show that the proposed Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks while offering stronger fine-grained representation and lower spatial-temporal bias.
Link: https://arxiv.org/abs/2501.00513
Authors: Yifan Xu,Xinhao Li,Yichun Yang,Rui Huang,Limin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:The ability to perceive fine-grained spatial and temporal information is crucial for video-language retrieval. However, the existing video retrieval benchmarks, such as MSRVTT and MSVD, fail to efficiently evaluate the fine-grained retrieval ability of video-language models (VLMs) due to a lack of detailed annotations. To address this problem, we present FIBER, a FIne-grained BEnchmark for text to video Retrieval, containing 1,000 videos sourced from the FineAction dataset. Uniquely, our FIBER benchmark provides detailed human-annotated spatial annotations and temporal annotations for each video, making it possible to independently evaluate the spatial and temporal bias of VLMs on the video retrieval task. Besides, we employ a text embedding method to unlock the capability of fine-grained video-language understanding of Multimodal Large Language Models (MLLMs). Surprisingly, the experimental results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks and has a stronger capability of fine-grained representation with lower spatial-temporal bias. Project page: this https URL.
[CV-115] SAT-LDM: Provably Generalizable Image Watermarking for Latent Diffusion Models with Self-Augmented Training
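[Quick read]: This paper addresses effective watermarking of AI-generated images for protecting intellectual property and identifying fake content. Existing training-based watermarking methods are effective to some degree, but they often generalize poorly across diverse prompts and tend to produce noticeable artifacts. The paper proposes SAT-LDM, a watermarking method for Latent Diffusion Models (LDM) based on Self-Augmented Training (SAT). It introduces a free generation distribution to align the training and testing phases, which strengthens the generalization of the watermarking module. Theoretical analysis shows that the free generation distribution contributes to a tight generalization bound without collecting new data. Experiments show that SAT-LDM achieves robust watermarking across diverse prompts while significantly improving the quality of watermarked images, offering a practical and convenient way to secure high-fidelity AI-generated content.
Link: https://arxiv.org/abs/2501.00463
Authors: Lu Zhang,Liang Zeng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 7 figures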
Abstract:The proliferation of AI-generated images necessitates effective watermarking to protect intellectual property and identify fake content. While existing training-based watermarking methods show promise, they often struggle with generalization across diverse prompts and tend to produce noticeable artifacts. To this end, we introduce a provably generalizable image watermarking method for Latent Diffusion Models with Self-Augmented Training (SAT-LDM), which aligns the training and testing phases by a free generation distribution to bolster the watermarking module’s generalization capabilities. We theoretically consolidate our method by proving that the free generation distribution contributes to its tight generalization bound without the need to collect new data. Extensive experimental results show that SAT-LDM achieves robust watermarking while significantly improving the quality of watermarked images across diverse prompts. Furthermore, we conduct experimental analyses to demonstrate the strong generalization abilities of SAT-LDM. We hope our method offers a practical and convenient solution for securing high-fidelity AI-generated content.
[CV-116] OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models ICASSP2025
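[Quick read]: This paper addresses the limitations of traditional human interaction recognition systems in scenarios such as public security surveillance. Traditional systems depend on fixed vocabularies, predefined labels, and rigid interaction categories, are usually built on choreographed videos, and cannot handle concurrent interactive groups well, so they adapt poorly to diverse and unpredictable real-world scenes. The proposed solution is the open vocabulary human-to-human interaction recognition (OV-HHIR) framework, which uses large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings, removing the fixed-vocabulary constraint. The paper also builds a large-scale unified benchmark dataset by standardizing and combining existing public human interaction datasets. Experiments show that the method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models on video understanding, laying the groundwork for more intelligent and adaptable visual understanding systems.
Link: https://arxiv.org/abs/2501.00432
Authors: Lala Shakti Swarup Ray,Bo Zhou,Sungho Suh,Paul Lukowicz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in IEEE ICASSP 2025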
Abstract:Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.
[CV-117] B2Net: Camouflaged Object Detection via Boundary Aware and Boundary Fusion
[Quick read]: This paper addresses the difficulty of Camouflaged Object Detection (COD), where objects closely resemble the background in texture and color. Most existing boundary-guided COD algorithms generate object boundaries early in the network, and inaccurate edge priors often introduce noise that hurts detection. The paper proposes a new network named B2Net, whose key components are: 1) a Residual Feature Enhanced Module (RFEM) that integrates more discriminative feature representations to improve detection accuracy and reliability; 2) a Boundary Aware Module (BAM) that explores edge cues twice by combining spatial information from low-level features with semantic information from high-level features; and 3) a Cross-scale Boundary Fusion Module (CBFM) that integrates information across scales in a top-down manner, merging boundary features with object features to obtain a comprehensive feature representation that incorporates boundary information. Experiments show that B2Net outperforms 15 existing methods on three challenging benchmark datasets.
Link: https://arxiv.org/abs/2501.00426
Authors: Junmin Cai,Han Sun,Ningzhong Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Camouflaged object detection (COD) aims to identify objects in images that are well hidden in the environment due to their high similarity to the background in terms of texture and color. However, most existing boundary-guided camouflaged object detection algorithms tend to generate object boundaries early in the network, and inaccurate edge priors often introduce noise in object detection. To address this issue, we propose a novel network named B2Net aiming to enhance the accuracy of obtained boundaries by reusing boundary-aware modules at different stages of the network. Specifically, we present a Residual Feature Enhanced Module (RFEM) with the goal of integrating more discriminative feature representations to enhance detection accuracy and reliability. After that, the Boundary Aware Module (BAM) is introduced to explore edge cues twice by integrating spatial information from low-level features and semantic information from high-level features. Finally, we design the Cross-scale Boundary Fusion Module (CBFM) that integrates information across different scales in a top-down manner, merging boundary features with object features to obtain a comprehensive feature representation incorporating boundary information. Extensive experimental results on three challenging benchmark datasets demonstrate that our proposed method B2Net outperforms 15 state-of-the-art methods under widely used evaluation metrics. Code will be made publicly available.
[CV-118] Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free
[Quick read]: This paper addresses the high computational cost and slow generation speed that iterative denoising imposes on Stable Diffusion in text-to-image generation. Existing feature caching methods are effective and simple, but they make the features of adjacent timesteps similar, which reduces feature dynamics and ultimately degrades image quality. The proposed solution is dynamics-aware token pruning (DaTo), which selectively prunes low-dynamics tokens so that only high-dynamics tokens participate in the self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, reusing information both temporally and token-wise. Experiments report a 9× speedup with a 0.33 reduction in FID (Fréchet Inception Distance) on ImageNet, and a 7× speedup with a 2.17 reduction in FID on COCO-30k.
Link: https://arxiv.org/abs/2501.00375
Authors: Evelyn Zhang,Bang Xiao,Jiayi Tang,Qianli Ma,Chang Zou,Xuefei Ning,Xuming Hu,Linfeng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamic tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9× speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7× acceleration coupled with a notable FID reduction of 2.17.
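To make the idea of keeping only high-dynamic tokens concrete, here is a rough sketch of one possible selection rule. It is an assumption-laden illustration, not DaTo itself: the abstract does not define how dynamics are measured, so this sketch scores each token by the L2 distance between its features at two adjacent timesteps, and `keep_ratio` is a hypothetical parameter.

```python
import math
from typing import List, Sequence

Token = Sequence[float]


def token_dynamics(prev_tokens: Sequence[Token], curr_tokens: Sequence[Token]) -> List[float]:
    """Assumed dynamics score: L2 change of each token between adjacent timesteps."""
    if len(prev_tokens) != len(curr_tokens):
        raise ValueError("both timesteps must contain the same number of tokens")
    return [math.dist(p, c) for p, c in zip(prev_tokens, curr_tokens)]


def select_dynamic_tokens(
    prev_tokens: Sequence[Token],
    curr_tokens: Sequence[Token],
    keep_ratio: float = 0.5,  # hypothetical fraction of tokens to keep
) -> List[int]:
    """Return the indices of the tokens that changed most, in ascending order."""
    scores = token_dynamics(prev_tokens, curr_tokens)
    keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

In a pipeline of this kind, the selected indices would be the tokens recomputed by self-attention, while the remaining tokens would reuse cached features from the previous timestep.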
[CV-119] A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images
[Quick read]: This paper addresses two key problems for instance segmentation in remote sensing images (RSIs): how to extract accurate object boundaries from remote imaging through a dynamic atmosphere, and how to integrate the mutual information of related object instances scattered over a vast spatial region. It proposes a novel Shape Guided Transformer Network (SGTN) whose key idea is to combine global context modeling with local detail extraction. Specifically, it proposes a transformer encoder called LSwin, which uses vertical and horizontal 1D global self-attention to obtain better global perception for RSIs than the local-shifted-window-based Swin Transformer. It also introduces a Shape Guidance Module (SGM) that emphasizes object boundary and shape information. By combining the local detail extraction of SGM with the global context modeling of LSwin, SGTN achieves the highest average precision (AP) scores on several public datasets.
Link: https://arxiv.org/abs/2501.00360
Authors: Dawen Yu,Shunping Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 15 figures
Abstract:Instance segmentation performance in remote sensing images (RSIs) is significantly affected by two issues: how to extract accurate boundaries of objects from remote imaging through the dynamic atmosphere, and how to integrate the mutual information of related object instances scattered over a vast spatial region. In this study, we propose a novel Shape Guided Transformer Network (SGTN) to accurately extract objects at the instance level. Inspired by the global contextual modeling capacity of the self-attention mechanism, we propose an effective transformer encoder termed LSwin, which incorporates vertical and horizontal 1D global self-attention mechanisms to obtain better global-perception capacity for RSIs than the popular local-shifted-window based Swin Transformer. To achieve accurate instance mask segmentation, we introduce a shape guidance module (SGM) to emphasize the object boundary and shape information. The combination of SGM, which emphasizes the local detail information, and LSwin, which focuses on the global context relationships, achieves excellent RSI instance segmentation. Their effectiveness was validated through comprehensive ablation experiments. In particular, LSwin proved better than the popular ResNet and Swin Transformer encoders at the same level of efficiency. Compared to other instance segmentation methods, our SGTN achieves the highest average precision (AP) scores on two single-class public datasets (WHU dataset and BITCC dataset) and a multi-class public dataset (NWPU VHR-10 dataset). Code will be available at this http URL.
[CV-120] Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
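[Quick read]: This paper addresses understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior work that used only egocentric video for long-form video understanding, it proposes Embodied VideoAgent, an agent based on a large language model (LLM). The agent builds scene memory from egocentric video together with embodied sensory inputs (such as depth and pose sensing), and adds a vision-language model (VLM) based approach that automatically updates the memory when actions or activities on objects are perceived. Embodied VideoAgent shows significant advantages on challenging reasoning and planning tasks in 3D scenes, with gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on EnvQA. It also shows potential in embodied AI tasks such as generating embodied interactions and perception for robot manipulation.
Link: https://arxiv.org/abs/2501.00358
Authors: Yue Fan,Xiaojian Ma,Rongpeng Su,Jun Guo,Rujie Wu,Xi Chen,Qing Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: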
Abstract:This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
[CV-121] PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM
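[Quick read]: This paper addresses understanding geometric, semantic, and instance information in 3D scenes from sequential video data, which is essential for robotics and augmented reality. Existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction, whereas PanoSLAM is the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation in a unified framework. The key is a modified approach built on 3D Gaussian Splatting, combined with an online Spatial-Temporal Lifting (STL) module. The STL module transfers 2D panoptic predictions from vision models into 3D Gaussian representations and handles label noise and inconsistencies in the 2D predictions by refining pseudo labels across multi-view inputs, which produces a coherent 3D representation and improves segmentation accuracy. Experiments show that PanoSLAM outperforms existing semantic SLAM methods in mapping and tracking accuracy and, for the first time, achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video.
Link: https://arxiv.org/abs/2501.00352
Authors: Runnan Chen,Zhaoqing Wang,Jiepeng Wang,Yuexin Ma,Mingming Gong,Wenping Wang,Tongliang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: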
Abstract:Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (this https URL)
[CV-122] CNC: Cross-modal Normality Constraint for Unsupervised Multi-class Anomaly Detection AAAI2025
[Quick read]: This paper addresses the performance drop that decoder over-generalization (OG) causes for unsupervised distillation methods in multi-class anomaly detection. Existing unsupervised distillation methods rely on differences between encoded and decoded features to locate abnormal regions in test images, but in multi-class training the decoder, even when trained only on normal samples, still reconstructs the features of abnormal regions well, which lowers detection performance. The key to the proposed solution is to use class-agnostic learnable prompts to capture the normality shared across different visual patterns, and to use these prompts to guide decoded features toward a normal textual representation, suppressing the decoder's over-generalization to abnormal patterns. The paper also introduces a gated mixture-of-experts module that specializes in diverse local patterns and reduces their mutual interference in multi-class training. The method performs competitively on the MVTec AD and VisA datasets.
Link: https://arxiv.org/abs/2501.00346
Authors: Xiaolei Wang,Xiaoyang Wang,Huihui Bai,Eng Gee Lim,Jimin Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025
Abstract:Existing unsupervised distillation-based methods rely on the differences between encoded and decoded features to locate abnormal regions in test images. However, the decoder trained only on normal samples still reconstructs abnormal patch features well, degrading performance. This issue is particularly pronounced in unsupervised multi-class anomaly detection tasks. We attribute this behavior to over-generalization (OG) of the decoder: the significantly increasing diversity of patch patterns in multi-class training enhances the model generalization on normal patches, but also inadvertently broadens its generalization to abnormal patches. To mitigate OG, we propose a novel approach that leverages class-agnostic learnable prompts to capture common textual normality across various visual patterns, and then applies them to guide the decoded features towards a normal textual representation, suppressing over-generalization of the decoder on abnormal patterns. To further improve performance, we also introduce a gated mixture-of-experts module to specialize in handling diverse patch patterns and reduce mutual interference between them in multi-class training. Our method achieves competitive performance on the MVTec AD and VisA datasets, demonstrating its effectiveness.
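The abstract above starts from the idea that differences between encoded and decoded features locate abnormal regions. A common way to turn that into a per-patch score is one minus the cosine similarity between the two feature vectors. The sketch below shows that generic scoring step only, under the assumption that cosine distance is the discrepancy measure; the abstract does not state which measure the paper uses, and none of the proposed prompt or mixture-of-experts components are shown.

```python
import math
from typing import List, Sequence

Feature = Sequence[float]


def cosine_similarity(a: Feature, b: Feature) -> float:
    """Cosine similarity, defined as 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def anomaly_scores(encoded: Sequence[Feature], decoded: Sequence[Feature]) -> List[float]:
    """Per-patch score: 0 when the decoder matches the encoder, larger when it does not."""
    if len(encoded) != len(decoded):
        raise ValueError("encoded and decoded must contain the same number of patches")
    return [1.0 - cosine_similarity(e, d) for e, d in zip(encoded, decoded)]
```

Over-generalization, in these terms, means the decoder also reproduces abnormal patches well, so their scores stay near zero and the anomaly goes undetected.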
[CV-123] SG-Splatting: Accelerating 3D Gaussian Splatting with Spherical Gaussians
[Quick read]: This paper addresses the large storage demands, high computational overhead, large memory footprint, and slow rendering that 3D Gaussian Splatting incurs in novel view synthesis because it relies on third-degree spherical harmonics for color representation. The key is to replace the third-degree spherical harmonics with a color representation based on Spherical Gaussians (SG), which greatly reduces the number of parameters needed for color and significantly accelerates rendering. The paper also proposes an efficient strategy for organizing multiple Spherical Gaussians, optimizing their arrangement for a balanced and accurate scene representation. To further improve rendering quality, the authors propose a mixed representation that combines Spherical Gaussians with low-degree spherical harmonics to capture both high- and low-frequency color information. SG-Splatting is plug-and-play and easy to integrate into existing systems, improving computational efficiency and visual fidelity for real-time applications.
Link: https://arxiv.org/abs/2501.00342
Authors: Yiwen Wang,Siyuan Chen,Ran Yi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting is emerging as a state-of-the-art technique in novel view synthesis, recognized for its impressive balance between visual quality, speed, and rendering efficiency. However, reliance on third-degree spherical harmonics for color representation introduces significant storage demands and computational overhead, resulting in a large memory footprint and slower rendering speed. We introduce SG-Splatting, a novel approach that uses a color representation based on Spherical Gaussians to enhance rendering speed and quality in novel view synthesis. Our method first represents view-dependent color using Spherical Gaussians, instead of third-degree spherical harmonics, which largely reduces the number of parameters used for color representation, and significantly accelerates the rendering process. We then develop an efficient strategy for organizing multiple Spherical Gaussians, optimizing their arrangement to achieve a balanced and accurate scene representation. To further improve rendering quality, we propose a mixed representation that combines Spherical Gaussians with low-degree spherical harmonics, capturing both high- and low-frequency color information effectively. SG-Splatting also has plug-and-play capability, allowing it to be easily integrated into existing systems. This approach improves computational efficiency and overall visual fidelity, making it a practical solution for real-time applications.
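To give a feel for the parameter saving described in the abstract above, the sketch below counts spherical harmonic coefficients and evaluates a single spherical Gaussian lobe in its standard form, amplitude * exp(sharpness * (dot(axis, direction) - 1)). This is a generic illustration under stated assumptions: the function names are made up for this example, and the abstract does not give the paper's exact parameterization or the number of lobes it uses.

```python
import math
from typing import Sequence, Tuple


def sh_coefficient_count(degree: int) -> int:
    """Number of spherical harmonic basis functions up to a given degree, per channel."""
    return (degree + 1) ** 2


def spherical_gaussian(
    direction: Sequence[float],  # unit view direction
    axis: Sequence[float],       # unit lobe axis
    sharpness: float,            # larger values give a narrower lobe
    amplitude: Sequence[float],  # per-channel color amplitude, e.g. RGB
) -> Tuple[float, ...]:
    """Evaluate one spherical Gaussian lobe for a view direction."""
    cos_angle = sum(d * a for d, a in zip(direction, axis))
    weight = math.exp(sharpness * (cos_angle - 1.0))
    return tuple(c * weight for c in amplitude)


# Third-degree SH needs 16 coefficients per channel, so 48 values for RGB.
sh_values_rgb = 3 * sh_coefficient_count(3)
# One lobe with a 3D axis, a scalar sharpness, and an RGB amplitude stores 7 values.
sg_values_one_lobe = 3 + 1 + 3
```

The lobe peaks at its amplitude when the view direction coincides with the axis and decays smoothly away from it, which is how a few lobes can model view-dependent highlights with far fewer stored values than high-degree spherical harmonics.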
[CV-124] Dynamic Prompt Adjustment for Multi-Label Class-Incremental Learning
[Quick Read]: This paper addresses catastrophic forgetting in Multi-Label Class-Incremental Learning (MLCIL). MLCIL is more challenging than Single-Label Class-Incremental Learning (SLCIL) and more common in practice. The authors combine an improved data replay mechanism with a prompt loss. Specifically, the model enhances the prompt information to better suit multi-label classification and uses a confidence-based replay strategy to select representative samples, while the prompt loss significantly reduces forgetting of previously learned knowledge. Experiments show that the method substantially improves MLCIL performance on multiple benchmark datasets.
Link: https://arxiv.org/abs/2501.00340
Authors: Haifeng Zhao,Yuguang Jin,Leilei Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: published to BICS2024
Abstract:Significant advancements have been made in single-label class-incremental learning (SLCIL), yet the more practical and challenging multi-label class-incremental learning (MLCIL) remains understudied. Recently, visual language models such as CLIP have achieved good results in classification tasks. However, directly using CLIP to solve the MLCIL problem can lead to catastrophic forgetting. To tackle this issue, we integrate an improved data replay mechanism and prompt loss to curb knowledge forgetting. Specifically, our model enhances the prompt information to better adapt to multi-label classification tasks and employs a confidence-based replay strategy to select representative samples. Moreover, the prompt loss significantly reduces the model's forgetting of previous knowledge. Experimental results demonstrate that our method has substantially improved the performance of MLCIL tasks across multiple benchmark datasets, validating its effectiveness.
[CV-125] OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies
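[Quick Read]: This paper addresses the limited cross-scene generalization of existing open-vocabulary scene understanding methods built on 3D Gaussian (3DGS) representations. Existing methods typically distill knowledge from large 2D vision models into the 3DGS representation of a single scene, which restricts open-vocabulary querying in novel scenes. The paper proposes OVGaussian, a generalizable open-vocabulary 3D semantic segmentation framework based on 3D Gaussians. Its key components are: 1) SegGaussian, a large-scale 3D scene dataset that provides detailed semantic and instance annotations for both Gaussian points and multi-view images; 2) Generalizable Semantic Rasterization (GSR), which uses a 3D neural network to learn and predict a semantic property for each 3D Gaussian point and renders it into multi-view-consistent 2D semantic maps; 3) a Cross-modal Consistency Learning (CCL) framework, which uses the open-vocabulary annotations of 2D images and 3D Gaussians in SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across scenes. Experiments show that OVGaussian significantly outperforms baseline methods in cross-scene, cross-domain, and novel-view generalization.
Link: https://arxiv.org/abs/2501.00326
Authors: Runnan Chen,Xiangyu Sun,Zhaoqing Wang,Youquan Liu,Jiepeng Wang,Lingdong Kong,Jiankang Deng,Mingming Gong,Liang Pan,Wenping Wang,Tongliang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: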
Abstract:Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, restricting the capabilities of open-vocabulary querying to their training scenes, so they lack generalizability to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train the 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (this https URL).
[CV-126] OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
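[Quick Read]: This paper addresses the limitations of current evaluations of the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs). Existing benchmarks have shown strong LMM performance on text recognition, but abilities on harder tasks such as text localization, handwritten content extraction, and logical reasoning remain underexplored. The authors introduce OCRBench v2, a large-scale bilingual text-centric benchmark that covers 31 diverse scenarios (street scenes, receipts, formulas, diagrams, and so on) and contains 10,000 human-verified question-answer pairs with a high proportion of difficult samples. Its key strengths are task coverage (4x more tasks than the previous OCRBench) and thorough evaluation metrics, which allow a more complete assessment of LMMs on complex OCR tasks. Using the benchmark, the authors find that 20 of 22 LMMs score below 50 out of 100, and they identify five kinds of limitations: recognition of less frequently encountered text, fine-grained perception, layout perception, complex element parsing, and logical reasoning.
Link: https://arxiv.org/abs/2501.00321
Authors: Ling Fu,Biao Yang,Zhebin Kuang,Jiajun Song,Yuzhe Li,Linghao Zhu,Qidi Luo,Xinyu Wang,Hao Lu,Mingxin Huang,Zhang Li,Guozhi Tang,Bin Shan,Chunhui Lin,Qi Liu,Binghong Wu,Hao Feng,Hao Liu,Can Huang,Jingqun Tang,Wei Chen,Lianwen Jin,Yuliang Liu,Xiang Bai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: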
Abstract:Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at this https URL.
[CV-127] Improving Text-based Person Search via Part-level Cross-modal Correspondence
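[Quick Read]: This paper addresses two main challenges in text-based person search. First, the large modality gap between target images and text queries makes it hard to establish correspondence and to distinguish subtle differences between people. Second, when only person IDs are used as supervision for fine-grained learning, similar body parts of different individuals are wrongly treated as different because part-level supervision is missing. The paper proposes an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors that are semantically aligned across the two modalities without alignment supervision. It also proposes a new ranking loss, the commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it while learning fine-grained body part details. The method achieves the best results on three public benchmarks.
Link: https://arxiv.org/abs/2501.00318
Authors: Jicheol Park,Boseung Jeong,Dongwon Kim,Suha Kwak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: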
Abstract:Text-based person search is the task of finding person images that are the most relevant to the natural language text description given as query. The main challenge of this task is a large gap between the target images and text queries, which makes it difficult to establish correspondence and distinguish subtle differences across people. To address this challenge, we introduce an efficient encoder-decoder model that extracts coarse-to-fine embedding vectors which are semantically aligned across the two modalities without supervision for the alignment. There is another challenge of learning to capture fine-grained information with only person IDs as supervision, where similar body parts of different individuals are considered different due to the lack of part-level supervision. To tackle this, we propose a novel ranking loss, dubbed commonality-based margin ranking loss, which quantifies the degree of commonality of each body part and reflects it during the learning of fine-grained body part details. As a consequence, it enables our method to achieve the best records on three public benchmarks.
[CV-128] Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction
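[Quick Read]: This paper addresses how existing Graph Convolutional Network (GCN) based methods for human motion prediction (HMP) capture spatio-temporal features. Existing methods usually focus on either temporal-domain or spatial-domain features, or combine the two without fully exploiting their complementarity and cross-dependency. The paper proposes the Spatial-Temporal Multi-Subgraph GCN (STMS-GCN), which decouples the modeling of temporal and spatial dependencies and introduces a spatio-temporal information consistency constraint mechanism to enable cross-domain knowledge transfer at multiple scales. The method also uses multiple subgraphs to extract richer motion information and strengthens the learned associations between subgraphs through a homogeneous information constraint mechanism. Experiments on standard HMP benchmarks demonstrate the superiority of the method.
Link: https://arxiv.org/abs/2501.00317
Authors: Jiexin Wang,Yiju Guo,Bing Su
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: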
Abstract:Human motion prediction (HMP) involves forecasting future human motion based on historical data. Graph Convolutional Networks (GCNs) have garnered widespread attention in this field for their proficiency in capturing relationships among joints in human motion. However, existing GCN-based methods tend to focus on either temporal-domain or spatial-domain features, or they combine spatio-temporal features without fully leveraging the complementarity and cross-dependency of these two features. In this paper, we propose the Spatial-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, we decouple the modeling of temporal and spatial dependencies, enabling cross-domain knowledge transfer at multiple scales through a spatio-temporal information consistency constraint mechanism. Besides, we utilize multiple subgraphs to extract richer motion information and enhance the learning associations of diverse subgraphs through a homogeneous information constraint mechanism. Extensive experiments on the standard HMP benchmarks demonstrate the superiority of our method.
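[CV-129] Temporal Dynamics Decoupling with Inverse Processing for Enhancing Human Motion Prediction
[Quick Read]: This paper addresses how to bridge historical and future motion behaviors in human motion prediction. Existing methods usually add a reconstruction task to the decoder as an auxiliary task to improve the modeling of spatio-temporal dependencies, but they overlook the potential conflict between reconstruction and prediction. The paper proposes Temporal Decoupling Decoding with Inverse Processing (TD^2IP). The key idea is to separate the reconstruction and prediction decoding processes, using distinct decoders to decode the shared motion features into historical or future sequences. In addition, inverse processing reverses the motion information along the temporal dimension and feeds it back into the model, exploiting the bidirectional temporal correlation of human motion. This alleviates the conflict between the two tasks and strengthens the association between historical and future information, so TD^2IP fosters a deeper understanding of motion patterns.
Link: https://arxiv.org/abs/2501.00315
Authors: Jiexin Wang,Yiju Guo,Bing Su
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: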
Abstract:Exploring the bridge between historical and future motion behaviors remains a central challenge in human motion prediction. While most existing methods incorporate a reconstruction task as an auxiliary task into the decoder, thereby improving the modeling of spatio-temporal dependencies, they overlook the potential conflicts between reconstruction and prediction tasks. In this paper, we propose a novel approach: Temporal Decoupling Decoding with Inverse Processing (TD^2IP). Our method strategically separates reconstruction and prediction decoding processes, employing distinct decoders to decode the shared motion features into historical or future sequences. Additionally, inverse processing reverses motion information in the temporal dimension and reintroduces it into the model, leveraging the bidirectional temporal correlation of human motion behaviors. By alleviating the conflicts between reconstruction and prediction tasks and enhancing the association of historical and future information, TD^2IP fosters a deeper understanding of motion patterns. Extensive experiments demonstrate the adaptability of our method within existing methods.
[CV-130] SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation AAAI2025
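[Quick Read]: This paper addresses the main challenge of Cross-Domain Few-Shot Segmentation (CD-FSS): the domain disparity between training and inference, which can exist in the input data or in the target classes and makes it hard for existing models to learn, from limited training-domain samples, feature representations that generalize to unknown domains. The paper proposes a Graph Prompt Reasoning Network (GPRN) built on the large-scale vision model SAM (Segment Anything Model). The key idea is to fully exploit SAM's generalization ability: a SAM-aware Prompt Initialization (SPI) module turns the masks generated by SAM into visual prompts enriched with high-level semantic information. Because SAM may split one object into several sub-regions and cause semantic inconsistency, the paper further proposes a Graph Prompt Reasoning (GPR) module that builds a graph among the visual prompts to reason about their relationships and achieve global semantic consistency. Finally, a non-parametric Adaptive Point Selection (APS) module selects representative point prompts at test time to refine the segmentation. The method achieves new state-of-the-art results on four standard CD-FSS datasets.
Link: https://arxiv.org/abs/2501.00303
Authors: Shi-Feng Peng,Guolei Sun,Yong Li,Hongsong Wang,Guo-Sen Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI 2025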
Abstract:The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parameter adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: this https URL.
[CV-131] Research on vehicle detection based on improved YOLOv8 network
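[Quick Read]: This paper addresses insufficient vehicle recognition accuracy in autonomous driving systems, particularly in complex and variable real road environments where the diversity of vehicles and pedestrians makes accurate detection difficult. It proposes an improved YOLOv8 vehicle detection method with three key changes: first, the backbone of the YOLOv8n-seg model is replaced with FasterNet to reduce computational complexity and memory use while improving detection accuracy and speed; second, the CBAM (Convolutional Block Attention Module) attention mechanism is added to the Neck to enhance feature extraction; third, the loss function is changed from CIoU (Complete Intersection over Union) to WIoU (Wise-IoU) to improve box localization and segmentation accuracy. The improved model reaches detection accuracies of 98.3%, 89.1%, and 88.4% for cars, pedestrians, and motorcycles respectively, and the summary reports clear gains over the unmodified model and YOLOv9 on six metrics including Precision.
Link: https://arxiv.org/abs/2501.00300
Authors: Haocheng Guo,Yaqiong Zhang,Lieyang Chen,Arfat Ahmad Khan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: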
Abstract:The key to ensuring the safe obstacle avoidance function of autonomous driving systems lies in the use of extremely accurate vehicle recognition techniques. However, the variability of the actual road environment and the diverse characteristics of vehicles and pedestrians together constitute a huge obstacle to improving detection accuracy, posing a serious challenge to the realization of this goal. To address the above issues, this paper proposes an improved YOLOv8 vehicle detection method. Specifically, taking the YOLOv8n-seg model as the base model, firstly, the FasterNet network is used to replace the backbone network to achieve the purpose of reducing the computational complexity and memory while improving the detection accuracy and speed; secondly, the feature enhancement is achieved by adding the attention mechanism CBAM to the Neck; and lastly, the loss function CIoU is modified to WIoU, which optimizes the detection box localization while improving the segmentation accuracy. The results show that the improved model achieves 98.3%, 89.1% and 88.4% detection accuracy for car, Person and Motorcycle. Compared with the pre-improvement and YOLOv9 models in six metrics such as Precision.
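For readers unfamiliar with the box-regression losses mentioned here, the following is a minimal sketch of plain IoU and the standard CIoU loss that the paper starts from. Boxes are assumed to be (x1, y1, x2, y2) tuples with positive width and height; the function names are illustrative, and the WIoU variant the paper switches to is not shown.

```python
import math

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss: 1 - IoU + center-distance penalty + aspect-ratio penalty."""
    i = iou(pred, gt)
    # Squared distance between the two box centers.
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - i) + v + eps)
    return 1 - i + rho2 / c2 + alpha * v
```

Unlike plain IoU, the loss still carries a gradient signal for non-overlapping boxes, because the center-distance term grows as the boxes move apart.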
[CV-132] Predicate Invention from Pixels via Pretrained Vision-Language Models AAAI2025
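[Quick Read]: This paper addresses long-horizon decision-making from raw sensor input (images) in highly variable, combinatorially complex robotics domains. Prior work has shown that one can learn a structured abstract transition model in the form of symbolic predicates and operators and then plan within that model to solve new tasks at test time. However, such learned models cannot be grounded directly in pixels from only a handful of demonstrations. The paper proposes to use pretrained vision-language models (VLMs) to invent predicates that operate directly on input images. The key idea is that, given a set of demonstrations, a VLM can propose a set of predicates potentially relevant to decision-making and then determine their truth values in both the demonstrations and new image inputs. The work extends an existing predicate invention framework, which generates feature-based predicates over object-centric states, to also generate visual predicates that operate on images. Experiments show that the approach, pix2pred, invents semantically meaningful predicates that enable generalization to novel, complex, and long-horizon tasks in two simulated robotic environments.
Link: https://arxiv.org/abs/2501.00296
Authors: Ashay Athalye,Nishanth Kumar,Tom Silver,Yichao Liang,Tomás Lozano-Pérez,Leslie Pack Kaelbling
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Workshop on Planning in the Era of LLMs (LM4Plan @ AAAI 2025)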
Abstract:Our aim is to learn to solve long-horizon decision-making problems in highly-variable, combinatorially-complex robotics domains given raw sensor input in the form of images. Previous work has shown that one way to achieve this aim is to learn a structured abstract transition model in the form of symbolic predicates and operators, and then plan within this model to solve novel tasks at test time. However, these learned models do not ground directly into pixels from just a handful of demonstrations. In this work, we propose to invent predicates that operate directly over input images by leveraging the capabilities of pretrained vision-language models (VLMs). Our key idea is that, given a set of demonstrations, a VLM can be used to propose a set of predicates that are potentially relevant for decision-making and then to determine the truth values of these predicates in both the given demonstrations and new image inputs. We build upon an existing framework for predicate invention, which generates feature-based predicates operating on object-centric states, to also generate visual predicates that operate on images. Experimentally, we show that our approach – pix2pred – is able to invent semantically meaningful predicates that enable generalization to novel, complex, and long-horizon tasks across two simulated robotic environments.
[CV-133] Dual Diffusion for Unified Image Generation and Understanding
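[Quick Read]: This paper addresses the gap between diffusion models and autoregressive vision-language models on multimodal understanding and generation. It proposes a large-scale, fully end-to-end diffusion model for multimodal understanding and generation that significantly improves on existing diffusion-based multimodal models. The key is a cross-modal maximum likelihood estimation framework that trains the conditional likelihoods of images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. This design makes the model highly flexible, able to perform image generation, image captioning, and visual question answering. Experiments show that the model is competitive on unified image understanding and generation, demonstrating the potential of multimodal diffusion models as an alternative to autoregressive next-token prediction models.
Link: https://arxiv.org/abs/2501.00289
Authors: Zijie Li,Henry Li,Yichun Shi,Amir Barati Farimani,Yuval Kluger,Linjie Yang,Peng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: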
Abstract:Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation that significantly improves on existing diffusion-based multimodal models, and is the first of its kind to support the full suite of vision-language modeling capabilities. Inspired by the multimodal diffusion transformer (MM-DiT) and recent advances in discrete diffusion language modeling, we leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly under a single loss function, which is back-propagated through both branches of the diffusion transformer. The resulting model is highly flexible and capable of a wide range of tasks including image generation, captioning, and visual question answering. Our model attained competitive performance compared to recent unified image understanding and generation models, demonstrating the potential of multimodal diffusion modeling as a promising alternative to autoregressive next-token prediction models.
[CV-134] Outlier-Robust Training of Machine Learning Models
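[Quick Read]: This paper addresses how to train machine learning models robustly in the presence of outliers. Outliers can significantly degrade model performance, so loss functions are needed that mitigate their influence. The paper brings together two ways of designing robust losses, M-estimation (popular in robotics and computer vision) and the risk-minimization framework (popular in deep learning), under a unified view. The key contributions are: 1) a modification of the Black-Rangarajan duality that yields a unified definition of a robust loss kernel satisfied by losses from both literatures; 2) an Adaptive Alternation Algorithm (AAA), based on the modified duality, that iteratively trains the model on a weighted version of the non-robust loss while updating the weights, avoiding complex parameter tuning; 3) an analysis of the algorithm's convergence to outlier-free optima, showing that robust loss kernels enlarge the region of convergence. Experiments show the algorithm works well on regression, classification, and neural scene reconstruction problems.
Link: https://arxiv.org/abs/2501.00265
Authors: Rajat Talak,Charis Georgiou,Jingnan Shi,Luca Carlone
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: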
Abstract:Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel \sigma that is satisfied by robust losses in both the literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels \sigma increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: this https URL.
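The alternation between fitting a weighted non-robust loss and updating the weights can be illustrated with classical iteratively reweighted least squares. The sketch below is not the paper's AAA algorithm and omits its parameter update rule; it assumes a Geman-McClure style weight with a fixed, hand-picked scale `c`, and all names are illustrative.

```python
import numpy as np

def geman_mcclure_weight(residual, c=1.0):
    """Weight induced by a Geman-McClure style robust loss: near 1 for small residuals, near 0 for large ones."""
    return (c ** 2 / (c ** 2 + residual ** 2)) ** 2

def robust_line_fit(x, y, c=1.0, iters=20):
    """Fit y = slope * x + intercept by alternating weighted least squares and weight updates."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    w = np.ones_like(y)
    for _ in range(iters):
        sw = np.sqrt(w)
        # Step 1: minimize the weighted squared (non-robust) loss with weights fixed.
        theta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        # Step 2: update the weights from the current residuals.
        w = geman_mcclure_weight(y - A @ theta, c)
    return theta, w
```

On data that follows a line except for one grossly corrupted point, the weight of that point shrinks toward zero over the iterations, which loosely matches the abstract's reading of weights as inlier probabilities.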
[CV-135] Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition ICASSP2025
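[Quick Read]: This paper addresses the information loss caused by token reduction in Ultra-Fine-Grained Image Recognition (UFGIR). UFGIR classifies sub-categories within a species (such as cultivars of a plant), whereas traditional fine-grained image recognition (FGIR) mainly classifies different species. Vision Transformer based backbones perform well on this task, but high-resolution images greatly increase the computational cost. Token reduction is widely used to lower that cost, yet dropping tokens loses information that matters for fine-grained categories, especially at low keep rates. The paper proposes a Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism that recover and access information from earlier layers to compensate for the loss. Experiments show that the method can reduce the token keep rate to as low as 10% while keeping accuracy competitive with state-of-the-art models, giving a better trade-off between accuracy and computational cost.
Link: https://arxiv.org/abs/2501.00243
Authors: Edwin Arkel Rios,Jansen Christopher Yuanda,Vincent Leon Ghanz,Cheng-Wei Yu,Bo-Cheng Lai,Min-Chun Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICASSP 2025. Main: 5 pages, 4 figures, 1 table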
Abstract:Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes beyond by classifying sub-categories within a species such as cultivars of a plant. In recent times the usage of Vision Transformer-based backbones has allowed methods to obtain outstanding recognition performances in this task, but this comes at a significant cost in terms of computation, especially since this task significantly benefits from incorporating higher resolution images. Therefore, techniques such as token reduction have emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. Therefore, to counteract the loss of information brought by the usage of token reduction we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining a competitive accuracy to state-of-the-art models. Code is available at: this https URL
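A small sketch may help show what a keep rate means and why caching per-layer information can offset dropped tokens. Everything here is an illustrative assumption: tokens are scored by a caller-supplied function, the "cache" stores a mean-pooled summary per layer, and the final feature is a plain concatenation. The paper's actual head and cache designs may differ.

```python
import numpy as np

def keep_top_tokens(tokens, scores, keep_rate):
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    tokens: (N, D) array, scores: (N,) importance values, keep_rate in (0, 1].
    """
    n_keep = max(1, int(round(keep_rate * len(tokens))))
    idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[idx], idx

def forward_with_cache(tokens, score_fn, keep_rate, num_layers):
    """Drop tokens at every layer, but cache a pooled summary before each drop."""
    cache = []
    for _ in range(num_layers):
        cache.append(tokens.mean(axis=0))  # information that would otherwise be lost
        tokens, _ = keep_top_tokens(tokens, score_fn(tokens), keep_rate)
    cache.append(tokens.mean(axis=0))
    # Hypothetical aggregation: the classifier sees summaries from all layers.
    return np.concatenate(cache)
```

A real transformer would also apply attention and MLP blocks between the drops; they are left out so that the token bookkeeping stays visible.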
[CV-136] Make Domain Shift a Catastrophic Forgetting Alleviator in Class-Incremental Learning AAAI2025
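[Quick Read]: This paper addresses catastrophic forgetting in Class-Incremental Learning (CIL), where a model tends to forget knowledge of earlier tasks while learning new ones. The authors find that introducing domain shift into CIL tasks reduces the forgetting rate: it separates the feature distributions of different tasks more clearly and reduces parameter interference during learning. Based on this observation, they propose a simple yet effective method named DisCo. DisCo introduces a lightweight prototype pool and uses contrastive learning to push the feature distribution of the current task away from those of previous tasks, which mitigates interference across tasks. The method can be easily integrated into existing CIL methods, and experiments show that it substantially improves the performance of several of them, supporting its benefits in separating feature representations and reducing interference.
Link: https://arxiv.org/abs/2501.00237
Authors: Wei Chen,Yi Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted as poster paper of AAAI2025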
Abstract:In the realm of class-incremental learning (CIL), alleviating the catastrophic forgetting problem is a pivotal challenge. This paper discovers a counter-intuitive observation: by incorporating domain shift into CIL tasks, the forgetting rate is significantly reduced. Our comprehensive studies demonstrate that incorporating domain shift leads to a clearer separation in the feature distribution across tasks and helps reduce parameter interference during the learning process. Inspired by this observation, we propose a simple yet effective method named DisCo to deal with CIL tasks. DisCo introduces a lightweight prototype pool that utilizes contrastive learning to promote distinct feature distributions for the current task relative to previous ones, effectively mitigating interference across tasks. DisCo can be easily integrated into existing state-of-the-art class-incremental learning methods. Experimental results show that incorporating our method into various CIL methods achieves substantial performance improvements, validating the benefits of our approach in enhancing class-incremental learning by separating feature representation and reducing interference. These findings illustrate that DisCo can serve as a robust fashion for future research in class-incremental learning.
[CV-137] DecoratingFusion: A LiDAR-Camera Fusion Network with the Combination of Point-level and Feature-level Fusion ICANN2024
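[Quick Read]: This paper addresses two problems in LiDAR-camera fusion for autonomous driving: the learned soft association lacks interpretability, and the hard association is neglected. State-of-the-art fusion methods mostly fuse at the feature level and rely on a learned soft association between point clouds and images, which makes the fusion process opaque and fails to exploit the hard association between the two. The proposed solution combines feature-level fusion with point-level fusion and uses the hard association established by the calibration matrices to guide the generation of object queries. In the early fusion stage, 2D CNN image features are used to decorate the point cloud data, and two independent sparse convolutions extract the decorated point cloud features. In the mid-level fusion stage, queries are initialized with a center heatmap, and the predicted class labels are embedded into the queries as auxiliary information so that the initial positions are closer to the actual object centers. Experiments show superior performance on the KITTI and Waymo datasets.
Link: https://arxiv.org/abs/2501.00220
Authors: Zixuan Yin,Han Sun,Ningzhong Liu,Huiyu Zhou,Jiaquan Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures. accepted by ICANN2024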
Abstract:Lidars and cameras play essential roles in autonomous driving, offering complementary information for 3D detection. The state-of-the-art fusion methods integrate them at the feature level, but they mostly rely on the learned soft association between point clouds and images, which lacks interpretability and neglects the hard association between them. In this paper, we combine feature-level fusion with point-level fusion, using hard association established by the calibration matrices to guide the generation of object queries. Specifically, in the early fusion stage, we use the 2D CNN features of images to decorate the point cloud data, and employ two independent sparse convolutions to extract the decorated point cloud features. In the mid-level fusion stage, we initialize the queries with a center heatmap and embed the predicted class labels as auxiliary information into the queries, making the initial positions closer to the actual centers of the targets. Extensive experiments conducted on two popular datasets, i.e. KITTI, Waymo, demonstrate the superiority of DecoratingFusion.
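The hard association and the point decoration step can be sketched as a projection followed by a feature lookup. This is a minimal illustration under stated assumptions: `proj` is a single 3x4 matrix that already combines extrinsics and intrinsics, features are sampled at the nearest pixel, and the function name is hypothetical rather than taken from the paper.

```python
import numpy as np

def decorate_points(points, proj, feat_map):
    """Append image features to each LiDAR point that projects into the image.

    points: (N, 3) coordinates, proj: (3, 4) calibration matrix mapping homogeneous
    points to pixel coordinates, feat_map: (H, W, C) image features.
    Returns the (N, 3 + C) decorated points and a boolean validity mask.
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])
    uvw = homo @ proj.T
    depth = uvw[:, 2]
    h, w, c = feat_map.shape
    valid = depth > 0  # points behind the camera have no pixel
    u = np.zeros(n, dtype=int)
    v = np.zeros(n, dtype=int)
    u[valid] = np.round(uvw[valid, 0] / depth[valid]).astype(int)
    v[valid] = np.round(uvw[valid, 1] / depth[valid]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    feats = np.zeros((n, c), dtype=feat_map.dtype)
    feats[valid] = feat_map[v[valid], u[valid]]  # row index is v, column index is u
    return np.hstack([points, feats]), valid
```

Points that fall outside the image keep zero features here; how the paper treats such points is not stated in the abstract.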
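[CV-138] TrajLearn: Trajectory Prediction Learning using Deep Generative Models
[Quick Read]: This paper addresses two main challenges in trajectory prediction: managing complex spatial dependencies and adapting to dynamic environments. Trajectory prediction estimates an entity's future path from its current position and historical movement data, which matters for autonomous navigation, robotics, and human movement analytics. Existing deep learning methods can model movement patterns from large-scale trajectory datasets but still struggle with these two challenges.
The proposed solution is TrajLearn, whose key innovation is generative modeling of higher-order mobility flows based on a hexagonal spatial representation. The model predicts the next k steps using a customized beam search that explores multiple potential paths while maintaining spatial continuity. TrajLearn also introduces a new algorithm that generates mixed-resolution maps by hierarchically subdividing hexagonal regions into finer segments, applying finer resolution to areas of interest or high activity (such as urban centers) and coarser resolution to less significant regions (such as rural areas), which reduces data storage requirements and computational overhead. In a rigorous evaluation, TrajLearn achieves significant gains of up to about 40% across multiple real-world trajectory datasets.
Link: https://arxiv.org/abs/2501.00184
Authors: Amirhossein Nadiri,Jing Li,Ali Faraji,Ghadeer Abuoda,Manos Papagelis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: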
Abstract:Trajectory prediction aims to estimate an entity's future path using its current position and historical movement data, benefiting fields like autonomous navigation, robotics, and human movement analytics. Deep learning approaches have become key in this area, utilizing large-scale trajectory datasets to model movement patterns, but face challenges in managing complex spatial dependencies and adapting to dynamic environments. To address these challenges, we introduce TrajLearn, a novel model for trajectory prediction that leverages generative modeling of higher-order mobility flows based on hexagonal spatial representation. TrajLearn predicts the next k steps by integrating a customized beam search for exploring multiple potential paths while maintaining spatial continuity. We conducted a rigorous evaluation of TrajLearn, benchmarking it against leading state-of-the-art approaches and meaningful baselines. The results indicate that TrajLearn achieves significant performance gains, with improvements of up to ~40% across multiple real-world trajectory datasets. In addition, we evaluated different prediction horizons (i.e., various values of k), conducted resolution sensitivity analysis, and performed ablation studies to assess the impact of key model components. Furthermore, we developed a novel algorithm to generate mixed-resolution maps by hierarchically subdividing hexagonal regions into finer segments within a specified observation area. This approach supports selective detailing, applying finer resolution to areas of interest or high activity (e.g., urban centers) while using coarser resolution for less significant regions (e.g., rural areas), effectively reducing data storage requirements and computational overhead. We promote reproducibility and adaptability by offering complete code, data, and detailed documentation with flexible configuration options for various applications.
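The idea of a beam search constrained by spatial continuity can be sketched generically. This is a textbook beam search, not TrajLearn's customized version: `next_probs` stands in for a learned next-cell model and `neighbors` for hexagonal-grid adjacency, and both callables are hypothetical placeholders supplied by the caller.

```python
import math

def beam_search(start, next_probs, neighbors, k, beam_width=3):
    """Return up to `beam_width` most likely k-step continuations of `start`.

    next_probs(path) -> dict mapping candidate cell -> probability.
    neighbors(cell)  -> set of cells adjacent to `cell` (the continuity constraint).
    Each result is a (log_probability, path) tuple, best first.
    """
    beams = [(0.0, [start])]
    for _ in range(k):
        candidates = []
        for logp, path in beams:
            allowed = neighbors(path[-1])
            for cell, p in next_probs(path).items():
                # Discard jumps to non-adjacent cells, even if the model likes them.
                if cell in allowed and p > 0:
                    candidates.append((logp + math.log(p), path + [cell]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

Keeping several beams matters because the best first step does not always start the best k-step path, which a greedy decoder would miss.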
[CV-139] Minimalist Vision with Freeform Pixels ECCV2024
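[Quick Read]: This paper addresses the high computational cost and privacy risks that come from traditional cameras using a large number of pixels for vision tasks. The key idea is a minimalist vision system that uses the smallest number of freeform pixels needed to solve a task. Each freeform pixel is implemented with an optical mask and a photodetector, and its shape is obtained by training a neural network so as to maximize information content. With very few pixels, a minimalist vision system can match the performance of a traditional camera and offers two advantages: it naturally preserves the privacy of individuals in the scene, because the captured information is insufficient to extract visual details; and because so few measurements are made, the system can be fully self-powered, without an external power supply or battery.
Link: https://arxiv.org/abs/2501.00142
Authors: Jeremy Klotz,Shree K. Nayar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Project page: this https URL , published at ECCV 2024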
Abstract:A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While traditional cameras use a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. We show that the hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera’s freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). The performance demonstrated by these systems is on par with a traditional camera with orders of magnitude more pixels. Minimalist vision has two major advantages. First, it naturally tends to preserve the privacy of individuals in the scene since the captured information is inadequate for extracting visual details. Second, since the number of measurements made by a minimalist camera is very small, we show that it can be fully self-powered, i.e., function without an external power supply or a battery.
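The statement that the camera hardware acts as the first layer of a neural network can be made concrete with a short sketch: each freeform pixel reading is the scene irradiance weighted by that pixel's mask and summed. The [0, 1] transmittance range and the function name are assumptions made for illustration, not specifications from the paper.

```python
import numpy as np

def freeform_measurements(scene, masks):
    """Simulate photodetector readings for a set of freeform pixels.

    scene: (H, W) irradiance image, masks: (P, H, W) optical mask transmittances.
    Returns a (P,) vector, one reading per freeform pixel.
    """
    masks = np.clip(masks, 0.0, 1.0)  # assumed: a passive mask can only attenuate light
    # Each reading is a dot product between one mask and the scene, which is
    # exactly a linear layer whose weights are the mask shapes.
    return np.tensordot(masks, scene, axes=([1, 2], [0, 1]))
```

In training, the masks would be the learnable weights and the remaining inference layers would consume the returned vector; only those later layers run in software.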
[CV-140] Detection-Fusion for Knowledge Graph Extraction from Videos
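【Quick Read】: This paper addresses the challenging problem of extracting semantic content from video inputs in video understanding. Existing systems typically rely on language models to describe videos in natural-language sentences, which has several major shortcomings: over-reliance on the language model, so that outputs reflect statistical regularities of natural-language text rather than the visual content of the video; and natural-language annotations that cannot be readily processed by a computer, are hard to evaluate with performance metrics, and are not easily translated into other natural languages. The paper proposes annotating videos with knowledge graphs to avoid these problems. The key to the solution is a deep learning model that first predicts pairs of individuals in the video and then the relations between them, with an extension that incorporates background knowledge into knowledge graph construction.
Link: https://arxiv.org/abs/2501.00136
Authors: Taniya Das,Louis Mahon,Thomas Lukasiewicz
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, To be submitted to a conference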
Abstract:One of the challenging tasks in the field of video understanding is extracting semantic content from video inputs. Most existing systems use language models to describe videos in natural language sentences, but this has several major shortcomings. Such systems can rely too heavily on the language model component and base their output on statistical regularities in natural language text rather than on the visual contents of the video. Additionally, natural language annotations cannot be readily processed by a computer, are difficult to evaluate with performance metrics and cannot be easily translated into a different natural language. In this paper, we propose a method to annotate videos with knowledge graphs, and so avoid these problems. Specifically, we propose a deep-learning-based model for this task that first predicts pairs of individuals and then the relations between them. Additionally, we propose an extension of our model for the inclusion of background knowledge in the construction of knowledge graphs.
[CV-141] PQD: Post-training Quantization for Efficient Diffusion Models WACV
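【Quick Read】: This paper aims to address the heavy computational requirements and slow generation speed of diffusion models (DMs) in image synthesis, which limit their widespread adoption. It proposes a novel post-training quantization method for diffusion models (PQD), a framework that improves the inference process through time-aware optimization. Specifically, by selecting representative samples and performing time-aware calibration, PQD quantizes full-precision diffusion models directly into 8-bit or 4-bit models without retraining while maintaining comparable performance. Experiments show that the method causes only a small change in FID (Fréchet Inception Distance) on ImageNet for unconditional image generation, and it is applied to 512x512 text-guided image generation for the first time.
Link: https://arxiv.org/abs/2501.00124
Authors: Jiaojiao Ye,Zhen Wang,Linnan Jiang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, uses this http URL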
Abstract:Diffusion models (DMs) have demonstrated remarkable achievements in synthesizing images of high fidelity and diversity. However, the extensive computational requirements and slow generative speed of diffusion models have limited their widespread adoption. In this paper, we propose a novel post-training quantization for diffusion models (PQD), which is a time-aware optimization framework for diffusion models based on post-training quantization. The proposed framework optimizes the inference process by selecting representative samples and conducting time-aware calibration. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, achieving a few FID change on ImageNet for unconditional image generation. Our approach demonstrates compatibility and can also be applied to 512x512 text-guided image generation for the first time.
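For readers unfamiliar with post-training quantization, the basic step of mapping full-precision values to 8-bit or 4-bit codes can be sketched as below. This is generic uniform affine quantization with a min/max calibration range, given as an assumption for illustration; it does not reproduce PQD's sample selection or time-aware calibration, and the function name and example values are hypothetical.

```python
def quantize_dequantize(values, num_bits):
    """Uniform affine quantization followed by dequantization.

    Returns (reconstructed values, integer codes, step size).
    """
    levels = 2 ** num_bits - 1             # 255 for 8-bit, 15 for 4-bit
    lo, hi = min(values), max(values)      # calibration range from the samples
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    codes = [min(max(c, 0), levels) for c in codes]   # clamp to the valid range
    reconstructed = [lo + c * scale for c in codes]
    return reconstructed, codes, scale


weights = [-1.0, -0.37, 0.0, 0.42, 1.0]    # hypothetical full-precision values
recon8, codes8, step8 = quantize_dequantize(weights, num_bits=8)
recon4, codes4, step4 = quantize_dequantize(weights, num_bits=4)
```

The reconstruction error of each value is bounded by half the step size, which is why 4-bit quantization is considerably harder to make lossless than 8-bit and why the choice of calibration data matters.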
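[CV-142] Text-to-Image GAN with Pretrained Representations
【Quick Read】: This paper aims to address the slow inference and high training cost of text-to-image synthesis. The authors propose TIGER, a GAN-based text-to-image model that uses pretrained representations to improve performance. The solution rests on two components: (i) a vision-empowered discriminator that stacks multiple pretrained vision models to absorb their complex scene understanding and domain generalization abilities; and (ii) a high-capacity generator that uses multiple high-capacity fusion blocks (HFBlock) to achieve effective text-image fusion while increasing model capacity. Each HFBlock contains several deep fusion modules and a global fusion module, which play different roles. Experiments show that TIGER performs well on both standard and zero-shot text-to-image synthesis, with notable advantages in inference speed and parameter efficiency.
Link: https://arxiv.org/abs/2501.00116
Authors: Xiaozhou You,Jian Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: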
Abstract:Generating desired images conditioned on given text descriptions has received lots of attention. Recently, diffusion models and autoregressive models have demonstrated their outstanding expressivity and gradually replaced GAN as the favored architectures for text-to-image synthesis. However, they still face some obstacles: slow inference speed and expensive training costs. To achieve more powerful and faster text-to-image synthesis under complex scenes, we propose TIGER, a text-to-image GAN with pretrained representations. To be specific, we propose a vision-empowered discriminator and a high-capacity generator. (i) The vision-empowered discriminator absorbs the complex scene understanding ability and the domain generalization ability from pretrained vision models to enhance model performance. Unlike previous works, we explore stacking multiple pretrained models in our discriminator to collect multiple different representations. (ii) The high-capacity generator aims to achieve effective text-image fusion while increasing the model capacity. The high-capacity generator consists of multiple novel high-capacity fusion blocks (HFBlock). And the HFBlock contains several deep fusion modules and a global fusion module, which play different roles to benefit our model. Extensive experiments demonstrate the outstanding performance of our proposed TIGER both on standard and zero-shot text-to-image synthesis tasks. On the standard text-to-image synthesis task, TIGER achieves state-of-the-art performance on two challenging datasets, which obtain a new FID 5.48 (COCO) and 9.38 (CUB). On the zero-shot text-to-image synthesis task, we achieve comparable performance with fewer model parameters, smaller training data size and faster inference speed. Additionally, more experiments and analyses are conducted in the Supplementary Material.
[CV-143] LTX-Video: Realtime Video Latent Diffusion
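【Quick Read】: This paper addresses the efficiency and quality problems that arise when existing video generation methods treat the video variational autoencoder (Video-VAE) and the denoising transformer as independent components. Handling the two as separate modules leads to low computational efficiency and loss of detail when generating high-resolution video.
The key to the solution is LTX-Video, a transformer-based latent diffusion model that seamlessly integrates the Video-VAE and the denoising transformer and optimizes their interaction for efficiency and quality. Specifically, LTX-Video uses a carefully designed Video-VAE that achieves a high compression ratio of 1:192; relocating the patchifying operation from the transformer's input to the VAE's input yields spatiotemporal downscaling of 32 x 32 x 8 pixels per token. This highly compressed latent space lets the transformer perform full spatiotemporal self-attention efficiently and generate high-resolution video with temporal consistency. In addition, the VAE decoder handles both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space and avoiding the cost of a separate upsampling module. The approach preserves fine detail while substantially increasing generation speed, and supports both text-to-video and image-to-video use cases.
Link: https://arxiv.org/abs/2501.00103
Authors: Yoav HaCohen,Nisan Chiprut,Benny Brazowski,Daniel Shalem,Dudu Moshe,Eitan Richardson,Eran Levin,Guy Shiran,Nir Zabari,Ori Gordon,Poriya Panet,Sapir Weissbuch,Victor Kulikov,Yaki Bitterman,Zeev Melumian,Ofir Bibi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: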
Abstract:We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer’s input to the VAE’s input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
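The figures quoted in the abstract can be related with some back-of-the-envelope arithmetic. The sketch below assumes RGB input (3 values per pixel) and ignores any special handling of the first frame; the implied latent width and the token count are derived under those assumptions and are not stated in the abstract.

```python
# Pixels covered by one latent token: 32 x 32 spatial, 8 frames temporal.
pixels_per_token = 32 * 32 * 8

# Assumption: RGB input, so 3 scalar values per pixel.
values_per_token = pixels_per_token * 3

# A 1:192 compression ratio then implies this many values per latent token.
implied_latent_channels = values_per_token / 192

# 5 seconds at 24 fps, at 768x512 resolution.
frames = 5 * 24
tokens = (768 // 32) * (512 // 32) * (frames // 8)

# 5 seconds of video generated in 2 seconds.
realtime_factor = 5 / 2
```

Under these assumptions a 5-second clip corresponds to only a few thousand tokens, which is what makes full spatiotemporal self-attention affordable.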
[CV-144] VisTabNet: Adapting Vision Transformers for Tabular Data
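【Quick Read】: This paper addresses the difficulty of applying deep learning to tabular data, in particular the difficulty of transferring large-scale pre-trained models to small datasets. Tabular data is widely used in biological, industrial, and financial applications, yet deep learning has not delivered gains there comparable to those in natural language processing and computer vision. The paper proposes VisTabNet, a cross-modal transfer learning method that projects tabular inputs into patch embeddings accepted by a Vision Transformer (ViT), so that a pre-trained Transformer Encoder can be applied to tabular data directly. The key benefit is that it avoids designing a dedicated architecture for tabular data and reduces the computational cost of training from scratch. Experiments on multiple small tabular datasets show that VisTabNet outperforms both traditional ensemble methods and recent deep learning models, demonstrating that pre-trained image models can be transferred to tabular problems.
Link: https://arxiv.org/abs/2501.00057
Authors: Witold Wydmański,Ulvi Movsum-zada,Jacek Tabor,Marek Śmieja
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: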
Abstract:Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet – a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet’s superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning.
[CV-145] ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
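【Quick Read】: This paper aims to address the poor performance of existing explainable AI methods in medical imaging. Frameworks based on generative adversarial networks (GANs) work well for natural image generation but perform suboptimally on medical images because of their unique characteristics and complexity. The paper makes three contributions: first, ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations linking medical image features to classifier decisions; second, an encoder module enhanced with feature pyramids, which enables multiscale feedback to refine the latent space and improve the quality of generated explanations; and third, comprehensive experiments on both the generator and the classifier that demonstrate the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings.
Link: https://arxiv.org/abs/2501.01392
Authors: Xuyin Qi,Zeyu Zhang,Aaron Berliano Handoko,Huazhan Zheng,Mingxi Chen,Ta Duc Huy,Vu Minh Hieu Phan,Lei Zhang,Linqi Cheng,Shiyu Jiang,Zhiwei Zhang,Zhibin Liao,Yang Zhao,Minh-Son To
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: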
Abstract:Prostate cancer, a growing global health concern, necessitates precise diagnostic tools, with Magnetic Resonance Imaging (MRI) offering high-resolution soft tissue imaging that significantly enhances diagnostic accuracy. Recent advancements in explainable AI and representation learning have significantly improved prostate cancer diagnosis by enabling automated and precise lesion classification. However, existing explainable AI methods, particularly those based on frameworks like generative adversarial networks (GANs), are predominantly developed for natural image generation, and their application to medical imaging often leads to suboptimal performance due to the unique characteristics and complexity of medical image. To address these challenges, our paper introduces three key contributions. First, we propose ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations, effectively linking medical image features to classifier decisions. Second, we enhance the encoder module by incorporating feature pyramids, which enables multiscale feedback to refine the latent space and improves the quality of generated explanations. Additionally, we conduct comprehensive experiments on both the generator and classifier, demonstrating the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings. Code will be released at this https URL
[CV-146] ScarNet: A Novel Foundation Model for Automated Myocardial Scar Quantification from LGE in Cardiac MRI
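【Quick Read】: This paper addresses the challenges that labor-intensive manual segmentation and inter-observer variability pose for left ventricular (LV) scar quantification in Late Gadolinium Enhancement (LGE) imaging. LGE imaging is the gold standard for assessing myocardial fibrosis and scarring, and the extent of LV scar is an important predictor of major adverse cardiac events (MACE).
The key to the solution is ScarNet, a model that combines a transformer-based encoder from the Medical Segment Anything Model (MedSAM) with a convolution-based U-Net decoder, enhanced by tailored attention blocks. ScarNet was trained on expert segmentations from 552 ischemic cardiomyopathy patients and tested on an independent set of 184 patients. The results show that ScarNet performs strongly on scar segmentation, significantly outperforming MedSAM and nnU-Net, and remains robust across different image qualities and scar patterns.
Link: https://arxiv.org/abs/2501.01372
Authors: Neda Tavakoli,Amir Ali Rahsepar,Brandon C. Benefield,Daming Shen,Santiago López-Tapia,Florian Schiffers,Jeffrey J. Goldberger,Christine M. Albert,Edwin Wu,Aggelos K. Katsaggelos,Daniel C. Lee,Daniel Kim
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 8 figures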
Abstract:Background: Late Gadolinium Enhancement (LGE) imaging is the gold standard for assessing myocardial fibrosis and scarring, with left ventricular (LV) LGE extent predicting major adverse cardiac events (MACE). Despite its importance, routine LGE-based LV scar quantification is hindered by labor-intensive manual segmentation and inter-observer variability. Methods: We propose ScarNet, a hybrid model combining a transformer-based encoder from the Medical Segment Anything Model (MedSAM) with a convolution-based U-Net decoder, enhanced by tailored attention blocks. ScarNet was trained on 552 ischemic cardiomyopathy patients with expert segmentations of myocardial and scar boundaries and tested on 184 separate patients. Results: ScarNet achieved robust scar segmentation in 184 test patients, yielding a median Dice score of 0.912 (IQR: 0.863–0.944), significantly outperforming MedSAM (median Dice = 0.046, IQR: 0.043–0.047) and nnU-Net (median Dice = 0.638, IQR: 0.604–0.661). ScarNet demonstrated lower bias (-0.63%) and coefficient of variation (4.3%) compared to MedSAM (bias: -13.31%, CoV: 130.3%) and nnU-Net (bias: -2.46%, CoV: 20.3%). In Monte Carlo simulations with noise perturbations, ScarNet achieved significantly higher scar Dice (0.892 ± 0.053, CoV = 5.9%) than MedSAM (0.048 ± 0.112, CoV = 233.3%) and nnU-Net (0.615 ± 0.537, CoV = 28.7%). Conclusion: ScarNet outperformed MedSAM and nnU-Net in accurately segmenting myocardial and scar boundaries in LGE images. The model exhibited robust performance across diverse image qualities and scar patterns.
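The Dice score used throughout these results measures the overlap between a predicted mask and a reference mask. A minimal sketch on flattened binary masks follows; the function name, the convention for two empty masks, and the example masks are illustrative assumptions rather than the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: 1 marks a scar pixel.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice_score(predicted, reference)
```

A score of 1 means perfect overlap and 0 means none, so the reported median of 0.912 indicates that predicted and expert scar regions largely coincide.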
[CV-147] Enhancing Early Diabetic Retinopathy Detection through Synthetic DR1 Image Generation: A StyleGAN3 Approach
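【Quick Read】: This paper addresses the scarcity of high-quality fundus images for early detection of Diabetic Retinopathy (DR), particularly at the DR1 stage, which limits the performance of supervised classifiers. It proposes using StyleGAN3 to generate synthetic DR1 images with high fidelity and diversity in order to alleviate the data shortage and improve classifier performance. The key to the solution is generating synthetic images that exhibit microaneurysms and validating their quality and realism with quantitative metrics (Frechet Inception Distance, FID; Kernel Inception Distance, KID; and equivariance to translation and rotation, EQ-T and EQ-R) and qualitative assessments such as a Human Turing test. The model reached a final FID of 17.29, better than the mean FID of 21.18 obtained by bootstrap resampling, suggesting that the synthetic images are realistic enough to augment training datasets and improve early DR detection.
Link: https://arxiv.org/abs/2501.00954
Authors: Sagarnil Das,Pradeep Walia
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 11 figures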
Abstract:Diabetic Retinopathy (DR) is a leading cause of preventable blindness. Early detection at the DR1 stage is critical but is hindered by a scarcity of high-quality fundus images. This study uses StyleGAN3 to generate synthetic DR1 images characterized by microaneurysms with high fidelity and diversity. The aim is to address data scarcity and enhance the performance of supervised classifiers. A dataset of 2,602 DR1 images was used to train the model, followed by a comprehensive evaluation using quantitative metrics, including Frechet Inception Distance (FID), Kernel Inception Distance (KID), and Equivariance with respect to translation (EQ-T) and rotation (EQ-R). Qualitative assessments included Human Turing tests, where trained ophthalmologists evaluated the realism of synthetic images. Spectral analysis further validated image quality. The model achieved a final FID score of 17.29, outperforming the mean FID of 21.18 (95 percent confidence interval - 20.83 to 21.56) derived from bootstrap resampling. Human Turing tests demonstrated the model’s ability to produce highly realistic images, though minor artifacts near the borders were noted. These findings suggest that StyleGAN3-generated synthetic DR1 images hold significant promise for augmenting training datasets, enabling more accurate early detection of Diabetic Retinopathy. This methodology highlights the potential of synthetic data in advancing medical imaging and AI-driven diagnostics.
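For reference, the standard definition of the Frechet Inception Distance reported above is given below. The symbols are notation introduced here: \mu_r, \Sigma_r are the mean and covariance of Inception features of real images, and \mu_g, \Sigma_g those of generated images. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```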
[CV-148] A Novel Approach using CapsNet and Deep Belief Network for Detection and Identification of Oral Leukopenia
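【Quick Read】: This paper aims to address early detection of oral cancer, which is especially important in low- and middle-income countries where incidence and mortality are high. It proposes automating the detection of possibly malignant and malignant lesions in the oral cavity to reduce cost and improve diagnostic efficiency. The key to the solution is building a repository of meticulously annotated oral lesion images contributed by clinical experts worldwide and combining a Deep Belief Network with a capsule network (CAPSNET) to develop automated systems. These systems extract intricate lesion patterns and identify and classify lesions through image classification and object detection. Image classification attained F1 scores of 94.23% for detecting images with lesions and 93.46% for identifying images requiring referral, and object detection attained an F1 score of 89.34% for identifying lesions requiring referral. These preliminary results suggest that deep learning has considerable potential for this problem.
Link: https://arxiv.org/abs/2501.00876
Authors: Hirthik Mathesh GV,Kavin Chakravarthy M,Sentil Pandi S
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE International Conference on Advancement in Communication and Computing Technology (INOACC), will be held in Sai Vidya Institute of Technology, Bengaluru, Karnataka, India. (Preprint)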
Abstract:Oral cancer constitutes a significant global health concern, resulting in 277,484 fatalities in 2023, with the highest prevalence observed in low- and middle-income nations. Facilitating automation in the detection of possibly malignant and malignant lesions in the oral cavity could result in cost-effective and early disease diagnosis. Establishing an extensive repository of meticulously annotated oral lesions is essential. In this research, photos are being collected from global clinical experts, who have been equipped with an annotation tool to generate comprehensive labelling. This research presents a novel approach for integrating bounding box annotations from various doctors. Additionally, a Deep Belief Network combined with CAPSNET is employed to develop automated systems that extract intricate patterns to address this challenging problem. This study evaluated two deep learning-based computer vision methodologies for the automated detection and classification of oral lesions to facilitate the early detection of oral cancer: image classification utilizing CAPSNET. Image classification attained an F1 score of 94.23% for detecting photos with lesions and 93.46% for identifying images necessitating referral. Object detection attained an F1 score of 89.34% for identifying lesions for referral. Subsequent performances are documented about classification based on the sort of referral decision. Our preliminary findings indicate that deep learning possesses the capability to address this complex problem.
[CV-149] HCMA-UNet: A Hybrid CNN-Mamba UNet with Inter-Slice Self-Attention for Efficient Breast Cancer Segmentation
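【Quick Read】: This paper addresses the challenges of breast cancer lesion segmentation in dynamic contrast-enhanced MRI (DCE-MRI), which stem mainly from heterogeneous tumor morphology and indistinct boundaries. It proposes a novel hybrid segmentation network, HCMA-UNet. The key innovations are a lightweight convolutional neural network (CNN) backbone and a Multi-view Inter-Slice Self-Attention Mamba (MISM) module. The MISM module integrates a Visual State Space Block (VSSB) with an Inter-Slice Self-Attention (ISSA) mechanism and uses an Asymmetric Split Channel (ASC) strategy to reduce parameters while achieving efficient tri-directional feature extraction. The paper also proposes a Feature-guided Region-aware loss function (FRLoss) to improve segmentation accuracy. Experiments show that the method achieves state-of-the-art performance while remaining computationally efficient, and that FRLoss generalizes well across architectures.
Link: https://arxiv.org/abs/2501.00751
Authors: Haoxuan Li,Wei song,Peiwu Qin,Xi Yuan,Zhenglin Chen
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: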
Abstract:Breast cancer lesion segmentation in DCE-MRI remains challenging due to heterogeneous tumor morphology and indistinct boundaries. To address these challenges, this study proposes a novel hybrid segmentation network, HCMA-UNet, for lesion segmentation of breast cancer. Our network consists of a lightweight CNN backbone and a Multi-view Inter-Slice Self-Attention Mamba (MISM) module. The MISM module integrates Visual State Space Block (VSSB) and Inter-Slice Self-Attention (ISSA) mechanism, effectively reducing parameters through Asymmetric Split Channel (ASC) strategy to achieve efficient tri-directional feature extraction. Our lightweight model achieves superior performance with 2.87M parameters and 126.44 GFLOPs. A Feature-guided Region-aware loss function (FRLoss) is proposed to enhance segmentation accuracy. Extensive experiments on one private and two public DCE-MRI breast cancer datasets demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. FRLoss also exhibits good cross-architecture generalization capabilities. The source code and dataset is available on this link.
[CV-150] Lightweight G-YOLOv11: Advancing Efficient Fracture Detection in Pediatric Wrist X-rays
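【Quick Read】: This paper aims to address the reliance of current computer-aided diagnosis (CAD) systems for fracture detection in X-ray images on large, resource-intensive detectors, which limits their practicality in clinical settings. The authors propose a lightweight CAD system based on the YOLO detector, named ghost convolution-based YOLOv11 (G-YOLOv11). Its key element is the ghost convolution operation, which generates the same number of feature maps as a traditional convolution at lower computational cost, reducing the detector's resource requirements. Evaluated on the GRAZPEDWRI-DX dataset, G-YOLOv11 achieved an mAP@0.5 of 0.535 with an inference time of 2.4 ms on an NVIDIA A10 GPU, outperforming existing detectors in terms of efficiency.
Link: https://arxiv.org/abs/2501.00647
Authors: Abdesselam Ferdi
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: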
Abstract:Computer-aided diagnosis (CAD) systems have greatly improved the interpretation of medical images by radiologists and surgeons. However, current CAD systems for fracture detection in X-ray images primarily rely on large, resource-intensive detectors, which limits their practicality in clinical settings. To address this limitation, we propose a novel lightweight CAD system based on the YOLO detector for fracture detection. This system, named ghost convolution-based YOLOv11 (G-YOLOv11), builds on the latest version of the YOLO detector family and incorporates the ghost convolution operation for feature extraction. The ghost convolution operation generates the same number of feature maps as traditional convolution but requires fewer linear operations, thereby reducing the detector’s computational resource requirements. We evaluated the performance of the proposed G-YOLOv11 detector on the GRAZPEDWRI-DX dataset, achieving an mAP@0.5 of 0.535 with an inference time of 2.4 ms on an NVIDIA A10 GPU. Compared to the standard YOLOv11l, G-YOLOv11l achieved reductions of 13.6% in mAP@0.5 and 68.7% in size. These results establish a new state-of-the-art benchmark in terms of efficiency, outperforming existing detectors. Code and models are available at this https URL.
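To see why a ghost convolution is cheaper, the parameter counts can be compared with a short calculation. The sketch follows the general GhostNet-style formulation (a smaller primary convolution plus cheap depthwise operations that generate the remaining feature maps); the layer sizes, the `ratio` of 2, and the 3x3 cheap kernel are illustrative assumptions, not the configuration used in G-YOLOv11, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Weights in a ghost module producing c_out feature maps (bias ignored)."""
    intrinsic = c_out // ratio                  # maps from the primary convolution
    primary = c_in * intrinsic * k * k
    # Each intrinsic map spawns (ratio - 1) extra maps via a depthwise kernel.
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap


standard = standard_conv_params(c_in=64, c_out=128, k=3)
ghost = ghost_conv_params(c_in=64, c_out=128, k=3)
reduction = standard / ghost
```

With these assumed sizes the ghost module needs roughly half the weights of the standard layer for the same number of output maps.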
[CV-151] Advanced Lung Nodule Segmentation and Classification for Early Detection of Lung Cancer using SAM and Transfer Learning
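【Quick Read】: This paper addresses a key problem in early lung cancer diagnosis: precisely segmenting and classifying lung nodules in CT or MRI images. Lung cancer's high mortality is largely due to late-stage diagnosis, so accurate nodule segmentation is crucial for early detection. The paper proposes combining the Segment Anything Model (SAM) with transfer learning. The method uses Bounding Box prompts and a vision transformer model to improve segmentation, with the aim of improving computer-aided detection (CAD) systems in medical imaging. The reported results are a Dice Similarity Coefficient (DSC) of 97.08%, an Intersection over Union (IoU) of 95.6%, and a classification accuracy of 96.71%, which the authors describe as noteworthy compared with existing techniques.
Link: https://arxiv.org/abs/2501.00586
Authors: Asha V,Bhavanishankar K
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: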
Abstract:Lung cancer is an extremely lethal disease primarily due to its late-stage diagnosis and significant mortality rate, making it the major cause of cancer-related deaths globally. Machine Learning (ML) and Convolutional Neural Network (CNN) based Deep Learning (DL) techniques are primarily used for precise segmentation and classification of cancerous nodules in CT (Computed Tomography) or MRI images. This study introduces an innovative approach to lung nodule segmentation by utilizing the Segment Anything Model (SAM) combined with transfer learning techniques. Precise segmentation of lung nodules is crucial for the early detection of lung cancer. The proposed method leverages Bounding Box prompts and a vision transformer model to enhance segmentation performance, achieving high accuracy, Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) metrics. The integration of SAM and Transfer Learning significantly improves Computer-Aided Detection (CAD) systems in medical imaging, particularly for lung cancer diagnosis. The findings demonstrate the proposed model's effectiveness in precisely segmenting lung nodules from CT scans, underscoring its potential to advance early detection and improve patient care outcomes in lung cancer diagnosis. The results show that the SAM model with transfer learning achieves a DSC of 97.08% and an IoU of 95.6% for segmentation and an accuracy of 96.71% for classification, indicating that its performance is noteworthy compared to existing techniques.
[CV-152] H-Net: A Multitask Architecture for Simultaneous 3D Force Estimation and Stereo Semantic Segmentation in Intracardiac Catheters
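【Quick Read】: This paper addresses how to segment a catheter from two different viewing angles while simultaneously estimating the forces applied to it in 3D during catheterization procedures. Existing work usually treats force estimation and catheter segmentation separately and lacks a comprehensive architecture that performs both. The paper proposes a lightweight, multi-input, multi-output encoder-decoder architecture. It processes two simultaneous X-ray images from a biplane fluoroscopy system, uses two parallel sub-networks with shared parameters to output the corresponding segmentation maps, and leverages stereo vision to estimate the forces at the catheter tip in 3D. The key point is that a single end-to-end network performs both catheter segmentation and force estimation at low computational and production cost, with state-of-the-art performance reported for both tasks.
Link: https://arxiv.org/abs/2501.00514
Authors: Pedram Fekri,Mehrdad Zadeh,Javad Dargahi
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: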
Abstract:The success rate of catheterization procedures is closely linked to the sensory data provided to the surgeon. Vision-based deep learning models can deliver both tactile and visual information in a sensor-free manner, while also being cost-effective to produce. Given the complexity of these models for devices with limited computational resources, research has focused on force estimation and catheter segmentation separately. However, there is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles and estimating the applied forces in 3D. To bridge this gap, this work proposes a novel, lightweight, multi-input, multi-output encoder-decoder-based architecture. It is designed to segment the catheter from two points of view and concurrently measure the applied forces in the x, y, and z directions. This network processes two simultaneous X-Ray images, intended to be fed by a biplane fluoroscopy system, showing a catheter’s deflection from different angles. It uses two parallel sub-networks with shared parameters to output two segmentation maps corresponding to the inputs. Additionally, it leverages stereo vision to estimate the applied forces at the catheter’s tip in 3D. The architecture features two input channels, two classification heads for segmentation, and a regression head for force estimation through a single end-to-end architecture. The output of all heads was assessed and compared with the literature, demonstrating state-of-the-art performance in both segmentation and force estimation. To the best of the authors’ knowledge, this is the first time such a model has been proposed
[CV-153] STARFormer: A Novel Spatio-Temporal Aggregation Reorganization Transformer of FMRI for Brain Disorder Diagnosis
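【Quick Read】: This paper addresses the fact that existing methods for classifying brain disorders such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD) from functional magnetic resonance imaging (fMRI) often fail to integrate the spatial and temporal dependencies of blood oxygen level-dependent (BOLD) signals, which can lead to inaccurate or imprecise classification. It proposes the Spatio-Temporal Aggregation Reorganization Transformer (STARFormer), which captures both spatial and temporal features of BOLD signals through three modules: a region of interest (ROI) spatial structure analysis module that uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity; a temporal feature reorganization module that segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention; and a spatio-temporal feature fusion module that uses a parallel transformer architecture with dedicated temporal and spatial branches. Evaluated on two public datasets, STARFormer achieves state-of-the-art performance across multiple metrics, offering a more accurate and reliable tool for brain disorder diagnosis and biomedical research.
Link: https://arxiv.org/abs/2501.00378
Authors: Wenhao Dong,Yueyang Li,Weiming Zeng,Lei Chen,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: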
Abstract:Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), often overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The codes will be available at: this https URL.
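The ROI reorganization step relies on eigenvector centrality, which can be approximated by power iteration on a connectivity matrix. The sketch below is a generic illustration under stated assumptions: the 3x3 symmetric matrix is made up, the function names are hypothetical, and sorting ROIs by descending centrality is only one plausible reading of "reorganize"; the paper's actual procedure may differ.

```python
def eigenvector_centrality(connectivity, iterations=200):
    """Approximate the dominant eigenvector of a non-negative matrix."""
    n = len(connectivity)
    scores = [1.0] * n
    for _ in range(iterations):
        updated = [sum(connectivity[i][j] * scores[j] for j in range(n))
                   for i in range(n)]
        norm = sum(v * v for v in updated) ** 0.5
        scores = [v / norm for v in updated]
    return scores


def reorder_by_centrality(connectivity):
    """Return ROI indices sorted from most to least central."""
    scores = eigenvector_centrality(connectivity)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical connectivity strengths between three ROIs; ROI 2 is the hub.
connectivity = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.9],
    [0.8, 0.9, 0.0],
]
roi_order = reorder_by_centrality(connectivity)
```

An ROI scores highly when it is strongly connected to other high-scoring ROIs, so the ordering places hub regions first.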
Artificial Intelligence
[AI-0] A Unified Hyperparameter Optimization Pipeline for Transformer-Based Time Series Forecasting Models
Link: https://arxiv.org/abs/2501.01394
Authors: Jingjing Xu,Caesar Wu,Yuan-Fang Li,Grégoire Danoy,Pascal Bouvry
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformer-based models for time series forecasting (TSF) have attracted significant attention in recent years due to their effectiveness and versatility. However, these models often require extensive hyperparameter optimization (HPO) to achieve the best possible performance, and a unified pipeline for HPO in transformer-based TSF remains lacking. In this paper, we present one such pipeline and conduct extensive experiments on several state-of-the-art (SOTA) transformer-based TSF models. These experiments are conducted on standard benchmark datasets to evaluate and compare the performance of different models, generating practical insights and examples. Our pipeline is generalizable beyond transformer-based architectures and can be applied to other SOTA models, such as Mamba and TimeMixer, as demonstrated in our experiments. The goal of this work is to provide valuable guidance to both industry practitioners and academic researchers in efficiently identifying optimal hyperparameters suited to their specific domain applications. The code and complete experimental results are available on GitHub.
[AI-1] Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
Link: https://arxiv.org/abs/2501.01367
Authors: Nathaniel Dennler, Stefanos Nikolaidis, Maja Matarić
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to HRI 2025
Abstract:People have a variety of preferences for how robots behave. To understand and reason about these preferences, robots aim to learn a reward function that describes how aligned robot behaviors are with a user’s preferences. Good representations of a robot’s behavior can significantly reduce the time and effort required for a user to teach the robot their preferences. Specifying these representations – what “features” of the robot’s behavior matter to users – remains a difficult problem: features learned from raw data lack semantic meaning, and features learned from user data require users to engage in tedious labeling processes. Our key insight is that users tasked with customizing a robot are intrinsically motivated to produce labels through exploratory search; they explore behaviors that they find interesting and ignore behaviors that are irrelevant. To harness this novel data source of exploratory actions, we propose contrastive learning from exploratory actions (CLEA) to learn trajectory features that are aligned with features that users care about. We learned CLEA features from exploratory actions users performed in an open-ended signal design activity (N=25) with a Kuri robot, and evaluated CLEA features through a second user study with a different set of users (N=42). CLEA features outperformed self-supervised features when eliciting user preferences over four metrics: completeness, simplicity, minimality, and explainability.
[AI-2] Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark
Link: https://arxiv.org/abs/2501.01349
Authors: Liang He, Yougang Chu, Zhen Wu, Jianbing Zhang, Xinyu Dai, Jiajun Chen
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. However, biases within datasets can lead models to learn shortcut patterns, resulting in inaccurate assessments and hindering real-world applicability. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. DREB utilizes Bias Evaluator and PPL Evaluator to ensure low bias and high naturalness, providing a reliable and accurate assessment of model generalization in entity bias scenarios. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques. MixDebias effectively improves model performance on DREB while maintaining performance on the original dataset. Extensive experiments demonstrate the effectiveness and robustness of MixDebias compared to existing methods, highlighting its potential for improving the generalization ability of relation extraction models. We will release DREB and MixDebias publicly.
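The entity-replacement idea behind DREB can be illustrated with a rough sketch. It is an assumption-laden toy, not the benchmark's actual pipeline: the example format, the `entity_pool`, and the function `replace_entities` are all hypothetical, and the Bias Evaluator and PPL Evaluator filtering stages described in the abstract are not modeled here.

```python
import random


def replace_entities(example, entity_pool, rng):
    """Swap head and tail mentions for other same-type names; keep context and label."""
    tokens = list(example["tokens"])
    # Replace the later span first so the earlier span's indices stay valid.
    spans = sorted(
        [example["head"], example["tail"]],
        key=lambda span: span["start"],
        reverse=True,
    )
    for span in spans:
        original = tokens[span["start"]:span["end"]]
        candidates = [m for m in entity_pool[span["type"]] if m != original]
        tokens[span["start"]:span["end"]] = rng.choice(candidates)
    return {"tokens": tokens, "relation": example["relation"]}


# Hypothetical example and replacement pool (all names are made up for illustration).
example = {
    "tokens": ["Steve", "Jobs", "founded", "Apple", "in", "1976", "."],
    "head": {"start": 0, "end": 2, "type": "PER"},
    "tail": {"start": 3, "end": 4, "type": "ORG"},
    "relation": "founder_of",
}
entity_pool = {
    "PER": [["Steve", "Jobs"], ["Mara", "Lind"], ["Tomas", "Okoye"]],
    "ORG": [["Apple"], ["Veltrix"], ["Norbay", "Labs"]],
}
debiased = replace_entities(example, entity_pool, random.Random(0))
```

Because the relation label and the surrounding context are untouched, a model that only memorized "Steve Jobs + Apple" can no longer use the mentions as a shortcut and has to read the context instead.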
[AI-3] DeepFilter: An Instrumental Baseline for Accurate and Efficient Process Monitoring
Link: https://arxiv.org/abs/2501.01342
Authors: Hao Wang, Zhichao Chen, Licheng Pan, Xiaoyu Jiang, Yichen Song, Qunshan He, Xinggao Liu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Effective process monitoring is increasingly vital in industrial automation for ensuring operational safety, necessitating both high accuracy and efficiency. Although Transformers have demonstrated success in various fields, their canonical form based on the self-attention mechanism is inadequate for process monitoring due to two primary limitations: (1) the step-wise correlations captured by the self-attention mechanism struggle to reveal discriminative patterns in monitoring logs because individual steps lack semantics, thus compromising accuracy; (2) the quadratic computational complexity of self-attention hampers efficiency. To address these issues, we propose DeepFilter, a Transformer-style framework for process monitoring. The core innovation is an efficient filtering layer that excels at capturing long-term and periodic patterns with reduced complexity. Equipped with the global filtering layer, DeepFilter enhances both accuracy and efficiency, meeting the stringent demands of process monitoring. Experimental results on real-world process monitoring datasets validate DeepFilter’s superiority in terms of accuracy and efficiency compared to existing state-of-the-art models.
[AI-4] CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models
Link: https://arxiv.org/abs/2501.01335
Authors: Johan Wahréus, Ahmed Mohamed Hussain, Panos Papadimitratos
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Numerous studies have investigated methods for jailbreaking Large Language Models (LLMs) to generate harmful content. Typically, these methods are evaluated using datasets of malicious prompts designed to bypass security policies established by LLM providers. However, the generally broad scope and open-ended nature of existing datasets can complicate the assessment of jailbreaking effectiveness, particularly in specific domains, notably cybersecurity. To address this issue, we present and publicly release CySecBench, a comprehensive dataset containing 12662 prompts specifically designed to evaluate jailbreaking techniques in the cybersecurity domain. The dataset is organized into 10 distinct attack-type categories, featuring close-ended prompts to enable a more consistent and accurate assessment of jailbreaking attempts. Furthermore, we detail our methodology for dataset generation and filtration, which can be adapted to create similar datasets in other domains. To demonstrate the utility of CySecBench, we propose and evaluate a jailbreaking approach based on prompt obfuscation. Our experimental results show that this method successfully elicits harmful content from commercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT and 88% with Gemini; in contrast, Claude demonstrated greater resilience with a jailbreaking SR of 17%. Compared to existing benchmark approaches, our method shows superior performance, highlighting the value of domain-specific evaluation datasets for assessing LLM security measures. Moreover, when evaluated using prompts from a widely used dataset (i.e., AdvBench), it achieved an SR of 78.5%, higher than the state-of-the-art methods.
[AI-5] Understanding Difficult-to-learn Examples in Contrastive Learning: A Theoretical Framework for Spectral Contrastive Learning
Link: https://arxiv.org/abs/2501.01317
Authors: Yi-Ge Zhang, Jingyi Cui, Qiran Li, Yisen Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from that of supervised learning. Previous works have shown that difficult-to-learn examples (well-recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that the direct removal of difficult-to-learn examples, although it reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this theoretical framework, we conduct a thorough theoretical analysis revealing that the presence of difficult-to-learn examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that the removal of these examples, as well as techniques such as margin tuning and temperature scaling, can enhance its generalization bounds, thereby improving performance. Empirically, we propose a simple and efficient mechanism for selecting difficult-to-learn examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.
[AI-6] LEO-Split: A Semi-Supervised Split Learning Framework over LEO Satellite Networks
Link: https://arxiv.org/abs/2501.01293
Authors: Zheng Lin, Yuxin Zhang, Zhe Chen, Zihan Fang, Cong Wu, Xianhao Chen, Yue Gao, Jun Luo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments: 13 pages, 15 figures
Abstract:Recently, the increasing deployment of LEO satellite systems has enabled various space analytics (e.g., crop and climate monitoring), which rely heavily on advancements in deep learning (DL). However, the intermittent connectivity between LEO satellites and the ground station (GS) significantly hinders the timely transmission of raw data to the GS for centralized learning, while the scaled-up DL models hamper distributed learning on resource-constrained LEO satellites. Though split learning (SL) can be a potential solution to these problems by partitioning a model and offloading the primary training workload to the GS, the labor-intensive labeling process remains an obstacle, with intermittent connectivity and data heterogeneity being other challenges. In this paper, we propose LEO-Split, a semi-supervised (SS) SL design tailored for satellite networks to combat these challenges. Leveraging SS learning to handle (labeled) data scarcity, we construct an auxiliary model to tackle training failure during the satellite-GS non-contact time. Moreover, we propose a pseudo-labeling algorithm to rectify data imbalances across satellites. Lastly, an adaptive activation interpolation scheme is devised to prevent the overfitting of server-side sub-model training at the GS. Extensive experiments with real-world LEO satellite traces (e.g., Starlink) demonstrate that our LEO-Split framework achieves superior performance compared to state-of-the-art benchmarks.
[AI-7] Change Detection-Based Procedures for Piecewise Stationary MABs: A Modular Approach
Link: https://arxiv.org/abs/2501.01291
Authors: Yu-Han Huang, Argyrios Gerogiannis, Subhonmesh Bose, Venugopal V. Veeravalli
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments: 34 pages, 2 figures, 1 table, submitted to JMLR
Abstract:Conventional Multi-Armed Bandit (MAB) algorithms are designed for stationary environments, where the reward distributions associated with the arms do not change with time. In many applications, however, the environment is more accurately modeled as being nonstationary. In this work, piecewise stationary MAB (PS-MAB) environments are investigated, in which the reward distributions associated with a subset of the arms change at some change-points and remain stationary between change-points. Our focus is on the asymptotic analysis of PS-MABs, for which practical algorithms based on change detection (CD) have been previously proposed. Our goal is to modularize the design and analysis of such CD-based Bandit (CDB) procedures. To this end, we identify the requirements for stationary bandit algorithms and change detectors in a CDB procedure that are needed for the modularization. We assume that the rewards are sub-Gaussian. Under this assumption and a condition on the separation of the change-points, we show that the analysis of CDB procedures can indeed be modularized, so that regret bounds can be obtained in a unified manner for various combinations of change detectors and bandit algorithms. Through this analysis, we develop new modular CDB procedures that are order-optimal. We compare the performance of our modular CDB procedures with various other methods in simulations.
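The modular structure described in this abstract, a stationary bandit algorithm plugged together with a change detector and restarted on detection, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it pairs standard UCB1 with a simple two-half sliding-window mean-shift test and a round-robin forced-exploration schedule, and the window size, threshold, and names (`MeanShiftDetector`, `run_cdb`, `toy_reward`) are hypothetical. The toy rewards are noiseless so that the run is deterministic.

```python
import math


class UCB1:
    """Stationary bandit module: standard UCB1 with a reset hook."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_arms
        self.sums = [0.0] * self.n_arms

    def select(self):
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm  # play every arm once before using the index
        total = sum(self.counts)
        return max(
            range(self.n_arms),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


class MeanShiftDetector:
    """Change detector module: compares the means of the two halves of a window."""

    def __init__(self, window=40, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.buf = []

    def reset(self):
        self.buf = []

    def update(self, reward):
        self.buf.append(reward)
        if len(self.buf) > self.window:
            self.buf.pop(0)
        if len(self.buf) < self.window:
            return False  # not enough samples yet
        half = self.window // 2
        old_mean = sum(self.buf[:half]) / half
        new_mean = sum(self.buf[half:]) / (self.window - half)
        return abs(new_mean - old_mean) > self.threshold


def run_cdb(reward_fn, n_arms, horizon, explore_every=10):
    """Combine the two modules: restart the bandit whenever any detector fires."""
    bandit = UCB1(n_arms)
    detectors = [MeanShiftDetector() for _ in range(n_arms)]
    pulls, restarts = [], []
    for t in range(horizon):
        if t % explore_every == 0:
            arm = (t // explore_every) % n_arms  # forced exploration, round robin
        else:
            arm = bandit.select()
        reward = reward_fn(t, arm)
        pulls.append(arm)
        bandit.update(arm, reward)
        if detectors[arm].update(reward):
            bandit.reset()
            for det in detectors:
                det.reset()
            restarts.append(t)
    return pulls, restarts


def toy_reward(t, arm):
    """Noiseless piecewise-stationary rewards with one change-point at t = 500."""
    means = (0.9, 0.1) if t < 500 else (0.1, 0.9)
    return means[arm]


pulls, restarts = run_cdb(toy_reward, n_arms=2, horizon=1000)
```

Either module can be swapped independently (a different index policy, or a different detector exposing the same `update` and `reset` methods), which is the kind of interchangeability the abstract's modular analysis is concerned with.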
[AI-8] PIMAEX: Multi-Agent Exploration through Peer Incentivization
Link: https://arxiv.org/abs/2501.01266
Authors: Michael Kölle, Johannes Tochtermann, Julian Schönberger, Gerhard Stenzel, Philipp Altmann, Claudia Linnhoff-Popien
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Accepted at ICAART 2025
Abstract:While exploration in single-agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi-agent reinforcement learning. To address this issue, this work proposes a peer-incentivized reward function inspired by previous research on intrinsic curiosity and influence-based rewards. The PIMAEX reward, short for Peer-Incentivized Multi-Agent Exploration, aims to improve exploration in the multi-agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the PIMAEX reward in conjunction with PIMAEX-Communication, a multi-agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the Consume/Explore environment, a partially observable environment with deceptive rewards, specifically designed to challenge the exploration vs. exploitation dilemma and the credit-assignment problem. The results empirically demonstrate that agents using the PIMAEX reward with PIMAEX-Communication outperform those that do not.
[AI-9] Stealthy Backdoor Attack to Real-world Models in Android Apps
Link: https://arxiv.org/abs/2501.01263
Authors: Jiali Wei, Ming Fan, Xicheng Zhang, Wenjing Jiao, Haijun Wang, Ting Liu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Powered by their superior performance, deep neural networks (DNNs) have found widespread applications across various domains. Many deep learning (DL) models are now embedded in mobile apps, making them more accessible to end users through on-device DL. However, deploying on-device DL to users’ smartphones simultaneously introduces several security threats. One primary threat is backdoor attacks. Extensive research has explored backdoor attacks for several years and has proposed numerous attack approaches. However, few studies have investigated backdoor attacks on DL models deployed in the real world, or they have shown obvious deficiencies in effectiveness and stealthiness. In this work, we explore more effective and stealthy backdoor attacks on real-world DL models extracted from mobile apps. Our main justification is that imperceptible and sample-specific backdoor triggers generated by DNN-based steganography can enhance the efficacy of backdoor attacks on real-world models. We first confirm the effectiveness of steganography-based backdoor attacks on four state-of-the-art DNN models. Subsequently, we systematically evaluate and analyze the stealthiness of the attacks to ensure they are difficult to perceive. Finally, we implement the backdoor attacks on real-world models and compare our approach with three baseline methods. We collect 38,387 mobile apps, extract 89 DL models from them, and analyze these models to obtain the prerequisite model information for the attacks. After identifying the target models, our approach achieves an average of 12.50% higher attack success rate than DeepPayload while better maintaining the normal performance of the models. Extensive experimental results demonstrate that our method enables more effective, robust, and stealthy backdoor attacks on real-world models.
[AI-10] An Efficient Attention Mechanism for Sequential Recommendation Tasks: HydraRec
Link: https://arxiv.org/abs/2501.01242
Authors: Uzma Mushtaque
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformer-based models are increasingly being used in various domains, including recommender systems (RS). Pretrained transformer models such as BERT have shown good performance at language modelling. With their greater ability to model sequential tasks, variants of encoder-only models (such as BERT4Rec and SASRec) have found success in sequential RS problems. Computing dot-product attention in traditional transformer models has quadratic complexity in sequence length. This is a bigger problem for RS because, unlike in language models, new items are added to the catalogue every day. User buying history is a dynamic sequence which depends on multiple factors. Recently, various linear attention models have tried to solve this problem by making the model linear in sequence length (token dimensions). Hydra attention is one such linear-complexity model proposed for vision transformers, which reduces the complexity of attention in both the number of tokens and the model embedding dimension. Building on the idea of Hydra attention, we introduce an efficient Transformer-based sequential RS (HydraRec) which significantly improves the theoretical complexity of computing attention for longer sequences and bigger datasets while preserving the temporal context. Extensive experiments are conducted to evaluate other linear transformer-based RS models and compare them with HydraRec across various evaluation metrics. HydraRec outperforms other linear attention-based models as well as dot-product-based attention models when used with causal masking for next-item prediction in sequential recommendation. For bi-directional models, its performance is comparable to the BERT4Rec model with an improvement in running time.
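To make the complexity argument concrete, here is a small pure-Python sketch of Hydra-style attention as proposed for vision transformers: queries and keys are L2-normalized, the keys and values are combined element-wise and summed over tokens once, and each query gates that global vector, for a cost linear in both the number of tokens N and the embedding dimension D. The causal branch, which replaces the global sum with a running prefix sum, is an assumption about how causal masking could be applied in this form and is not taken from the HydraRec paper; `hydra_attention` and the toy inputs are hypothetical names.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def hydra_attention(queries, keys, values, causal=False):
    """Hydra-style attention over N token vectors of dimension D in O(N * D)."""
    n, d = len(queries), len(queries[0])
    q = [l2_normalize(row) for row in queries]
    k = [l2_normalize(row) for row in keys]
    if causal:
        # Assumed causal variant: token t only sees the prefix sum over tokens 0..t.
        out, running = [], [0.0] * d
        for t in range(n):
            running = [running[j] + k[t][j] * values[t][j] for j in range(d)]
            out.append([q[t][j] * running[j] for j in range(d)])
        return out
    # Bidirectional variant: one global key-value summary shared by all tokens.
    kv = [sum(k[t][j] * values[t][j] for t in range(n)) for j in range(d)]
    return [[q[t][j] * kv[j] for j in range(d)] for t in range(n)]


# Toy inputs: 3 tokens with 2 features each (made-up values).
toy_q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
toy_k = [[0.5, 0.5], [1.0, 0.0], [0.0, 3.0]]
toy_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full_out = hydra_attention(toy_q, toy_k, toy_v)
causal_out = hydra_attention(toy_q, toy_k, toy_v, causal=True)
```

No N x N attention matrix is ever formed, which is where the saving over dot-product attention comes from; the trade-off is that token interactions are reduced to a per-feature summary.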
[AI-11] A redescription mining framework for post-hoc explaining and relating deep learning models
Link: https://arxiv.org/abs/2501.01209
Authors: Matej Mihelčić, Ivan Grubišić, Miha Keber
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Deep learning models (DLMs) achieve increasingly high performance both on structured and unstructured data. They significantly extended applicability of machine learning to various domains. Their success in making predictions, detecting patterns and generating new data made significant impact on science and industry. Despite these accomplishments, DLMs are difficult to explain because of their enormous size. In this work, we propose a novel framework for post-hoc explaining and relating DLMs using redescriptions. The framework allows cohort analysis of arbitrary DLMs by identifying statistically significant redescriptions of neuron activations. It allows coupling neurons to a set of target labels or sets of descriptive attributes, relating layers within a single DLM or associating different DLMs. The proposed framework is independent of the artificial neural network architecture and can work with more complex target labels (e.g. multi-label or multi-target scenario). Additionally, it can emulate both pedagogical and decompositional approach to rule extraction. The aforementioned properties of the proposed framework can increase explainability and interpretability of arbitrary DLMs by providing different information compared to existing explainable-AI approaches.
[AI-12] A3: Android Agent Arena for Mobile GUI Agents
Link: https://arxiv.org/abs/2501.01149
Authors: Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations and a new autonomous evaluation process for less human labor and coding expertise. The project is available at this https URL.
[AI-13] Symmetries-enhanced Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2501.01136
Authors: Nikolaos Bousias, Stefanos Pertigkiozoglou, Kostas Daniilidis, George Pappas
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Representation Theory (math.RT)
Comments:
Abstract:Multi-agent reinforcement learning has emerged as a powerful framework for enabling agents to learn complex, coordinated behaviors but faces persistent challenges regarding its generalization, scalability and sample efficiency. Recent advancements have sought to alleviate those issues by embedding intrinsic symmetries of the systems in the policy. Yet, most dynamical systems exhibit little to no symmetries to exploit. This paper presents a novel framework for embedding extrinsic symmetries in multi-agent system dynamics that enables the use of symmetry-enhanced methods to address systems with insufficient intrinsic symmetries, expanding the scope of equivariant learning to a wide variety of MARL problems. Central to our framework is the Group Equivariant Graphormer, a group-modular architecture specifically designed for distributed swarming tasks. Extensive experiments on a swarm of symmetry-breaking quadrotors validate the effectiveness of our approach, showcasing its potential for improved generalization and zero-shot scalability. Our method achieves significant reductions in collision rates and enhances task success rates across a diverse range of scenarios and varying swarm sizes.
[AI-14] Pruning-based Data Selection and Network Fusion for Efficient Deep Learning NEURIPS2024
Link: https://arxiv.org/abs/2501.01118
Authors: Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Workshop on Attributing Model Behavior at Scale (ATTRIB)
Abstract:Efficient data selection is essential for improving the training efficiency of deep neural networks and reducing the associated annotation costs. However, traditional methods tend to be computationally expensive, limiting their scalability and real-world applicability. We introduce PruneFuse, a novel method that combines pruning and network fusion to enhance data selection and accelerate network training. In PruneFuse, the original dense network is pruned to generate a smaller surrogate model that efficiently selects the most informative samples from the dataset. Once this iterative data selection selects sufficient samples, the insights learned from the pruned model are seamlessly integrated with the dense model through network fusion, providing an optimized initialization that accelerates training. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.
[AI-15] Robust COVID-19 Detection from Cough Sounds using Deep Neural Decision Tree and Forest: A Comprehensive Cross-Datasets Evaluation
Link: https://arxiv.org/abs/2501.01117
Authors: Rofiqul Islam, Nihad Karim Chowdhury, Muhammad Ashad Kabir
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 39 pages
Abstract:This research presents a robust approach to classifying COVID-19 cough sounds using cutting-edge machine-learning techniques. Leveraging deep neural decision trees and deep neural decision forests, our methodology demonstrates consistent performance across diverse cough sound datasets. We begin with a comprehensive extraction of features to capture a wide range of audio features from individuals, whether COVID-19 positive or negative. To determine the most important features, we use recursive feature elimination along with cross-validation. Bayesian optimization fine-tunes hyper-parameters of deep neural decision tree and deep neural decision forest models. Additionally, we integrate the SMOTE during training to ensure a balanced representation of positive and negative data. Model performance refinement is achieved through threshold optimization, maximizing the ROC-AUC score. Our approach undergoes a comprehensive evaluation in five datasets: Cambridge, Coswara, COUGHVID, Virufy, and the combined Virufy with the NoCoCoDa dataset. Consistently outperforming state-of-the-art methods, our proposed approach yields notable AUC scores of 0.97, 0.98, 0.92, 0.93, 0.99, and 0.99 across the respective datasets. Merging all datasets into a combined dataset, our method, using a deep neural decision forest classifier, achieves an AUC of 0.97. Also, our study includes a comprehensive cross-datasets analysis, revealing demographic and geographic differences in the cough sounds associated with COVID-19. These differences highlight the challenges in transferring learned features across diverse datasets and underscore the potential benefits of dataset integration, improving generalizability and enhancing COVID-19 detection from audio signals.
[AI-16] MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification AAAI2025
Link: https://arxiv.org/abs/2501.01110
Authors: Jimin Park, AHyun Ji, Minji Park, Mohammad Saidur Rahman, Se Eun Oh
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted paper at AAAI 2025 (Thirty-Ninth AAAI Conference on Artificial Intelligence, AAAI-25). 9 pages, Figure 6, Table 1
Abstract:Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model’s performance on old data degrades over time. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model’s hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario – where new classes are introduced continuously over multiple tasks – demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at this https URL (the code will be made public upon the presentation of the paper).
[AI-17] MMVA: Multimodal Matching Based on Valence and Arousal across Images Music and Musical Captions AAAI2025
Link: https://arxiv.org/abs/2501.01094
Authors: Suhwan Choi, Kyu Won Kim, Myungjoo Kang
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Paper accepted in Artificial Intelligence for Music workshop at AAAI 2025
Abstract:We introduce Multimodal Matching based on Valence and Arousal (MMVA), a tri-modal encoder framework designed to capture emotional content across images, music, and musical captions. To support this framework, we expand the Image-Music-Emotion-Matching-Net (IMEMNet) dataset, creating IMEMNet-C, which includes 24,756 images and 25,944 music clips with corresponding musical captions. We employ multimodal matching scores based on the continuous valence (emotional positivity) and arousal (emotional intensity) values. This continuous matching score allows for random sampling of image-music pairs during training by computing similarity scores from the valence-arousal values across different modalities. Consequently, the proposed approach achieves state-of-the-art performance in valence-arousal prediction tasks. Furthermore, the framework demonstrates its efficacy in various zero-shot tasks, highlighting the potential of valence and arousal predictions in downstream applications.
[AI-18] Graph Generative Pre-trained Transformer
Link: https://arxiv.org/abs/2501.01073
Authors: Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint
Abstract:Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node set and edge set. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT’s capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.
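The "graph as a sequence of node set and edge set" idea can be made tangible with a toy serializer. The token vocabulary below (`<nodes>`, `<edges>`, `<eos>`, and `n0`-style node identifiers) is invented for illustration and is not G2PT's actual representation; the sketch only shows that such a sequence is lossless and has length proportional to the number of nodes plus edges, rather than to the square of the node count as with an adjacency matrix.

```python
def graph_to_tokens(node_labels, edges):
    """Flatten a labeled graph into one token list: node set first, then edge set."""
    tokens = ["<nodes>"]
    for idx, label in enumerate(node_labels):
        tokens += [f"n{idx}", label]
    tokens.append("<edges>")
    for src, dst, label in edges:
        tokens += [f"n{src}", f"n{dst}", label]
    tokens.append("<eos>")
    return tokens


def tokens_to_graph(tokens):
    """Invert graph_to_tokens; assumes a well-formed sequence."""
    split = tokens.index("<edges>")
    node_part = tokens[1:split]
    edge_part = tokens[split + 1:-1]
    node_labels = [node_part[i + 1] for i in range(0, len(node_part), 2)]
    edges = [
        (int(edge_part[i][1:]), int(edge_part[i + 1][1:]), edge_part[i + 2])
        for i in range(0, len(edge_part), 3)
    ]
    return node_labels, edges


# Toy molecule-like graph: three atoms, two bonds (illustrative labels only).
toy_nodes = ["C", "C", "O"]
toy_edges = [(0, 1, "single"), (1, 2, "double")]
toy_tokens = graph_to_tokens(toy_nodes, toy_edges)
decoded_nodes, decoded_edges = tokens_to_graph(toy_tokens)
```

A sequence like `toy_tokens` is what an auto-regressive model would be trained on with next-token prediction; sampling a sequence and decoding it with `tokens_to_graph` would then yield a generated graph.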
[AI-19] Towards Adversarially Robust Deep Metric Learning
Link: https://arxiv.org/abs/2501.01025
Authors: Xiaopeng Ke
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep Metric Learning (DML) has shown remarkable successes in many domains by taking advantage of powerful deep neural networks. Deep neural networks are prone to adversarial attacks and could be easily fooled by adversarial examples. The current progress on this robustness issue is mainly about deep classification models but pays little attention to DML models. Existing works fail to thoroughly inspect the robustness of DML and neglect an important DML scenario, the clustering-based inference. In this work, we first point out the robustness issue of DML models in clustering-based inference scenarios. We find that, for the clustering-based inference, existing defenses designed for DML cannot be reused, and the adaptations of defenses designed for deep classification models cannot achieve satisfactory robustness performance. To alleviate the hazard of adversarial examples, we propose a new defense, the Ensemble Adversarial Training (EAT), which exploits ensemble learning and adversarial training. EAT promotes the diversity of the ensemble, encouraging each model in the ensemble to have different robustness features, and employs a self-transferring mechanism to make full use of the robustness statistics of the whole ensemble in the update of every single model. We evaluate the EAT method on three widely-used datasets with two popular model architectures. The results show that the proposed EAT method greatly outperforms the adaptations of defenses designed for deep classification models.
[AI-20] CryptoMamba: Leveraging State Space Models for Accurate Bitcoin Price Prediction
Link: https://arxiv.org/abs/2501.01010
Authors: Mohammad Shahab Sepehri,Asal Mehradfar,Mahdi Soltanolkotabi,Salman Avestimehr
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Predicting Bitcoin price remains a challenging problem due to the high volatility and complex non-linear dynamics of cryptocurrency markets. Traditional time-series models, such as ARIMA and GARCH, and recurrent neural networks, like LSTMs, have been widely applied to this task but struggle to capture the regime shifts and long-range dependencies inherent in the data. In this work, we propose CryptoMamba, a novel Mamba-based State Space Model (SSM) architecture designed to effectively capture long-range dependencies in financial time-series data. Our experiments show that CryptoMamba not only provides more accurate predictions but also offers enhanced generalizability across different market conditions, surpassing the limitations of previous models. Coupled with trading algorithms for real-world scenarios, CryptoMamba demonstrates its practical utility by translating accurate forecasts into financial outcomes. Our findings signal a huge advantage for SSMs in stock and cryptocurrency price forecasting tasks.
[AI-21] Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review
Link: https://arxiv.org/abs/2501.01007
Authors: Yan Gu,Zhaoze Liu,Shuhong Dai,Cong Liu,Ying Wang,Shen Wang,Georgios Theodoropoulos,Long Cheng
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cloud computing has revolutionized the provisioning of computing resources, offering scalable, flexible, and on-demand services to meet the diverse requirements of modern applications. At the heart of efficient cloud operations are job scheduling and resource management, which are critical for optimizing system performance and ensuring timely and cost-effective service delivery. However, the dynamic and heterogeneous nature of cloud environments presents significant challenges for these tasks, as workloads and resource availability can fluctuate unpredictably. Traditional approaches, including heuristic and meta-heuristic algorithms, often struggle to adapt to these real-time changes due to their reliance on static models or predefined rules. Deep Reinforcement Learning (DRL) has emerged as a promising solution to these challenges by enabling systems to learn and adapt policies based on continuous observations of the environment, facilitating intelligent and responsive decision-making. This survey provides a comprehensive review of DRL-based algorithms for job scheduling and resource management in cloud computing, analyzing their methodologies, performance metrics, and practical applications. We also highlight emerging trends and future research directions, offering valuable insights into leveraging DRL to advance both job scheduling and resource management in cloud computing.
[AI-22] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Link: https://arxiv.org/abs/2501.01005
Authors: Zihao Ye,Lequn Chen,Ruihang Lai,Wuwei Lin,Yineng Zhang,Stephanie Wang,Tianqi Chen,Baris Kasikci,Vinod Grover,Arvind Krishnamurthy,Luis Ceze
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: code available at this http URL
Abstract:Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer’s load-balanced scheduling algorithm adapts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer’s ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction relative to compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
[AI-23] Bootstrapped Reward Shaping AAAI-2025
Link: https://arxiv.org/abs/2501.00989
Authors: Jacob Adamczyk,Volodymyr Makarenko,Stas Tiomkin,Rahul V. Kulkarni
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI-2025, Main Track
Abstract:In reinforcement learning, especially in sparse-reward domains, many environment steps are required to observe reward information. In order to increase the frequency of such observations, “potential-based reward shaping” (PBRS) has been proposed as a method of providing a more dense reward signal while leaving the optimal policy invariant. However, the required “potential function” must be carefully designed with task-dependent knowledge to not deter training performance. In this work, we propose a “bootstrapped” method of reward shaping, termed BSRS, in which the agent’s current estimate of the state-value function acts as the potential function for PBRS. We provide convergence proofs for the tabular setting, give insights into training dynamics for deep RL, and show that the proposed method improves training speed in the Atari suite.
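The shaping rule described in this abstract can be illustrated with a minimal tabular sketch. Standard PBRS adds gamma * Phi(s') - Phi(s) to the environment reward, and the abstract proposes using the agent's current state-value estimate as Phi. The function name `shaped_reward`, the scale factor `eta`, the dictionary-based value table, and the zero potential at terminal states are assumptions made for illustration and are not taken from the paper's code.

```python
def shaped_reward(reward, value, state, next_state, gamma=0.99, eta=1.0, done=False):
    """Return the PBRS-shaped reward r + eta * (gamma * Phi(s') - Phi(s)).

    Phi is read from `value`, the agent's *current* state-value estimate,
    so the potential changes as learning progresses (the "bootstrapped" part).
    `eta` is a hypothetical scale factor; eta=1.0 gives plain PBRS.
    """
    # Assumption: terminal states carry zero potential (a common convention).
    next_potential = 0.0 if done else value[next_state]
    return reward + eta * (gamma * next_potential - value[state])


# Toy value table standing in for the agent's current estimate.
value = {"s0": 1.0, "s1": 4.0}
bonus_step = shaped_reward(0.0, value, "s0", "s1", gamma=0.5)
terminal_step = shaped_reward(1.0, value, "s0", "s1", gamma=0.5, done=True)
```

In a deep RL setting the table lookup would be replaced by a forward pass of the value network; how that estimate is derived and scaled is a design choice this sketch does not settle.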
[AI-24] beta-DQN: Improving Deep Q-Learning By Evolving the Behavior
Link: https://arxiv.org/abs/2501.00913
Authors: Hongming Zhang,Fengshuo Bai,Chenjun Xiao,Chao Gao,Bo Xu,Martin Müller
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
[AI-25] Population Aware Diffusion for Time Series Generation AAAI-2025
Link: https://arxiv.org/abs/2501.00910
Authors: Yang Li,Han Meng,Zhenyu Bi,Ingolv T. Urnes,Haipeng Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at AAAI-2025, 8 pages
Abstract:Diffusion models have shown promising ability in generating high-quality time series (TS) data. Despite the initial success, existing works mostly focus on the authenticity of data at the individual level, but pay less attention to preserving the population-level properties on the entire dataset. Such population-level properties include value distributions for each dimension and distributions of certain functional dependencies (e.g., cross-correlation, CC) between different dimensions. For instance, when generating house energy consumption TS data, the value distributions of the outside temperature and the kitchen temperature should be preserved, as well as the distribution of CC between them. Preserving such TS population-level properties is critical in maintaining the statistical insights of the datasets, mitigating model bias, and augmenting downstream tasks like TS prediction. Yet, it is often overlooked by existing models. Hence, data generated by existing models often bear distribution shifts from the original data. We propose Population-aware Diffusion for Time Series (PaD-TS), a new TS generation model that better preserves the population-level properties. The key novelties of PaD-TS include 1) a new training method explicitly incorporating TS population-level property preservation, and 2) a new dual-channel encoder model architecture that better captures the TS data structure. Empirical results in major benchmark datasets show that PaD-TS can improve the average CC distribution shift score between real and synthetic data by 5.9x while maintaining a performance comparable to state-of-the-art models on individual-level authenticity.
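To make the population-level property in this abstract concrete, the sketch below computes the cross-correlation (CC) between two dimensions for every series in a dataset, which yields the distribution that a generator would need to preserve. It uses the Pearson correlation as the CC measure, which is an assumption here; the helper names `pearson` and `cc_values` and the nested-list data layout are hypothetical, and the paper's actual shift score is not reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Note: raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def cc_values(dataset, i, j):
    """One CC value per series, between dimensions i and j.

    `dataset` is a list of series; each series is a list of time steps;
    each time step is a list of D feature values.
    """
    return [
        pearson([step[i] for step in series], [step[j] for step in series])
        for series in dataset
    ]


# Two toy series: perfectly correlated and perfectly anti-correlated dimensions.
toy_dataset = [
    [[t, 2 * t + 1] for t in range(5)],
    [[t, -t] for t in range(5)],
]
toy_cc = cc_values(toy_dataset, 0, 1)
```

Running `cc_values` on both the real and the synthetic dataset gives two collections of CC values whose distributions can then be compared with any distance of choice.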
[AI-26] Large Language Model Based Multi-Agent System Augmented Complex Event Processing Pipeline for Internet of Multimedia Things
Link: https://arxiv.org/abs/2501.00906
Authors: Talha Zeeshan,Abhishek Kumar,Susanna Pirttikangas,Sasu Tarkoma
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:
Abstract:This paper presents the development and evaluation of a multi-agent system framework based on Large Language Models (LLMs), also known as foundation models, for complex event processing (CEP), with a focus on video query processing use cases. The primary goal is to create a proof-of-concept (POC) that integrates state-of-the-art LLM orchestration frameworks with publish/subscribe (pub/sub) tools to address the integration of LLMs with current CEP systems. Utilizing the Autogen framework in conjunction with Kafka message brokers, the system demonstrates an autonomous CEP pipeline capable of handling complex workflows. Extensive experiments evaluate the system’s performance across varying configurations, complexities, and video resolutions, revealing the trade-offs between functionality and latency. The results show that while higher agent counts and video complexities increase latency, the system maintains high consistency in narrative coherence. This research builds upon and contributes to existing novel approaches to distributed AI systems, offering detailed insights into integrating such systems into existing infrastructures.
[AI-27] Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts
Link: https://arxiv.org/abs/2501.00891
Authors: Zhuohua Li,Maoli Liu,Xiangxiang Dai,John C.S. Lui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:The contextual multi-armed bandit (MAB) problem is crucial in sequential decision-making. A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters, utilizing shared features to improve learning efficiency. However, existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters. As a result, their theoretical analyses require several strong assumptions about the “diversity” of contexts generated by the environment, leading to impractical settings, complicated analyses, and poor practical performance. Removing these assumptions has been a long-standing open problem in the clustering of bandits literature. In this paper, we provide two solutions to this open problem. First, following the i.i.d. context generation setting in existing studies, we propose two novel algorithms, UniCLUB and PhaseUniCLUB, which incorporate enhanced exploration mechanisms to accelerate cluster identification. Remarkably, our algorithms require substantially weaker assumptions while achieving regret bounds comparable to prior work. Second, inspired by the smoothed analysis framework, we propose a more practical setting that eliminates the requirement for i.i.d. context generation used in previous studies, thus enhancing the performance of existing algorithms for online clustering of bandits. Our technique can be applied to both graph-based and set-based clustering of bandits frameworks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our proposed algorithms consistently outperform existing approaches.
[AI-28] Diversity Optimization for Travelling Salesman Problem via Deep Reinforcement Learning
Link: https://arxiv.org/abs/2501.00884
Authors: Qi Li,Zhiguang Cao,Yining Ma,Yaoxin Wu,Yue-Jiao Gong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing neural methods for the Travelling Salesman Problem (TSP) mostly aim at finding a single optimal solution. To discover diverse yet high-quality solutions for Multi-Solution TSP (MSTSP), we propose a novel deep reinforcement learning based neural solver, which is primarily featured by an encoder-decoder structured policy. Concretely, on the one hand, a Relativization Filter (RF) is designed to enhance the robustness of the encoder to affine transformations of the instances, so as to potentially improve the quality of the found solutions. On the other hand, a Multi-Attentive Adaptive Active Search (MA3S) is tailored to allow the decoders to strike a balance between optimality and diversity. Experimental evaluations on benchmark instances demonstrate the superiority of our method over recent neural baselines across different metrics, and its competitive performance against state-of-the-art traditional heuristics with significantly reduced computational time, ranging from $1.3\times$ to $15\times$ faster. Furthermore, we demonstrate that our method can also be applied to the Capacitated Vehicle Routing Problem (CVRP).
[AI-29] What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics
Link: https://arxiv.org/abs/2501.00855
Authors: Lynnette Hui Xian Ng,Kathleen M. Carley
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:
Abstract:Chatter on social media is 20% bots and 80% humans. Chatter by bots and humans is consistently different: bots tend to use linguistic cues that can be easily automated while humans use cues that require dialogue understanding. Bots use words that match the identities they choose to present, while humans may send messages that are not related to the identities they present. Bots and humans differ in their communication structure: sampled bots have a star interaction structure, while sampled humans have a hierarchical structure. These conclusions are based on a large-scale analysis of social media tweets from approximately 200 million users across 7 events. Social media bots took the world by storm when social-cybersecurity researchers realized that social media users consisted not only of humans but also of artificial agents called bots. These bots wreak havoc online by spreading disinformation and manipulating narratives. Most research on bots is based on special-purpose definitions, mostly predicated on the event studied. This article begins by asking, “What is a bot?”, and we study the underlying principles of how bots are different from humans. We develop a first-principle definition of a social media bot. With this definition as a premise, we systematically compare characteristics between bots and humans across global events, and reflect on how the software-programmed bot is an Artificial Intelligent algorithm, and its potential for evolution as technology advances. Based on our results, we provide recommendations for the use and regulation of bots. Finally, we discuss open challenges and future directions: Detect, to systematically identify these automated and potentially evolving bots; Differentiate, to evaluate the goodness of the bot in terms of their content postings and relationship interactions; Disrupt, to moderate the impact of malicious bots.
[AI-30] Distilled Lifelong Self-Adaptation for Configurable Systems ICSE2025
Link: https://arxiv.org/abs/2501.00840
Authors: Yulong Ye,Tao Chen,Miqing Li
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by the 2025 International Conference on Software Engineering (ICSE 2025)
Abstract:Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches, as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning about the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties. Firstly, it supports lifelong planning, whereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, the planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 229% and a resource acceleration of up to 2.22x on generating promising adaptation configurations. All data and sources can be found at our repository: this https URL.
[AI-31] An LLM-Empowered Adaptive Evolutionary Algorithm For Multi-Component Deep Learning Systems
Link: https://arxiv.org/abs/2501.00829
Authors: Haoxiang Tian,Xingshuo Han,Guoquan Wu,An Guo,Yuan Zhou,Jie Zhang,Shuo Li,Jun Wei,Tianwei Zhang
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 9
Abstract:Multi-objective evolutionary algorithms (MOEAs) are widely used for searching optimal solutions in complex multi-component applications. Traditional MOEAs for multi-component deep learning (MCDL) systems face challenges in enhancing the search efficiency while maintaining diversity. To combat these, this paper proposes $\mu$MOEA, the first LLM-empowered adaptive evolutionary search algorithm to detect safety violations in MCDL systems. Inspired by the context-understanding ability of Large Language Models (LLMs), $\mu$MOEA prompts the LLM to comprehend the optimization problem and generate an initial population tailored to evolutionary objectives. Subsequently, it employs adaptive selection and variation to iteratively produce offspring, balancing evolutionary efficiency and diversity. During the evolutionary process, to navigate away from local optima, $\mu$MOEA integrates the evolutionary experience back into the LLM. This harnesses the LLM’s quantitative reasoning prowess to generate differential seeds, breaking away from current optimal solutions. We evaluate $\mu$MOEA in finding safety violations of MCDL systems, and compare its performance with state-of-the-art MOEA methods. Experimental results show that $\mu$MOEA can significantly improve the efficiency and diversity of the evolutionary search.
[AI-32] Make Shuffling Great Again: A Side-Channel Resistant Fisher-Yates Algorithm for Protecting Neural Networks
Link: https://arxiv.org/abs/2501.00798
Authors: Leonard Puškáč,Marek Benovič,Jakub Breier,Xiaolu Hou
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural network models implemented in embedded devices have been shown to be susceptible to side-channel attacks (SCAs), allowing recovery of proprietary model parameters, such as weights and biases. Countermeasures currently used for protecting cryptographic implementations can be tailored to protect embedded neural network models. Shuffling, a hiding-based countermeasure that randomly shuffles the order of computations, was shown to be vulnerable to SCA when the Fisher-Yates algorithm is used. In this paper, we propose a design of an SCA-secure version of the Fisher-Yates algorithm. By integrating the masking technique for modular reduction and Blakely’s method for modular multiplication, we effectively remove the vulnerability in the division operation that led to side-channel leakage in the original version of the algorithm. We experimentally show that the countermeasure is effective against SCA by implementing a correlation power analysis attack on an embedded neural network model implemented on ARM Cortex-M4. Compared to the original proposal, the memory overhead is $2\times$ the biggest layer of the network, while the time overhead varies from 4% to 0.49% for a layer with 100 and 1000 neurons, respectively.
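For orientation, here is a minimal sketch of the textbook, unprotected Fisher-Yates shuffle that this abstract takes as its starting point. It is not the paper's countermeasure: no masking and no Blakely multiplication are shown. The function name and the use of a 32-bit random word are illustrative assumptions. The marked line is the modular reduction, which needs a division on typical embedded targets and corresponds to the operation the abstract identifies as the source of leakage.

```python
import secrets


def fisher_yates_shuffle(items):
    """Return a shuffled copy of `items` using the classic Fisher-Yates loop."""
    result = list(items)
    for i in range(len(result) - 1, 0, -1):
        random_word = secrets.randbits(32)
        # Modular reduction of a random word into the range 0..i.
        # This step is the unprotected operation; it also has a slight
        # modulo bias, which is accepted here for simplicity.
        j = random_word % (i + 1)
        result[i], result[j] = result[j], result[i]
    return result


# Example: shuffle the evaluation order of 10 neurons in a layer.
neuron_order = fisher_yates_shuffle(range(10))
```

The output is always a permutation of the input, so the shuffled order can be used directly as the index sequence for per-neuron computations.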
[AI-33] LENS-XAI: Redefining Lightweight and Explainable Network Security through Knowledge Distillation and Variational Autoencoders for Scalable Intrusion Detection in Cybersecurity
Link: https://arxiv.org/abs/2501.00790
Authors: Muhammet Anil Yagiz,Polat Goktas
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments:
Abstract:The rapid proliferation of Industrial Internet of Things (IIoT) systems necessitates advanced, interpretable, and scalable intrusion detection systems (IDS) to combat emerging cyber threats. Traditional IDS face challenges such as high computational demands, limited explainability, and inflexibility against evolving attack patterns. To address these limitations, this study introduces the Lightweight Explainable Network Security framework (LENS-XAI), which combines robust intrusion detection with enhanced interpretability and scalability. LENS-XAI integrates knowledge distillation, variational autoencoder models, and attribution-based explainability techniques to achieve high detection accuracy and transparency in decision-making. By leveraging a training set comprising 10% of the available data, the framework optimizes computational efficiency without sacrificing performance. Experimental evaluation on four benchmark datasets: Edge-IIoTset, UKM-IDS20, CTU-13, and NSL-KDD, demonstrates the framework’s superior performance, achieving detection accuracies of 95.34%, 99.92%, 98.42%, and 99.34%, respectively. Additionally, the framework excels in reducing false positives and adapting to complex attack scenarios, outperforming existing state-of-the-art methods. Key strengths of LENS-XAI include its lightweight design, suitable for resource-constrained environments, and its scalability across diverse IIoT and cybersecurity contexts. Moreover, the explainability module enhances trust and transparency, critical for practical deployment in dynamic and sensitive applications. This research contributes significantly to advancing IDS by addressing computational efficiency, feature interpretability, and real-world applicability. Future work could focus on extending the framework to ensemble AI systems for distributed environments, further enhancing its robustness and adaptability.
[AI-34] REM: A Scalable Reinforced Multi-Expert Framework for Multiplex Influence Maximization
Link: https://arxiv.org/abs/2501.00779
Authors: Huyen Nguyen,Hieu Dam,Nguyen Do,Cong Tran,Cuong Pham
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:In online social platforms, identifying influential seed users to maximize influence spread is crucial, as it can greatly reduce the cost and effort required for information dissemination. While effective, traditional methods for Multiplex Influence Maximization (MIM) have reached their performance limits, prompting the emergence of learning-based approaches. These novel methods aim for better generalization and scalability on larger graphs but face significant challenges, such as (1) inability to handle unknown diffusion patterns and (2) reliance on high-quality training samples. To address these issues, we propose the Reinforced Expert Maximization framework (REM). REM leverages a Propagation Mixture of Experts technique to encode dynamic propagation of large multiplex networks effectively in order to generate enhanced influence propagation. Notably, REM treats a generative model as a policy to autonomously generate different seed sets and learn how to improve them from a Reinforcement Learning perspective. Extensive experiments on several real-world datasets demonstrate that REM surpasses state-of-the-art methods in terms of influence spread, scalability, and inference time in influence maximization tasks.
[AI-35] Revisiting Graph Neural Networks on Graph-level Tasks: Comprehensive Experiments Analysis and Improvements
Link: https://arxiv.org/abs/2501.00773
Authors: Haoyang Li,Yuming Xu,Chen Jason Zhang,Alexander Zhou,Lei Chen,Qing Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:Graphs are essential data structures for modeling complex interactions in domains such as social networks, molecular structures, and biological systems. Graph-level tasks, which predict properties or classes for the entire graph, are critical for applications such as molecular property prediction and subgraph counting. Graph Neural Networks (GNNs) have shown promise in these tasks, but their evaluations are often limited to narrow datasets, tasks, and inconsistent experimental setups, restricting their generalizability. To address these limitations, we propose a unified evaluation framework for graph-level GNNs. This framework provides a standardized setting to evaluate GNNs across diverse datasets, various graph tasks (e.g., graph classification and regression), and challenging scenarios, including noisy, imbalanced, and few-shot graphs. Additionally, we propose a novel GNN model with enhanced expressivity and generalization capabilities. Specifically, we enhance the expressivity of GNNs through a $k$-path rooted subgraph approach, enabling the model to effectively count subgraphs (e.g., paths and cycles). Moreover, we introduce a unified graph contrastive learning algorithm for graphs across diverse domains, which adaptively removes unimportant edges to augment graphs, thereby significantly improving generalization performance. Extensive experiments demonstrate that our model achieves superior performance against fourteen effective baselines across twenty-seven graph datasets, establishing it as a robust and generalizable model for graph-level tasks.
[AI-36] Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform
Link: https://arxiv.org/abs/2501.00750
Authors: Cheonsu Jeong
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 27 figures
Abstract:This study proposes the design and implementation of a multimodal LLM-based Multi-Agent System (MAS) leveraging a No-Code platform to address the practical constraints and significant entry barriers associated with AI adoption in enterprises. Advanced AI technologies, such as Large Language Models (LLMs), often pose challenges due to their technical complexity and high implementation costs, making them difficult for many organizations to adopt. To overcome these limitations, this research develops a No-Code-based Multi-Agent System designed to enable users without programming knowledge to easily build and manage AI systems. The study examines various use cases to validate the applicability of AI in business processes, including code generation from image-based notes, Advanced RAG-based question-answering systems, text-based image generation, and video generation using images and prompts. These systems lower the barriers to AI adoption, empowering not only professional developers but also general users to harness AI for significantly improved productivity and efficiency. By demonstrating the scalability and accessibility of No-Code platforms, this study advances the democratization of AI technologies within enterprises and validates the practical applicability of Multi-Agent Systems, ultimately contributing to the widespread adoption of AI across various industries.
[AI-37] AttriReBoost: A Gradient-Free Propagation Optimization Method for Cold Start Mitigation in Attribute Missing Graphs
Link: https://arxiv.org/abs/2501.00743
Authors: Mengran Li,Chaojun Ding,Junzhou Chen,Wenbin Xing,Cong Ye,Ronghui Zhang,Songlin Zhuang,Jia Hu,Tony Z. Qiu,Huijun Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Missing attribute issues are prevalent in graph learning, leading to biased outcomes in Graph Neural Networks (GNNs). Existing methods that rely on feature propagation are prone to the cold start problem, particularly when dealing with attribute resetting and low-degree nodes, which hinder effective propagation and convergence. To address these challenges, we propose AttriReBoost (ARB), a novel propagation-based method that mitigates cold start problems in attribute-missing graphs. ARB enhances global feature propagation by redefining initial boundary conditions and strategically integrating virtual edges, thereby improving node connectivity and ensuring more stable and efficient convergence. This method facilitates gradient-free attribute reconstruction with lower computational overhead. The proposed method is theoretically grounded, with its convergence rigorously established. Extensive experiments on several real-world benchmark datasets demonstrate the effectiveness of ARB, achieving an average accuracy improvement of 5.11% over state-of-the-art methods. Additionally, ARB exhibits remarkable computational efficiency, processing a large-scale graph with 2.49 million nodes in just 16 seconds on a single GPU. Our code is available at this https URL.
[AI-38] Grade Inflation in Generative Models
Link: https://arxiv.org/abs/2501.00664
Authors: Phuc Nguyen,Miao Li,Alexandra Morgan,Rima Arnaout,Ramy Arnaout
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 6 figures, 1 table
Abstract:Generative models hold great potential, but only if one can trust the evaluation of the data they generate. We show that many commonly used quality scores for comparing two-dimensional distributions of synthetic vs. ground-truth data give better results than they should, a phenomenon we call the “grade inflation problem.” We show that the correlation score, Jaccard score, earth-mover’s score, and Kullback-Leibler (relative-entropy) score all suffer grade inflation. We propose that any score that values all datapoints equally, as these do, will also exhibit grade inflation; we refer to such scores as “equipoint” scores. We introduce the concept of “equidensity” scores, and present the Eden score, to our knowledge the first example of such a score. We found that Eden avoids grade inflation and agrees better with human perception of goodness-of-fit than the equipoint scores above. We propose that any reasonable equidensity score will avoid grade inflation. We identify a connection between equidensity scores and Rényi entropy of negative order. We conclude that equidensity scores are likely to outperform equipoint scores for generative models, and for comparing low-dimensional distributions more generally.
[AI-39] Enabling New HDLs with Agents
链接: https://arxiv.org/abs/2501.00642
作者: Mark Zakharov,Farzaneh Rabiei Kashanaki,Jose Renau
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:
Abstract:Large Language Model (LLM) based agents are transforming the programming language landscape by facilitating learning for beginners, enabling code generation, and optimizing documentation workflows. Hardware Description Languages (HDLs), with their smaller user community, stand to benefit significantly from the application of LLMs as tools for learning new HDLs. This paper investigates the challenges and solutions of enabling LLMs for HDLs, particularly for HDLs that LLMs have not been previously trained on. This work introduces HDLAgent, an AI agent optimized for LLMs with limited knowledge of various HDLs. It significantly enhances off-the-shelf LLMs.
[AI-40] Unbiased GNN Learning via Fairness-Aware Subgraph Diffusion
链接: https://arxiv.org/abs/2501.00595
作者: Abdullah Alchihabi,Yuhong Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable efficacy in tackling a wide array of graph-related tasks across diverse domains. However, a significant challenge lies in their propensity to generate biased predictions, particularly with respect to sensitive node attributes such as age and gender. These biases, inherent in many machine learning models, are amplified in GNNs due to the message-passing mechanism, which allows nodes to influence each other, rendering the task of making fair predictions notably challenging. This issue is particularly pertinent in critical domains where model fairness holds paramount importance. In this paper, we propose a novel generative Fairness-Aware Subgraph Diffusion (FASD) method for unbiased GNN learning. The method initiates by strategically sampling small subgraphs from the original large input graph, and then proceeds to conduct subgraph debiasing via generative fairness-aware graph diffusion processes based on stochastic differential equations (SDEs). To effectively diffuse unfairness in the input data, we introduce additional adversary bias perturbations to the subgraphs during the forward diffusion process, and train score-based models to predict these applied perturbations, enabling them to learn the underlying dynamics of the biases present in the data. Subsequently, the trained score-based models are utilized to further debias the original subgraph samples through the reverse diffusion process. Finally, FASD induces fair node predictions on the input graph by performing standard GNN learning on the debiased subgraphs. Experimental results demonstrate the superior performance of the proposed method over state-of-the-art Fair GNN baselines across multiple benchmark datasets.
[AI-41] Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs
链接: https://arxiv.org/abs/2501.00555
作者: Harit Vishwakarma,Alan Mishler,Thomas Cook,Niccolò Dalmasso,Natraj Raman,Sumitra Ganesh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a score function into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP’s utility beyond uncertainty quantification to improve accuracy. We propose conformal revision of questions (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM’s chances of answering it correctly. Experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with Gemma-2, Llama-3 and Phi-3 models show that CP-OPT significantly reduces set sizes while maintaining coverage, and CROQ improves accuracy over the standard inference, especially when paired with CP-OPT scores. Together, CP-OPT and CROQ offer a robust framework for improving both the safety and accuracy of LLM-driven decision-making.
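To make the prediction-set idea in the abstract above concrete, here is a minimal sketch of standard split conformal prediction followed by a CROQ-style narrowing of the answer options. It is not the paper's CP-OPT procedure: the score used here (one minus the model's probability) is a common heuristic, and the calibration scores, option probabilities, and all function names are hypothetical values chosen for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(option_probs, threshold):
    """Keep each option whose nonconformity (1 - probability) is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= threshold]


def revise_question(options, kept):
    """CROQ-style revision (sketch): re-ask with only the options in the prediction set."""
    return {label: text for label, text in options.items() if label in kept}


# Hypothetical calibration scores: 1 - probability the model gave the true answer.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
q_hat = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical model probabilities for a new four-option question.
probs = {"A": 0.50, "B": 0.35, "C": 0.10, "D": 0.05}
options = {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Lille"}

kept = prediction_set(probs, q_hat)
revised = revise_question(options, kept)
print(q_hat, kept, revised)
```

With these made-up numbers the four options are narrowed to two, which is the mechanism the abstract credits for the accuracy gain: the revised prompt still contains the correct answer with high probability, but has fewer distractors.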
[AI-42] Extending XReason: Formal Explanations for Adversarial Detection
链接: https://arxiv.org/abs/2501.00537
作者: Amira Jemaa,Adnan Rashid,Sofiene Tahar
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: International Congress on Information and Communication Technology (ICICT), Lecture Notes in Networks and Systems (LNNS), Springer, 2025
Abstract:Explainable Artificial Intelligence (XAI) plays an important role in improving the transparency and reliability of complex machine learning models, especially in critical domains such as cybersecurity. Despite the prevalence of heuristic interpretation methods such as SHAP and LIME, these techniques often lack formal guarantees and may produce inconsistent local explanations. To fulfill this need, a few tools have emerged that use formal methods to provide formal explanations. Among these, XReason uses a SAT solver to generate formal instance-level explanations for XGBoost models. In this paper, we extend the XReason tool to support LightGBM models as well as class-level explanations. Additionally, we implement a mechanism to generate and detect adversarial examples in XReason. We evaluate the efficiency and accuracy of our approach on the CICIDS-2017 dataset, a widely used benchmark for detecting network attacks.
[AI-43] PyMilo: A Python Library for ML I/O
链接: https://arxiv.org/abs/2501.00528
作者: AmirHosein Rostami,Sepand Haghighi,Sadra Sabouri,Alireza Zolanvari
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures, 2 tables, 3 code blocks
Abstract:PyMilo is an open-source Python package that addresses the limitations of existing Machine Learning (ML) model storage formats by providing a transparent, reliable, and safe method for exporting and deploying trained models. Current formats, such as pickle and other binary formats, have significant problems, such as reliability, safety, and transparency issues. In contrast, PyMilo serializes ML models in a transparent non-executable format, enabling straightforward and safe model exchange, while also facilitating the deserialization and deployment of exported models in production environments. This package aims to provide a seamless, end-to-end solution for the exportation and importation of pre-trained ML models, which simplifies the model development and deployment pipeline.
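The contrast the abstract draws between pickle and a transparent, non-executable format can be illustrated with a small sketch. This is not PyMilo's API; it only assumes a toy linear model whose parameters are stored as plain JSON, and every function name here is hypothetical. Unlike unpickling, loading this file only parses data and cannot run code.

```python
import json


def export_linear_model(weights, bias):
    """Serialize parameters as human-readable JSON text (data only, nothing executable)."""
    return json.dumps({"type": "linear", "weights": weights, "bias": bias}, sort_keys=True)


def import_linear_model(text):
    """Parse and validate the JSON payload before rebuilding the parameters."""
    payload = json.loads(text)
    if payload.get("type") != "linear":
        raise ValueError("unsupported model type")
    return [float(w) for w in payload["weights"]], float(payload["bias"])


def predict(weights, bias, features):
    """Evaluate the toy linear model: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


text = export_linear_model([0.5, -1.0], 2.0)
weights, bias = import_linear_model(text)
result = predict(weights, bias, [4.0, 1.0])
print(text)
print(result)
```

The exported text can be inspected or diffed by a person, which is the transparency property the abstract emphasizes.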
[AI-44] A Method for Enhancing the Safety of Large Model Generation Based on Multi-dimensional Attack and Defense
链接: https://arxiv.org/abs/2501.00517
作者: Keke Zhai
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
Abstract:Currently, large models are prone to generating harmful content when faced with complex attack instructions, significantly reducing their defensive capabilities. To address this issue, this paper proposes a method based on constructing data aligned with multi-dimensional attack defense to enhance the generative security of large models. The core of our method lies in improving the effectiveness of safe alignment learning for large models by innovatively increasing the diversity of attack instruction dimensions and the accuracy of generating safe responses. To validate the effectiveness of our method, beyond existing security evaluation benchmarks, we additionally designed new security evaluation benchmarks and conducted comparative experiments using Llama3.2 as the baseline model. The final experimental results demonstrate that our method can significantly improve the generative security of large models under complex instructional attacks, while also maintaining and enhancing the models’ general capabilities.
[AI-45] Exploring Physics-Informed Neural Networks for Crop Yield Loss Forecasting NEURIPS2024
链接: https://arxiv.org/abs/2501.00502
作者: Miro Miranda,Marcela Charfuelan,Andreas Dengel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures, NeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning
Abstract:In response to climate change, assessing crop productivity under extreme weather conditions is essential to enhance food security. Crop simulation models, which align with physical processes, offer explainability but often perform poorly. Conversely, machine learning (ML) models for crop modeling are powerful and scalable yet operate as black boxes and lack adherence to crop growth’s physical principles. To bridge this gap, we propose a novel method that combines the strengths of both approaches by estimating the water use and the crop sensitivity to water scarcity at the pixel level. This approach enables yield loss estimation grounded in physical principles by sequentially solving the equation for crop yield response to water scarcity, using an enhanced loss function. Leveraging Sentinel-2 satellite imagery, climate data, simulated water use data, and pixel-level yield data, our model demonstrates high accuracy, achieving an R² of up to 0.77, matching or surpassing state-of-the-art models like RNNs and Transformers. Additionally, it provides interpretable and physically consistent outputs, supporting industry, policymakers, and farmers in adapting to extreme weather conditions.
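The abstract above does not write out the equation for crop yield response to water scarcity. A widely used form, which may or may not be the exact one the authors solve, is the FAO water production function (FAO Irrigation and Drainage Paper 33). In this sketch, Y_a and Y_m are the actual and maximum yield, ET_a and ET_m are the actual and maximum evapotranspiration (water use), and K_y is the crop's yield response factor, that is, its sensitivity to water scarcity.

```latex
% Relative yield loss is proportional to the relative evapotranspiration deficit.
1 - \frac{Y_a}{Y_m} = K_y \left( 1 - \frac{ET_a}{ET_m} \right)
```

Under this form, estimating the water-use ratio and K_y for each pixel is enough to recover a per-pixel yield loss, which matches the two quantities the abstract says the model estimates.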
[AI-46] Efficient support ticket resolution using Knowledge Graphs
链接: https://arxiv.org/abs/2501.00461
作者: Sherwin Varghese,James Tian
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:A review of over 160,000 customer cases indicates that product support spends about 90% of its time on roughly 10% of tickets, those for which a trivial solution may not exist. Many of these challenging cases require the support of several engineers working together within a “swarm”, and some also need to go to development support as bugs. These challenging customer issues represent a major opportunity for machine learning and knowledge graphs to identify the engineer or group of engineers (swarm) that can best address the issue, reducing wait times for the customer. The concrete ML task we consider here is a learning-to-rank (LTR) task: given an incident and a set of engineers currently assigned to the incident (which might be the empty set in the non-swarming context), produce a ranked list of the engineers best fit to help resolve that incident. To calculate the rankings, we may consider a wide variety of input features including the incident description provided by the customer, the affected component(s), engineer ratings of their expertise, knowledge base article text written by engineers, response to customer text written by engineers, and historic swarming data. The central hypothesis is that by including a holistic set of contextual data around which cases an engineer has solved, we can significantly improve the LTR algorithm over benchmark models. The article proposes a novel approach of modelling Knowledge Graph embeddings from multiple data sources, including the swarm information. The results obtained prove that by incorporating this additional context, we can improve the recommendations significantly over traditional machine learning methods like TF-IDF.
[AI-47] Do Students with Different Personality Traits Demonstrate Different Physiological Signals in Video-based Learning?
链接: https://arxiv.org/abs/2501.00449
作者: Chun-Hsiung Tseng,Hao-Chiang Koong Lin,Yung-Hui Chen,Jia-Rou Lin,Andrew Chih-Wei Huang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
Abstract:Past research shows that personality traits are a strong predictor of one’s academic performance. Today, mature and verified marker systems for assessing personality traits already exist. However, marker system-based assessment methods have their own limitations. For example, dishonest responses cannot be avoided. In this research, the goal is to develop a method that can overcome these limitations. The proposed method relies on physiological signals for the assessment. Thirty participants took part in this experiment. Based on the statistical results, we found that there are correlations between students’ personality traits and the changes in their physiological signals when learning via videos. Specifically, we found that participants’ degrees of extraversion, agreeableness, conscientiousness, and openness to experience are correlated with the variance of heart rates, the variance of GSR values, and the skewness of voice frequencies, etc.
[AI-48] Knowledge-aware equation discovery with automated background knowledge extraction
链接: https://arxiv.org/abs/2501.00444
作者: Elizaveta Ivanchik,Alexander Hvatov
类目: Artificial Intelligence (cs.AI)
*备注:
Abstract:In differential equation discovery algorithms, a priori expert knowledge is mainly used implicitly to constrain the form of the expected equation, making it impossible for the algorithm to truly discover equations. Instead, most differential equation discovery algorithms try to recover the coefficients for a known structure. In this paper, we describe an algorithm that allows the discovery of unknown equations using automatically or manually extracted background knowledge. Instead of imposing rigid constraints, we modify the structure space so that certain terms are likely to appear within the crossover and mutation operators. In this way, we mimic expertly chosen terms while preserving the possibility of obtaining any equation form. The paper shows that the extraction and use of knowledge allows it to outperform the SINDy algorithm in terms of search stability and robustness. Synthetic examples are given for Burgers, wave, and Korteweg–De Vries equations.
[AI-49] Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
链接: https://arxiv.org/abs/2501.00418
作者: Martin Pawelczyk,Lillian Sun,Zhenting Qi,Aounon Kumar,Himabindu Lakkaraju
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The first two authors contributed equally
Abstract:The rapid proliferation of generative AI, especially large language models, has led to their integration into a variety of applications. A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model’s outputs surpasses the weak model in task performance - has gained significant attention. Yet, whether critical trustworthiness properties such as robustness, fairness, and privacy can generalize similarly remains an open question. In this work, we study this question by examining if a stronger model can inherit trustworthiness properties when fine-tuned on a weaker model’s outputs, a process we term weak-to-strong trustworthiness generalization. To address this, we introduce two foundational training strategies: 1) Weak Trustworthiness Finetuning (Weak TFT), which leverages trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends regularization to both weak and strong models. Our experimental evaluation on real-world datasets reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore trustworthiness generalization via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of weak-to-strong generalization.
[AI-50] Proactive Conversational Agents with Inner Thoughts
链接: https://arxiv.org/abs/2501.00383
作者: Xingyu Bruce Liu,Shitao Fang,Weiyan Shi,Chien-Sheng Wu,Takeo Igarashi,Xiang ‘Anthony’ Chen
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
Abstract:One of the long-standing aspirations in conversational AI is to allow them to autonomously take initiatives in conversations, i.e., being proactive. This is especially challenging for multi-party conversations. Prior NLP research focused mainly on predicting the next speaker from contexts like preceding conversations. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that just like humans, rather than merely reacting to turn-taking cues, a proactive AI formulates its own inner thoughts during a conversation, and seeks the right moment to contribute. Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thoughts in parallel to the overt communication process, which enables it to proactively engage by modeling its intrinsic motivation to express these thoughts. We instantiated this framework into two real-time systems: an AI playground web app and a chatbot. Through a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects like anthropomorphism, coherence, intelligence, and turn-taking appropriateness.
[AI-51] Design Optimizer for Soft Growing Robot Manipulators in Three-Dimensional Environments
链接: https://arxiv.org/abs/2501.00368
作者: Ahmet Astar,Ozan Nurcan,Erk Demirel,Emir Ozen,Ozan Kutlar,Fabio Stroppa
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 20 pages, 10 figures
Abstract:Soft growing robots are novel devices that mimic plant-like growth for navigation in cluttered or dangerous environments. Their ability to adapt to surroundings, combined with advancements in actuation and manufacturing technologies, allows them to perform specialized manipulation tasks. This work presents an approach for design optimization of soft growing robots; specifically, the three-dimensional extension of the optimizer designed for planar manipulators. This tool is intended to be used by engineers and robot enthusiasts before manufacturing their robot: it suggests the optimal size of the robot for solving a specific task. The design process models a multi-objective optimization problem to refine a soft manipulator’s kinematic chain. Thanks to the novel Rank Partitioning algorithm integrated into Evolutionary Computation (EC) algorithms, this method achieves high precision in reaching targets and is efficient in resource usage. Results show significantly high performance in solving three-dimensional tasks, whereas comparative experiments indicate that the optimizer features robust output when tested with different EC algorithms, particularly genetic algorithms.
[AI-52] Low-Rank Adaptation for Foundation Models: A Comprehensive Review
链接: https://arxiv.org/abs/2501.00365
作者: Menglin Yang,Jialin Chen,Yifei Zhang,Jiahong Liu,Jiasheng Zhang,Qiyao Ma,Harshit Verma,Qianru Zhang,Min Zhou,Irwin King,Rex Ying
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
Abstract:The rapid advancement of foundation models, large-scale neural networks trained on diverse, extensive datasets, has revolutionized artificial intelligence, enabling unprecedented advancements across domains such as natural language processing, computer vision, and scientific discovery. However, the substantial parameter count of these models, often reaching billions or trillions, poses significant challenges in adapting them to specific downstream tasks. Low-Rank Adaptation (LoRA) has emerged as a highly promising approach for mitigating these challenges, offering a parameter-efficient mechanism to fine-tune foundation models with minimal computational overhead. This survey provides the first comprehensive review of LoRA techniques beyond Large Language Models to general foundation models, including recent technical foundations, emerging frontiers and applications of low-rank adaptation across multiple domains. Finally, this survey discusses key challenges and future research directions in theoretical understanding, scalability, and robustness. This survey serves as a valuable resource for researchers and practitioners working with efficient foundation model adaptation.
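For readers unfamiliar with the mechanism the survey above covers, here is a minimal sketch of the basic LoRA update in plain Python. The frozen weight W is left untouched and only two small matrices are trained: B (d_out x r) and A (r x d_in), combined as W + (alpha / r) * B A. The 3 x 3 sizes and all numeric values are toy assumptions for illustration. In practice B is typically initialized to zero so that training starts from the pretrained weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank (rows of A)."""
    r = len(a)
    delta = matmul(b, a)  # low-rank update with the same shape as W
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy frozen weight (3 x 3 identity) and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]  # d_out x r
a = [[0.5, 0.0, 1.0]]      # r x d_in

w_eff = lora_effective_weight(w, a, b, alpha=2.0)

full_params = len(w) * len(w[0])                    # parameters in W
lora_params = len(b) * len(b[0]) + len(a) * len(a[0])  # parameters in B and A
print(w_eff, full_params, lora_params)
```

The saving is small at this toy size (6 trainable values instead of 9) but grows quickly with dimension, since the adapter has r * (d_in + d_out) parameters against d_in * d_out for the full matrix.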
[AI-53] FORM: Learning Expressive and Transferable First-Order Logic Reward Machines AAMAS’25
链接: https://arxiv.org/abs/2501.00364
作者: Leo Ardon,Daniel Furelos-Blanco,Roko Parać,Alessandra Russo
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
*备注: AAMAS’25
Abstract:Reward machines (RMs) are an effective approach for addressing non-Markovian rewards in reinforcement learning (RL) through finite-state machines. Traditional RMs, which label edges with propositional logic formulae, inherit the limited expressivity of propositional logic. This limitation hinders the learnability and transferability of RMs since complex tasks will require numerous states and edges. To overcome these challenges, we propose First-Order Reward Machines (FORMs), which use first-order logic to label edges, resulting in more compact and transferable RMs. We introduce a novel method for learning FORMs and a multi-agent formulation for exploiting them and facilitating their transferability, where multiple agents collaboratively learn policies for a shared FORM. Our experimental results demonstrate the scalability of FORMs with respect to traditional RMs. Specifically, we show that FORMs can be effectively learnt for tasks where traditional RM learning approaches fail. We also show significant improvements in learning speed and task transferability thanks to the multi-agent learning framework and the abstraction provided by the first-order language.
[AI-54] Temporal Information Reconstruction and Non-Aligned Residual in Spiking Neural Networks for Speech Classification
链接: https://arxiv.org/abs/2501.00348
作者: Qi Zhang,Huamin Wang,Hangchi Shen,Shukai Duan,Shiping Wen,Tingwen Huang
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 9 pages, 5 figures
Abstract:Most recent models based on spiking neural networks (SNNs) use only a single temporal resolution for speech classification problems, which prevents them from learning the information of input data at different temporal scales. Additionally, because the time lengths of the data before and after the sub-modules of many models differ, effective residual connections cannot be applied to optimize the training processes of these models. To solve these problems, on the one hand, we reconstruct the temporal dimension of the audio spectrum and propose a novel method named Temporal Reconstruction (TR), inspired by the hierarchical process the human brain uses to understand speech. The reconstructed SNN model with TR can then learn the information of input data at different temporal resolutions and model more comprehensive semantic information from audio data. On the other hand, we propose the Non-Aligned Residual (NAR) method by analyzing the audio data, which allows residual connections to be used between two audio data with different time lengths. We have conducted extensive experiments on the Spiking Speech Commands (SSC), the Spiking Heidelberg Digits (SHD), and the Google Speech Commands v0.02 (GSC) datasets. According to the experimental results, we achieve a state-of-the-art (SOTA) test classification accuracy of 81.02% on SSC among all SNN models, and a SOTA classification accuracy of 96.04% on SHD among all models.
[AI-55] Autonomous Alignment with Human Value on Altruism through Considerate Self-imagination and Theory of Mind
链接: https://arxiv.org/abs/2501.00320
作者: Haibo Tong,Enmeng Lum,Yinqian Sun,Zhengqiang Han,Chao Liu,Feifei Zhao,Yi Zeng
类目: Artificial Intelligence (cs.AI)
*备注:
Abstract:With the widespread application of Artificial Intelligence (AI) in human society, enabling AI to autonomously align with human values has become a pressing issue to ensure its sustainable development and benefit to humanity. One of the most important aspects of aligning with human values is the necessity for agents to autonomously make altruistic, safe, and ethical decisions, considering and caring for human well-being. Current AI single-mindedly pursues absolute superiority in certain tasks, remaining indifferent to the surrounding environment and other agents, which has led to numerous safety risks. Altruistic behavior in human society originates from humans’ capacity for empathizing with others, known as Theory of Mind (ToM), combined with predictive imaginative interactions before taking action to produce thoughtful and altruistic behaviors. Inspired by this, we aim to endow agents with considerate self-imagination and ToM capabilities, driving them through implicit intrinsic motivations to autonomously align with human altruistic values. By integrating ToM within the imaginative space, agents keep an eye on the well-being of other agents in real time, proactively anticipate potential risks to themselves and others, and make thoughtful altruistic decisions that balance negative effects on the environment. The ancient Chinese story of Sima Guang Smashes the Vat illustrates the moral behavior of the young Sima Guang, who smashed a vat to save a child who had accidentally fallen into it, and it is an excellent reference scenario for this paper. We design an experimental scenario similar to Sima Guang Smashes the Vat and its variants with different complexities, which reflects the trade-offs and comprehensive considerations between self-goals, altruistic rescue, and avoiding negative side effects.
[AI-56] M2I2: Learning Efficient Multi-Agent Communication via Masked State Modeling and Intention Inference
链接: https://arxiv.org/abs/2501.00312
作者: Chuxiong Sun,Peng He,Qirui Ji,Zehua Zang,Jiangmeng Li,Rui Wang,Wei Wang
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
Abstract:Communication is essential in coordinating the behaviors of multiple agents. However, existing methods primarily emphasize content, timing, and partners for information sharing, often neglecting the critical aspect of integrating shared information. This gap can significantly impact agents’ ability to understand and respond to complex, uncertain interactions, thus affecting overall communication efficiency. To address this issue, we introduce M2I2, a novel framework designed to enhance the agents’ capabilities to assimilate and utilize received information effectively. M2I2 equips agents with advanced capabilities for masked state modeling and joint-action prediction, enriching their perception of environmental uncertainties and facilitating the anticipation of teammates’ intentions. This approach ensures that agents are furnished with both comprehensive and relevant information, bolstering more informed and synergistic behaviors. Moreover, we propose a Dimensional Rational Network, innovatively trained via a meta-learning paradigm, to identify the importance of dimensional pieces of information, evaluating their contributions to decision-making and auxiliary tasks. Then, we implement an importance-based heuristic for selective information masking and sharing. This strategy optimizes the efficiency of masked state modeling and the rationale behind information sharing. We evaluate M2I2 across diverse multi-agent tasks; the results demonstrate its superior performance, efficiency, and generalization capabilities over existing state-of-the-art methods in various complex scenarios.
[AI-57] Fast and Interpretable Mixed-Integer Linear Program Solving by Learning Model Reduction
链接: https://arxiv.org/abs/2501.00307
作者: Yixuan Li,Can Chen,Jiajun Li,Jiahui Duan,Xiongwei Han,Tao Zhong,Vincent Chau,Weiwei Wu,Wanyuan Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
Abstract:By exploiting the correlation between the structure and the solution of Mixed-Integer Linear Programming (MILP), Machine Learning (ML) has become a promising method for solving large-scale MILP problems. Existing ML-based MILP solvers mainly focus on end-to-end solution learning, which suffers from the scalability issue due to the high dimensionality of the solution space. Instead of directly learning the optimal solution, this paper aims to learn a reduced and equivalent model of the original MILP as an intermediate step. The reduced model often corresponds to interpretable operations and is much simpler, enabling us to solve large-scale MILP problems much faster than existing commercial solvers. However, current approaches rely only on the optimal reduced model, overlooking the significant preference information of all reduced models. To address this issue, this paper proposes a preference-based model reduction learning method, which considers the relative performance (i.e., objective cost and constraint feasibility) of all reduced models on each MILP instance as preferences. We also introduce an attention mechanism to capture and represent preference information, which helps improve the performance of model reduction learning tasks. Moreover, we propose a SetCover based pruning method to control the number of reduced models (i.e., labels), thereby simplifying the learning process. Evaluation on real-world MILP problems shows that 1) compared to the state-of-the-art model reduction ML methods, our method obtains nearly 20% improvement on solution accuracy, and 2) compared to the commercial solver Gurobi, two to four orders of magnitude speedups are achieved.
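The abstract above mentions a SetCover-based pruning step to limit the number of reduced models used as labels, without giving details. As a rough illustration only, the classic greedy set-cover heuristic is sketched below. It assumes, hypothetically, that each candidate reduced model "covers" the MILP instances on which it performs acceptably, and it picks a small subset of models that together cover every instance. The coverage sets and names are invented, and the paper's actual procedure may differ.

```python
def greedy_set_cover(universe, candidates):
    """Greedy heuristic: repeatedly pick the candidate covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some elements cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical data: five MILP instances and the instances each reduced model handles well.
instances = {1, 2, 3, 4, 5}
covers = {
    "m1": {1, 2, 3},
    "m2": {2, 4},
    "m3": {3, 4, 5},
    "m4": {5},
}

selected = greedy_set_cover(instances, covers)
print(selected)
```

Here two of the four candidate models are enough to cover all five instances, which is the kind of label reduction the abstract describes as simplifying the learning process.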
[AI-58] Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization
链接: https://arxiv.org/abs/2501.00298
作者: Huanting Wang,Patrick Lenihan,Zheng Wang
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
Abstract:Supervised machine learning techniques have shown promising results in code analysis and optimization problems. However, a learning-based solution can be brittle because minor changes in hardware or application workloads – such as facing a new CPU architecture or code pattern – may jeopardize decision accuracy, ultimately undermining model robustness. We introduce Prom, an open-source library to enhance the robustness and performance of predictive models against such changes during deployment. Prom achieves this by using statistical assessments to identify test samples prone to mispredictions and using feedback on these samples to improve a deployed model. We showcase Prom by applying it to 13 representative machine learning models across 5 code analysis and optimization tasks. Our extensive evaluation demonstrates that Prom can successfully identify an average of 96% (up to 100%) of mispredictions. By relabeling up to 5% of the Prom-identified samples through incremental learning, Prom can help a deployed model achieve a performance comparable to that attained during its model training phase.
[AI-59] Enhancing Wireless Sensor Network Security through Integration with the ServiceNow Cloud Platform
链接: https://arxiv.org/abs/2501.00264
作者: Syed Atif Ali,Salwa Din
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures
Abstract:Wireless Sensor Networks (WSNs) continue to experience rapid developments and integration into modern-day applications. Overall, WSNs collect and process relevant data through sensors or nodes and communicate with different networks for superior information management. Nevertheless, a primary concern relative to WSNs is security. Considering the high constraints on throughput, battery, processing power, and memory, typical security procedures present limitations for application in WSNs. This research focuses on the integration of WSNs with the cloud platform, specifically to address these security risks. The cloud platform also adopts a security-driven approach and has attracted many applications across various sectors globally. This research specifically explores how cloud computing could be exploited to impede Denial of Service attacks from endangering WSNs. WSNs are now deployed in various low-powered applications, including disaster management, homeland security, battlefield surveillance, agriculture, and the healthcare industry. WSNs are distinguished from traditional networks by the numerous wireless connected sensors being deployed to conduct an assigned task. In testing scenarios, the size of WSNs ranges from a few to several thousand. The overarching requirements of WSNs include rapid processing of collected data, low-cost installation and maintenance, and low latency in network operations. Given that a substantial amount of WSN applications are used in high-risk and volatile environments, they must effectively address security concerns. This includes the secure movement, storage, and communication of data through networks, an environment in which WSNs are notably vulnerable. The limitations of WSNs have meant that they are predominantly used in unsecured applications despite positive advancements. This study explores methods for integrating the WSN with the cloud.
[AI-60] Collaborative Approaches to Enhancing Smart Vehicle Cybersecurity by AI-Driven Threat Detection
链接: https://arxiv.org/abs/2501.00261
作者: Syed Atif Ali,Salwa Din
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:The introduction sets the stage for exploring collaborative approaches to bolstering smart vehicle cybersecurity through AI-driven threat detection. As the automotive industry increasingly adopts connected and automated vehicles (CAVs), the need for robust cybersecurity measures becomes paramount. With the emergence of new vulnerabilities and security requirements, the integration of advanced technologies such as 5G networks, blockchain, and quantum computing presents promising avenues for enhancing CAV cybersecurity. Additionally, the roadmap for cybersecurity in autonomous vehicles emphasizes the importance of efficient intrusion detection systems and AI-based techniques, along with the integration of secure hardware, software stacks, and advanced threat intelligence to address cybersecurity challenges in future autonomous vehicles.
[AI-61] Federated Deep Subspace Clustering
链接: https://arxiv.org/abs/2501.00230
作者: Yupei Zhang,Ruojia Feng,Yifei Wang,Xuequn Shang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 8 pages, 4 figures, 4 tables
Abstract:This paper introduces FDSC, a privacy-preserving subspace clustering (SC) approach with a federated learning (FL) schema. Each client holds a deep subspace clustering network that groups its isolated data, composed of an encoder network, a self-expressive layer, and a decoder network. FDSC is achieved by uploading the encoder network to the server to communicate with other clients. FDSC is further enhanced by preserving the local neighborhood relationship in each client. With the effects of federated learning and locality preservation, the features learned by the encoder are boosted, which enhances self-expressiveness learning and results in better clustering performance. Experiments on public datasets compare FDSC with other clustering methods and demonstrate its effectiveness.
[AI-62] CancerKG.ORG A Web-scale Interactive Verifiable Knowledge Graph-LLM Hybrid for Assisting with Optimal Cancer Treatment and Care
链接: https://arxiv.org/abs/2501.00223
作者: Michael Gubanov,Anna Pyayt,Aleksandra Karolak
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Here, we describe one of the first Web-scale hybrid Knowledge Graph (KG)-Large Language Model (LLM) systems, populated with the latest peer-reviewed medical knowledge on colorectal cancer. It is currently being evaluated to assist with both medical research and clinical information retrieval tasks at Moffitt Cancer Center, one of the top cancer centers in the U.S. and in the world. Our hybrid is remarkable in that it serves user needs better than an LLM, a KG, or a search engine in isolation. LLMs as-is are known to exhibit hallucinations and catastrophic forgetting, and are trained on outdated corpora. State-of-the-art KGs, such as PrimeKG, cBioPortal, ChEMBL, NCBI, and others, require manual curation and hence quickly become stale. CancerKG is unsupervised and is capable of automatically ingesting and organizing the latest medical findings. To alleviate the LLMs' shortcomings, the verified KG serves as a Retrieval Augmented Generation (RAG) guardrail. CancerKG offers 5 different advanced user interfaces, each tailored to serve a different data modality better and more conveniently for the user.
[AI-63] The Potential of LLMs in Automating Software Testing: From Generation to Reporting
链接: https://arxiv.org/abs/2501.00217
作者: Betim Sherifi,Khaled Slhoub,Fitzroy Nembhard
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 1 table
Abstract:High-quality software is essential in software engineering and requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system’s high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.
[AI-64] Debunking the CUDA Myth Towards GPU-based AI Systems
链接: https://arxiv.org/abs/2501.00210
作者: Yunjae Lee,Juntaek Lim,Jehyeon Bang,Eunyeong Cho,Huijong Jeong,Taesu Kim,Hyungjun Kim,Joonhyung Lee,Jinseop Im,Ranggi Hwang,Se Jung Kwon,Dongsoo Lee,Minsoo Rhu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Under Review
Abstract:With the rise of AI, NVIDIA GPUs have become the de facto standard for AI system design. This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs for AI model serving. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU’s programmability by discussing several software-level optimization strategies to employ for implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU’s dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA’s robust software ecosystem.
[AI-65] Towards Unraveling and Improving Generalization in World Models NEURIPS
链接: https://arxiv.org/abs/2501.00195
作者: Qiaoyi Fang,Weiyu Du,Hang Wang,Junshan Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: An earlier version of this paper was submitted to NeurIPS and received ratings of (7, 6, 6). The reviewers’ comments and the original draft are available at OpenReview. This version contains minor modifications based on that submission
Abstract:World models have recently emerged as a promising approach to reinforcement learning (RL), achieving state-of-the-art performance across a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating the world model learning as a stochastic dynamical system, and characterize the impact of latent representation errors on robustness and generalization, for both cases with zero-drift representation errors and with non-zero-drift representation errors. Our somewhat surprising findings, based on both theoretic and experimental studies, reveal that for the case with zero drift, modest latent representation errors can in fact function as implicit regularization and hence result in improved robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error propagation effects of non-zero drift, thereby enhancing training stability and robustness. Our experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves accuracy of long-horizon prediction.
[AI-66] SepsisCalc: Integrating Clinical Calculators into Early Sepsis Prediction via Dynamic Temporal Graph Construction
链接: https://arxiv.org/abs/2501.00190
作者: Changchang Yin,Shihan Fu,Bingsheng Yao,Thai-Hoang Pham,Weidan Cao,Dakuo Wang,Jeffrey Caterino,Ping Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Sepsis is an organ dysfunction caused by a dysregulated immune response to an infection. Early sepsis prediction and identification allow for timely intervention, leading to improved clinical outcomes. Clinical calculators (e.g., the six-organ dysfunction assessment of SOFA) play a vital role in sepsis identification within clinicians’ workflow, providing evidence-based risk assessments essential for sepsis diagnosis. However, artificial intelligence (AI) sepsis prediction models typically generate a single sepsis risk score without incorporating clinical calculators for assessing organ dysfunctions, making the models less convincing and transparent to clinicians. To bridge the gap, we propose to mimic clinicians’ workflow with a novel framework SepsisCalc to integrate clinical calculators into the predictive model, yielding a clinically transparent and precise model for utilization in clinical settings. Practically, clinical calculators usually combine information from multiple component variables in Electronic Health Records (EHR), and might not be applicable when the variables are (partially) missing. We mitigate this issue by representing EHRs as temporal graphs and integrating a learning module to dynamically add the accurately estimated calculator to the graphs. Experimental results on real-world datasets show that the proposed model outperforms state-of-the-art methods on sepsis prediction tasks. Moreover, we developed a system to identify organ dysfunctions and potential sepsis risks, providing a human-AI interaction tool for deployment, which can help clinicians understand the prediction outputs and prepare timely interventions for the corresponding dysfunctions, paving the way for actionable clinical decision-making support for early intervention.
[AI-67] Federated Learning with Workload Reduction through Partial Training of Client Models and Entropy-Based Data Selection
链接: https://arxiv.org/abs/2501.00170
作者: Hongrui Shi,Valentin Radu,Po Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:With the rapid expansion of edge devices, such as IoT devices, where crucial data needed for machine learning applications is generated, it becomes essential to promote their participation in privacy-preserving Federated Learning (FL) systems. The best way to achieve this desideratum is by reducing their training workload to match their constrained computational resources. While prior FL research has addressed the workload constraints by introducing lightweight models on the edge, limited attention has been given to optimizing on-device training efficiency by reducing the amount of data needed during training. In this work, we propose FedFT-EDS, a novel approach that combines Fine-Tuning of partial client models with Entropy-based Data Selection to reduce training workloads on edge devices. By actively selecting the most informative local instances for learning, FedFT-EDS reduces training data significantly in FL and demonstrates that not all user data is equally beneficial for FL in all rounds. Our experiments on CIFAR-10 and CIFAR-100 show that FedFT-EDS uses only 50% of user data while improving the global model performance compared to the baseline methods FedAvg and FedProx. Importantly, FedFT-EDS improves client learning efficiency by up to 3 times, using one third of the training time on clients to achieve performance equivalent to the baselines. This work highlights the importance of data selection in FL and presents a promising pathway to scalable and efficient Federated Learning.
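The entropy-based selection idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the FedFT-EDS implementation: it assumes that "most informative" means highest predictive entropy, and the names `predictive_entropy`, `select_by_entropy`, and `keep_fraction` are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_by_entropy(prob_rows, keep_fraction=0.5):
    """Return indices of the highest-entropy samples, most uncertain first.

    Assumption: high entropy is used as a proxy for informativeness.
    """
    n_keep = max(1, int(len(prob_rows) * keep_fraction))
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: predictive_entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:n_keep]

# Softmax outputs of a hypothetical client model on four local samples.
probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform, highest entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
]
selected = select_by_entropy(probs, keep_fraction=0.5)
print(selected)
```

With `keep_fraction=0.5`, a client in this sketch would train on only the two most uncertain samples in the round, which mirrors the 50% data budget mentioned in the abstract.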
[AI-68] Class-based Subset Selection for Transfer Learning under Extreme Label Shift
链接: https://arxiv.org/abs/2501.00162
作者: Akul Goyal,Carl Edwards
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages
Abstract:Existing work within transfer learning often follows a two-step process – pre-training over a large-scale source domain and then finetuning over limited samples from the target domain. Yet, despite its popularity, this methodology has been shown to suffer in the presence of distributional shift – specifically when the output spaces diverge. Previous work has focused on increasing model performance within this setting by identifying and classifying only the shared output classes between distributions. However, these methods are inherently limited as they ignore classes outside the shared class set, disregarding potential information relevant to the model transfer. This paper proposes a new process for few-shot transfer learning that selects and weighs classes from the source domain to optimize the transfer between domains. More concretely, we use Wasserstein distance to choose a set of source classes and their weights that minimize the distance between the source and target domain. To justify our proposed algorithm, we provide a generalization analysis of the performance of the learned classifier over the target domain and show that our method corresponds to a bound minimization algorithm. We empirically demonstrate the effectiveness of our approach (WaSS) by experimenting on several different datasets and presenting superior performance within various label shift settings, including the extreme case where the label spaces are disjoint.
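As a rough illustration of using Wasserstein distance to compare source classes against a target domain, the sketch below ranks source classes by their 1-D Wasserstein-1 distance to a target sample. It is a simplification under stated assumptions: features are scalar, all samples have equal size, and classes are only ranked rather than jointly weighted as the abstract describes. The helper names and the toy data are hypothetical.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def rank_source_classes(source, target):
    """Order source class names from closest to farthest from the target."""
    return sorted(source, key=lambda name: wasserstein_1d(source[name], target))

# Toy scalar features per source class and for the target domain.
source = {
    "cat": [0.9, 1.1, 1.0, 1.2],
    "truck": [5.0, 5.5, 4.8, 5.2],
    "dog": [1.4, 1.6, 1.5, 1.3],
}
target = [1.0, 1.2, 1.1, 1.3]
ranking = rank_source_classes(source, target)
print(ranking)
```

A selection step could then keep the first few classes in `ranking`; choosing weights for them, as the paper proposes, would require an optimization that this sketch does not attempt.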
[AI-69] Probabilistic Explanations for Linear Models AAAI
链接: https://arxiv.org/abs/2501.00154
作者: Bernardo Subercaseaux,Marcelo Arenas,Kuldeep S Meel
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注: Extended version of AAAI paper
Abstract:Formal XAI is an emerging field that focuses on providing explanations with mathematical guarantees for the decisions made by machine learning models. A significant amount of work in this area is centered on the computation of “sufficient reasons”. Given a model $M$ and an input instance $\vec{x}$, a sufficient reason for the decision $M(\vec{x})$ is a subset $S$ of the features of $\vec{x}$ such that for any instance $\vec{z}$ that has the same values as $\vec{x}$ for every feature in $S$, it holds that $M(\vec{x}) = M(\vec{z})$. Intuitively, this means that the features in $S$ are sufficient to fully justify the classification of $\vec{x}$ by $M$. For sufficient reasons to be useful in practice, they should be as small as possible, and a natural way to reduce the size of sufficient reasons is to consider a probabilistic relaxation; the probability of $M(\vec{x}) = M(\vec{z})$ must be at least some value $\delta \in (0,1]$, for a random instance $\vec{z}$ that coincides with $\vec{x}$ on the features in $S$. Computing small $\delta$-sufficient reasons ($\delta$-SRs) is known to be a theoretically hard problem; even over decision trees, traditionally deemed simple and interpretable models, strong inapproximability results make the efficient computation of small $\delta$-SRs unlikely. We propose the notion of $(\delta, \epsilon)$-SR, a simple relaxation of $\delta$-SRs, and show that this kind of explanation can be computed efficiently over linear models.
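The two definitions in this abstract can be made concrete for a small linear classifier. The sketch below only illustrates the definitions; it is not the paper's algorithm. It assumes binary features in {0, 1}, a uniform distribution over the free features, and a decision rule "positive iff score >= 0". All function names are hypothetical, and the exact probability is computed by brute-force enumeration, which is only feasible for a handful of features.

```python
from itertools import product

def classify(weights, bias, x):
    """Linear decision rule: positive iff w.x + b >= 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias >= 0

def is_sufficient_reason(weights, bias, x, fixed):
    """True if fixing the features in `fixed` forces the decision on x
    for every binary completion of the remaining features."""
    base = sum(weights[i] * x[i] for i in fixed) + bias
    free = [i for i in range(len(x)) if i not in fixed]
    if classify(weights, bias, x):
        # Worst case: every free feature pushes the score down as far as it can.
        return base + sum(min(weights[i], 0.0) for i in free) >= 0
    # Best case: every free feature pushes the score up as far as it can.
    return base + sum(max(weights[i], 0.0) for i in free) < 0

def agreement_probability(weights, bias, x, fixed):
    """Exact P[M(z) = M(x)] over uniform binary completions z of x outside `fixed`."""
    free = [i for i in range(len(x)) if i not in fixed]
    target = classify(weights, bias, x)
    agree = 0
    for values in product([0, 1], repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        agree += classify(weights, bias, z) == target
    return agree / 2 ** len(free)

weights, bias = [3.0, -1.0, -1.0, -2.0], -0.5
x = [1, 0, 0, 0]
print(is_sufficient_reason(weights, bias, x, {0}))
print(is_sufficient_reason(weights, bias, x, {0, 3}))
print(agreement_probability(weights, bias, x, {0}))
```

In this toy instance, fixing only feature 0 is not a sufficient reason, yet it still qualifies as a δ-sufficient reason for any δ up to the printed agreement probability, which shows how the probabilistic relaxation admits smaller explanations.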
[AI-70] NiaAutoARM: Automated generation and evaluation of Association Rule Mining pipelines
链接: https://arxiv.org/abs/2501.00138
作者: Uroš Mlakar,Iztok Fister Jr.,Iztok Fister
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:The Numerical Association Rule Mining paradigm, which deals concurrently with numerical and categorical attributes, is beneficial for discovering associations in datasets that contain both kinds of features. The process is not easy, since it incorporates several processing steps that run sequentially and form an entire pipeline, e.g., preprocessing, algorithm selection, hyper-parameter optimization, and the definition of metrics evaluating the quality of the association rules. In this paper, we propose a novel Automated Machine Learning method, NiaAutoARM, for automatically constructing full association rule mining pipelines based on stochastic population-based meta-heuristics. Along with the theoretical representation of the proposed method, we also present a comprehensive experimental evaluation.
[AI-71] AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility
链接: https://arxiv.org/abs/2501.00113
作者: Yixian Shen,Hang Zhang,Yanxin Shen,Lun Wang,Chuanqi Shi,Shaoshuai Du,Yiyi Tao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Digital accessibility is a cornerstone of inclusive content delivery, yet many EPUB files fail to meet fundamental accessibility standards, particularly in providing descriptive alt text for images. Alt text plays a critical role in enabling visually impaired users to understand visual content through assistive technologies. However, generating high-quality alt text at scale is a resource-intensive process, creating significant challenges for organizations aiming to ensure accessibility compliance. This paper introduces AltGen, a novel AI-driven pipeline designed to automate the generation of alt text for images in EPUB files. By integrating state-of-the-art generative models, including advanced transformer-based architectures, AltGen achieves contextually relevant and linguistically coherent alt text descriptions. The pipeline encompasses multiple stages, starting with data preprocessing to extract and prepare relevant content, followed by visual analysis using computer vision models such as CLIP and ViT. The extracted visual features are enriched with contextual information from surrounding text, enabling the fine-tuned language models to generate descriptive and accurate alt text. Validation of the generated output employs both quantitative metrics, such as cosine similarity and BLEU scores, and qualitative feedback from visually impaired users. Experimental results demonstrate the efficacy of AltGen across diverse datasets, achieving a 97.5% reduction in accessibility errors and high scores in similarity and linguistic fidelity metrics. User studies highlight the practical impact of AltGen, with participants reporting significant improvements in document usability and comprehension. Furthermore, comparative analyses reveal that AltGen outperforms existing approaches in terms of accuracy, relevance, and scalability. 
[AI-72] An Unsupervised Anomaly Detection in Electricity Consumption Using Reinforcement Learning and Time Series Forest Based Framework
链接: https://arxiv.org/abs/2501.00107
作者: Jihan Ghanim,Mariette Awad
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Anomaly detection (AD) plays a crucial role in time series applications, primarily because time series data is employed across real-world scenarios. Detecting anomalies poses significant challenges since anomalies take diverse forms, making them hard to pinpoint accurately. Previous research has explored different AD models, making specific assumptions with varying sensitivity toward particular anomaly types. To address this issue, we propose a novel model selection for unsupervised AD using a combination of time series forest (TSF) and reinforcement learning (RL) approaches that dynamically chooses an AD technique. Our approach allows for effective AD without explicitly depending on ground truth labels that are often scarce and expensive to obtain. Results from the real time-series dataset demonstrate that the proposed model selection approach outperforms all other AD models in terms of the F1 score metric. For the synthetic dataset, our proposed model surpasses all other AD models except for KNN, with an impressive F1 score of 0.989. The proposed model selection framework also exceeded the performance of GPT-4 when prompted to act as an anomaly detector on the synthetic dataset. Exploring different reward functions revealed that the original reward function in our proposed AD model selection approach yielded the best overall scores. We evaluated the performance of the six AD models on an additional three datasets, having global, local, and clustered anomalies respectively, showing that each AD model exhibited distinct performance depending on the type of anomalies. This emphasizes the significance of our proposed AD model selection framework, maintaining high performance across all datasets, and showcasing superior performance across different anomaly types.
[AI-73] LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
链接: https://arxiv.org/abs/2501.00106
作者: Jingwen Tan,Gopi Krishnan Rajbahadur,Zi Li,Xiangfu Song,Jianshan Lin,Dan Li,Zibin Zheng,Ahmed E. Hassan
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Dataset license compliance is a critical yet complex aspect of developing commercial AI products, particularly with the increasing use of publicly available datasets. Ambiguities in dataset licenses pose significant legal risks, making it challenging even for software IP lawyers to accurately interpret rights and obligations. In this paper, we introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We first evaluate existing legal FMs (i.e., FMs specialized in understanding and processing legal texts) and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. LicenseGPT, fine-tuned on a curated dataset of 500 licenses annotated by legal experts, significantly improves PA to 64.30%, outperforming both legal and general-purpose FMs. Through an A/B test and user study with software IP lawyers, we demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy. Software IP lawyers perceive LicenseGPT as a valuable supplementary tool that enhances efficiency while acknowledging the need for human oversight in complex cases. Our work underscores the potential of specialized AI tools in legal practice and offers a publicly available resource for practitioners and researchers.
[AI-74] Machine Learning-Based Security Policy Analysis
链接: https://arxiv.org/abs/2501.00085
作者: Krish Jain,Joann Sum,Pranav Kapoor,Amir Eaman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Security-Enhanced Linux (SELinux) is a robust security mechanism that enforces mandatory access controls (MAC), but its policy language’s complexity creates challenges for policy analysis and management. This research investigates the automation of SELinux policy analysis using graph-based techniques combined with machine learning approaches to detect policy anomalies. The study addresses two key questions: Can SELinux policy analysis be automated through graph analysis, and how do different anomaly detection models compare in analyzing SELinux policies? We will be comparing different machine learning models by evaluating their effectiveness in detecting policy violations and anomalies. Our approach utilizes Neo4j for graph representation of policies, with Node2vec transforming these graph structures into meaningful vector embeddings that can be processed by our machine learning models. In our results, the MLP Neural Network consistently demonstrated superior performance across different dataset sizes, achieving 95% accuracy with balanced precision and recall metrics, while both Random Forest and SVM models showed competitive but slightly lower performance in detecting policy violations. This combination of graph-based modeling and machine learning provides a more sophisticated and automated approach to understanding and analyzing complex SELinux policies compared to traditional manual analysis methods.
[AI-75] AI Agent for Education: von Neumann Multi-Agent System Framework
链接: https://arxiv.org/abs/2501.00083
作者: Yuan-Hao Jiang,Ruijia Li,Yizhou Zhou,Changyong Qi,Hanglei Hu,Yuang Wei,Bo Jiang,Yonghe Wu
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Conference Proceedings of the 28th Global Chinese Conference on Computers in Education, GCCCE 2024
Abstract:The development of large language models has ushered in new paradigms for education. This paper centers on the multi-Agent system in education and proposes the von Neumann multi-Agent system framework. It breaks down each AI Agent into four modules: control unit, logic unit, storage unit, and input-output devices, defining four types of operations: task deconstruction, self-reflection, memory processing, and tool invocation. Furthermore, it introduces related technologies such as Chain-of-Thought, Reason+Act, and Multi-Agent Debate associated with these four types of operations. The paper also discusses the ability enhancement cycle of a multi-Agent system for education, including the outer circulation for human learners to promote knowledge construction and the inner circulation for LLM-based-Agents to enhance swarm intelligence. Through collaboration and reflection, the multi-Agent system can better facilitate human learners’ learning and enhance their teaching abilities in this process.
[AI-76] Human-like Bots for Tactical Shooters Using Compute-Efficient Sensors
链接: https://arxiv.org/abs/2501.00078
作者: Niels Justesen(modl.ai),Maria Kaselimi(modl.ai),Sam Snodgrass(modl.ai),Miruna Vozaru(modl.ai),Matthew Schlegel(modl.ai),Jonas Wingren(modl.ai),Gabriella A. B. Barros(modl.ai),Tobias Mahlmann(modl.ai),Shyam Sudhakaran(modl.ai),Wesley Kerr(Riot Games),Albert Wang(Riot Games),Christoffer Holmgård(modl.ai),Georgios N. Yannakakis(modl.ai),Sebastian Risi(modl.ai),Julian Togelius(modl.ai)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Artificial intelligence (AI) has enabled agents to master complex video games, from first-person shooters like Counter-Strike to real-time strategy games such as StarCraft II and racing games like Gran Turismo. While these achievements are notable, applying these AI methods in commercial video game production remains challenging due to computational constraints. In commercial scenarios, the majority of computational resources are allocated to 3D rendering, leaving limited capacity for AI methods, which often demand high computational power, particularly those relying on pixel-based sensors. Moreover, the gaming industry prioritizes creating human-like behavior in AI agents to enhance player experience, unlike academic models that focus on maximizing game performance. This paper introduces a novel methodology for training neural networks via imitation learning to play a complex, commercial-standard, VALORANT-like 2v2 tactical shooter game, requiring only modest CPU hardware during inference. Our approach leverages an innovative, pixel-free perception architecture using a small set of ray-cast sensors, which capture essential spatial information efficiently. These sensors allow AI to perform competently without the computational overhead of traditional methods. Models are trained to mimic human behavior using supervised learning on human trajectory data, resulting in realistic and engaging AI agents. Human evaluation tests confirm that our AI agents provide human-like gameplay experiences while operating efficiently under computational constraints. This offers a significant advancement in AI model development for tactical shooter games and possibly other genres.
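To give a feel for pixel-free perception with ray-cast sensors, the sketch below casts a few rays over a 2-D grid map and returns normalized hit distances as the observation vector. It is a toy under stated assumptions, not the authors' sensor design: the world is a 2-D character grid where `#` marks a wall, rays are marched in fixed steps, and the names `cast_ray`, `sense`, and `GRID` are hypothetical.

```python
import math

def cast_ray(grid, x, y, angle, max_dist=10.0, step=0.05):
    """March along a ray until it enters a wall cell ('#') or leaves the grid."""
    dx, dy = math.cos(angle), math.sin(angle)
    for k in range(round(max_dist / step) + 1):
        dist = k * step
        col = math.floor(x + dist * dx)
        row = math.floor(y + dist * dy)
        outside = row < 0 or row >= len(grid) or col < 0 or col >= len(grid[0])
        if outside or grid[row][col] == "#":
            return dist
    return max_dist

def sense(grid, x, y, n_rays=8, max_dist=10.0):
    """Return n_rays normalized distances in [0, 1], evenly spaced over 360 degrees."""
    return [
        cast_ray(grid, x, y, 2 * math.pi * i / n_rays, max_dist) / max_dist
        for i in range(n_rays)
    ]

GRID = [
    "#######",
    "#.....#",
    "#.....#",
    "#.....#",
    "#######",
]
# Agent at column 1.5, row 2.5; rays point east, south, west, north (y grows downward).
obs = sense(GRID, x=1.5, y=2.5, n_rays=4)
print([round(d, 2) for d in obs])
```

An observation like `obs` is a handful of floats rather than an image, which is the kind of compact input that makes CPU-only inference plausible for a small policy network.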
[AI-77] A Novel Framework for Learning Stochastic Representations for Sequence Generation and Recognition
链接: https://arxiv.org/abs/2501.00076
作者: Jungsik Hwang,Ahmadreza Ahmadi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 14 pages, 6 figures
Abstract:The ability to generate and recognize sequential data is fundamental for autonomous systems operating in dynamic environments. Inspired by two key principles of the brain, predictive coding and the Bayesian brain, we propose a novel stochastic Recurrent Neural Network with Parametric Biases (RNNPB). The proposed model incorporates stochasticity into the latent space using the reparameterization trick used in variational autoencoders. This approach enables the model to learn probabilistic representations of multidimensional sequences, capturing uncertainty and enhancing robustness against overfitting. We tested the proposed model on a robotic motion dataset to assess its performance in generating and recognizing temporal patterns. The experimental results showed that the stochastic RNNPB model outperformed its deterministic counterpart in generating and recognizing motion sequences. The results highlighted the proposed model’s capability to quantify and adjust uncertainty during both learning and inference. The stochasticity resulted in a continuous latent space representation, facilitating stable motion generation and enhanced generalization when recognizing novel sequences. Our approach provides a biologically inspired framework for modeling temporal patterns and advances the development of robust and adaptable systems in artificial intelligence and robotics.
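The reparameterization trick mentioned in this abstract writes a Gaussian sample as a deterministic function of the mean, the log-variance, and independent noise: z = mu + sigma * eps with eps ~ N(0, 1). The sketch below shows only that sampling step in plain Python, with hypothetical names. It does not model the RNNPB itself, and since there is no autodiff here, the benefit (gradients can flow through `mu` and `log_var`) is stated rather than demonstrated.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    sigma is recovered from the log-variance as exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

rng = random.Random(0)       # seeded so the run is repeatable
mu = [1.0, -2.0]             # hypothetical latent means
log_var = [0.0, -20.0]       # sigma = 1.0 and sigma = exp(-10)
samples = [reparameterize(mu, log_var, rng) for _ in range(2000)]
mean_first = sum(s[0] for s in samples) / len(samples)
print(round(mean_first, 2))
```

The second latent dimension has a tiny variance, so its samples stay very close to the mean; this is the sense in which a learned variance lets a model of this kind express how uncertain each latent dimension is.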
[AI-78] Open-Book Neural Algorithmic Reasoning NEURIPS2024
Link: https://arxiv.org/abs/2501.00072
Authors: Hefei Li, Chao Peng, Chenyang Xu, Zhengfeng Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Appeared at NeurIPS 2024
Abstract:Neural algorithmic reasoning is an emerging area of machine learning that focuses on building neural networks capable of solving complex algorithmic tasks. Recent advancements predominantly follow the standard supervised learning paradigm: feeding an individual problem instance into the network each time and training it to approximate the execution steps of a classical algorithm. We challenge this mode and propose a novel open-book learning framework. In this framework, whether during training or testing, the network can access and utilize all instances in the training dataset when reasoning for a given instance. Empirical evaluation is conducted on the challenging CLRS Algorithmic Reasoning Benchmark, which consists of 30 diverse algorithmic tasks. Our open-book learning framework exhibits a significant enhancement in neural reasoning capabilities. Further, we notice that there is recent literature suggesting that multi-task training on CLRS can improve the reasoning accuracy of certain tasks, implying intrinsic connections between different algorithmic tasks. We delve into this direction via the open-book framework. When the network reasons for a specific task, we enable it to aggregate information from training instances of other tasks in an attention-based manner. We show that this open-book attention mechanism offers insights into the inherent relationships among various tasks in the benchmark and provides a robust tool for interpretable multi-task training.
[AI-79] Ensemble of classifiers for speech evaluation
Link: https://arxiv.org/abs/2501.00067
Authors: G. Belokrylov, A. Korenev, B. Lodonova, A. Novokhrestov
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:The article describes an attempt to apply an ensemble of binary classifiers to solve the problem of speech assessment in medicine. A dataset was compiled based on quantitative and expert assessments of syllable pronunciation quality. Quantitative assessments of 7 selected metrics were used as features: dynamic time warping distance, Minkowski distance, correlation coefficient, longest common subsequence (LCSS), edit distance on real sequence (EDR), edit distance with real penalty (ERP), and move-split-merge (MSM). Expert assessment of pronunciation quality was used as a class label: class 1 means high-quality speech, class 0 means distorted. A comparison of training results was carried out for five classification methods: logistic regression (LR), support vector machine (SVM), naive Bayes (NB), decision trees (DT), and K-nearest neighbors (KNN). The results of using the mixture method to build an ensemble of classifiers are also presented. The use of an ensemble for the studied data sets allowed us to slightly increase the classification accuracy compared to the use of individual binary classifiers.
[AI-80] Predicting Preschoolers' Externalizing Problems with Mother-Child Interaction Dynamics and Deep Learning
Link: https://arxiv.org/abs/2501.00065
Authors: Xi Chen, Yu Ji, Cong Xia, Wen Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages, 3 figures, 2 tables
Abstract:Objective: Predicting children’s future levels of externalizing problems helps to identify children at risk and guide targeted prevention. Existing studies have shown that mothers providing support in response to children’s dysregulation was associated with children’s lower levels of externalizing problems. The current study aims to evaluate and improve the accuracy of predicting children’s externalizing problems with mother-child interaction dynamics. Method: This study used mother-child interaction dynamics during a challenging puzzle task to predict children’s externalizing problems six months later (N=101, 46 boys, Mage=57.41 months, SD=6.58). Performance of the Residual Dynamic Structural Equation Model (RDSEM) was compared with the Attention-based Sequential Behavior Interaction Modeling (ASBIM) model, developed using the deep learning techniques. Results: The RDSEM revealed that children whose mothers provided more autonomy support after increases of child defeat had lower levels of externalizing problems. Five-fold cross-validation showed that the RDSEM had good prediction accuracy. The ASBIM model further improved prediction accuracy, especially after including child inhibitory control as a personalized individual feature. Conclusions: The dynamic process of mother-child interaction provides important information for predicting children’s externalizing problems, especially maternal autonomy supportive response to child defeat. The deep learning model is a useful tool to further improve prediction accuracy.
[AI-81] Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market
Link: https://arxiv.org/abs/2501.00063
Authors: Guangming Che
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The financial industry is increasingly seeking robust methods to address the challenges posed by data scarcity and low signal-to-noise ratios, which limit the application of deep learning techniques in stock market analysis. This paper presents two innovative generative model-based approaches to synthesize stock data, specifically tailored for different scenarios within the A-share market in China. The first method, a sector-based synthesis approach, enhances the signal-to-noise ratio of stock data by classifying the characteristics of stocks from various sectors in China’s A-share market. This method employs an Approximate Non-Local Total Variation algorithm to smooth the generated data, a bandpass filtering method based on Fourier Transform to eliminate noise, and Denoising Diffusion Implicit Models to accelerate sampling speed. The second method, a recursive stock data synthesis approach based on pattern recognition, is designed to synthesize data for stocks with short listing periods and limited comparable companies. It leverages pattern recognition techniques and Markov models to learn and generate variable-length stock sequences, while introducing a sub-time-level data augmentation method to alleviate data scarcity. We validate the effectiveness of these methods through extensive experiments on various datasets, including those from the main board, STAR Market, Growth Enterprise Market Board, Beijing Stock Exchange, NASDAQ, NYSE, and AMEX. The results demonstrate that our synthesized data not only improve the performance of predictive models but also enhance the signal-to-noise ratio of individual stock signals in price trading strategies. Furthermore, the introduction of sub-time-level data significantly improves the quality of synthesized data.
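As a rough illustration of Fourier-based bandpass filtering in general, the sketch below zeroes every frequency bin outside a chosen band and transforms back. It uses a naive O(n^2) DFT in pure Python for self-containment; the function name `bandpass`, the band limits, and the synthetic signal are assumptions, and the paper's actual filter design is not described in the abstract.

```python
import cmath
import math


def bandpass(signal, low, high):
    """Keep only frequency bins whose index lies in [low, high] cycles per window."""
    n = len(signal)
    # Forward discrete Fourier transform (naive, for clarity rather than speed).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        if not low <= freq <= high:
            spectrum[k] = 0
    # Inverse transform; the result is real because the mask is symmetric.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]          # component to keep
fast = [0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]   # stand-in for noise
noisy = [s + f for s, f in zip(slow, fast)]
filtered = bandpass(noisy, low=1, high=4)
```

In practice an FFT routine would replace the naive transform; the masking step is the part that defines the filter.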
[AI-82] Training-free Heterogeneous Model Merging
Link: https://arxiv.org/abs/2501.00061
Authors: Zhengqi Xu, Han Zheng, Jie Song, Li Sun, Mingli Song
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Model merging has attracted significant attention as a powerful paradigm for model reuse, facilitating the integration of task-specific models into a singular, versatile framework endowed with multifarious capabilities. Previous studies, predominantly utilizing methods such as Weight Average (WA), have shown that model merging can effectively leverage pretrained models without the need for laborious retraining. However, the inherent heterogeneity among models poses a substantial constraint on its applicability, particularly when confronted with discrepancies in model architectures. To overcome this challenge, we propose an innovative model merging framework designed for heterogeneous models, encompassing both depth and width heterogeneity. To address depth heterogeneity, we introduce a layer alignment strategy that harmonizes model layers by segmenting deeper models, treating consecutive layers with similar representations as a cohesive segment, thus enabling the seamless merging of models with differing layer depths. For width heterogeneity, we propose a novel elastic neuron zipping algorithm that projects the weights from models of varying widths onto a common dimensional space, eliminating the need for identical widths. Extensive experiments validate the efficacy of these proposed methods, demonstrating that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging, across both vision and NLP tasks. Our code is publicly available at this https URL.
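The Weight Average (WA) baseline named in the abstract, and the reason it breaks down for heterogeneous models, can be sketched as follows. This is a minimal illustration and not the paper's method: parameters are represented as flat lists keyed by hypothetical layer names, and the averaging simply refuses models whose layers differ in name or size.

```python
def weight_average(models, coeffs=None):
    """Average parameters of models that share identical layer names and sizes."""
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    reference = models[0]
    merged = {}
    for name, ref_weights in reference.items():
        for model in models[1:]:
            # Plain averaging is only defined when every model has this
            # layer with the same number of parameters.
            if name not in model or len(model[name]) != len(ref_weights):
                raise ValueError(f"cannot average '{name}': architectures differ")
        merged[name] = [
            sum(c * model[name][i] for c, model in zip(coeffs, models))
            for i in range(len(ref_weights))
        ]
    return merged


# Two hypothetical models with the same architecture.
model_a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
model_b = {"fc.weight": [3.0, 4.0], "fc.bias": [1.0]}
merged = weight_average([model_a, model_b])

# A wider hypothetical model: plain averaging has no answer for it.
model_wide = {"fc.weight": [1.0, 2.0, 3.0], "fc.bias": [0.0]}
```

Calling `weight_average([model_a, model_wide])` raises `ValueError`, which is the width-heterogeneity case the abstract says the proposed framework is designed to handle.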
[AI-83] Transforming CCTV cameras into NO_2 sensors at city scale for adaptive policymaking
Link: https://arxiv.org/abs/2501.00056
Authors: Mohamed R. Ibrahim, Terry Lyons
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 43 pages
Abstract:Air pollution in cities, especially NO₂, is linked to numerous health problems, ranging from mortality to mental health challenges and attention deficits in children. While cities globally have initiated policies to curtail emissions, real-time monitoring remains challenging due to limited environmental sensors and their inconsistent distribution. This gap hinders the creation of adaptive urban policies that respond to the sequence of events and daily activities affecting pollution in cities. Here, we demonstrate how city CCTV cameras can act as pseudo-NO₂ sensors. Using a predictive graph deep model, we utilised traffic flow from London’s cameras in addition to environmental and spatial factors, generating NO₂ predictions from over 133 million frames. Our analysis of London’s mobility patterns unveiled critical spatiotemporal connections, showing how specific traffic patterns affect NO₂ levels, sometimes with temporal lags of up to 6 hours. For instance, if trucks only drive at night, their effects on NO₂ levels are most likely to be seen in the morning when people commute. These findings cast doubt on the efficacy of some of the urban policies currently being implemented to reduce pollution. By leveraging existing camera infrastructure and our introduced methods, city planners and policymakers could cost-effectively monitor and mitigate the impact of NO₂ and other pollutants.
[AI-84] DDD-GenDT: Dynamic Data-driven Generative Digital Twin Framework
Link: https://arxiv.org/abs/2501.00051
Authors: Yu-Zheng Lin, Qinxuan Shi, Zhanglong Yang, Banafsheh Saber Latibari, Sicong Shao, Soheil Salehi, Pratik Satam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:
Abstract:Digital twin (DT) technology has emerged as a transformative approach to simulate, predict, and optimize the behavior of physical systems, with applications that span manufacturing, healthcare, climate science, and more. However, the development of DT models often faces challenges such as high data requirements, integration complexity, and limited adaptability to dynamic changes in physical systems. This paper presents a new method inspired by dynamic data-driven applications systems (DDDAS), called the dynamic data-driven generative of digital twins framework (DDD-GenDT), which combines the physical system with an LLM, allowing the LLM to act as a DT that interacts with the physical system's operating status and generates the corresponding physical behaviors. We apply DDD-GenDT to the computer numerical control (CNC) machining process, and we use the spindle current measurement data in the NASA milling wear data set as an example to enable LLMs to forecast the physical behavior from historical data and interact with current observations. Experimental results show that in the zero-shot prediction setting, the LLM-based DT can adapt to the change in the system, and the average RMSE of the GPT-4 prediction is 0.479A, which is 4.79% of the maximum spindle motor current measurement of 10A, with little training data and instructions required. Furthermore, we analyze the performance of DDD-GenDT in this specific application and its potential to construct digital twins. We also discuss the limitations and challenges that may arise in practical implementations.
[AI-85] Stroke Prediction using Clinical and Social Features in Machine Learning
Link: https://arxiv.org/abs/2501.00048
Authors: Aidan Chadha
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Every year in the United States, 800,000 individuals suffer a stroke - one person every 40 seconds, with a death occurring every four minutes. While individual factors vary, certain predictors are more prevalent in determining stroke risk. As strokes are the second leading cause of death and disability worldwide, predicting stroke likelihood based on lifestyle factors is crucial. Showing individuals their stroke risk could motivate lifestyle changes, and machine learning offers solutions to this prediction challenge. Neural networks excel at predicting outcomes based on training features like lifestyle factors; however, they are not the only option. Logistic regression models can also effectively compute the likelihood of binary outcomes based on independent variables, making them well-suited for stroke prediction. This analysis will compare both neural networks (dense and convolutional) and logistic regression models for stroke prediction, examining their pros, cons, and differences to develop the most effective predictor that minimizes false negatives.
[AI-86] Resource-Efficient Transformer Architecture: Optimizing Memory and Execution Time for Real-Time Applications
Link: https://arxiv.org/abs/2501.00042
Authors: Krisvarish V, Priyadarshini T, K P Abhishek Sri Saai, Vaidehi Vijayakumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 1 figure
Abstract:This paper describes a memory-efficient transformer model designed to substantially reduce memory usage and execution time while keeping the model’s performance near that of the original model. Recently, new architectures of transformers were presented, focused on parameter efficiency and computational optimization; however, such models usually require considerable resources in terms of hardware when deployed in real-world applications on edge devices. This approach addresses this concern by halving embedding size and applying targeted techniques such as parameter pruning and quantization to optimize the memory footprint with minimum sacrifices in terms of accuracy. Experimental results include a 52% reduction in memory usage and a 33% decrease in execution time, resulting in better efficiency than state-of-the-art models. This work compared our model with existing compelling architectures, such as MobileBERT and DistilBERT, and proved its feasibility in the domain of resource-friendly deep learning architectures, mainly for real-time and resource-constrained applications.
[AI-87] Learning discriminative features from spectrograms using center loss for speech emotion recognition ICASSP2019
Link: https://arxiv.org/abs/2501.01103
Authors: Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted at ICASSP 2019
Abstract:Identifying the emotional state from speech is essential for the natural interaction of the machine with the speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable length spectrograms for emotion recognition by combining softmax cross-entropy loss and center loss. The softmax cross-entropy loss makes features from different emotion categories separable, and center loss efficiently pulls the features belonging to the same emotion category to their center. By combining the two losses, the discriminative power is highly enhanced, which leads the network to learn more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy are improved by over 3% on Mel-spectrogram input, and more than 4% on Short Time Fourier Transform spectrogram input.
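The combination of softmax cross-entropy and center loss described above can be written down for a single sample as a minimal sketch. The weighting factor `lam`, the function names, and the toy values are assumptions for illustration; the abstract does not give the paper's weighting or its rule for updating the class centers.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(logits, feature, label, centers, lam=0.3):
    """Softmax loss separates classes; center loss pulls features to their center.

    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    return softmax_cross_entropy(logits, label) + lam * center_loss(feature, centers[label])


centers = {0: [0.0, 0.0], 1: [2.0, 2.0]}   # hypothetical per-emotion centers
loss = joint_loss(logits=[0.0, 0.0], feature=[1.0, 0.0], label=0, centers=centers)
```

With these toy inputs the softmax term is log 2 and the center term is 0.5, so the joint loss is log 2 + 0.3 * 0.5.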
[AI-88] Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT INTERSPEECH2019
Link: https://arxiv.org/abs/2501.01102
Authors: Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2019
Abstract:Grapheme-to-phoneme (G2P) conversion serves as an essential component in Chinese Mandarin text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts a sentence containing a polyphonic character as input in the form of a Chinese character sequence without the necessity of any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from a raw Chinese character sequence and the NN based classifier predicts the polyphonic character’s pronunciation according to BERT output. In our experiments, we implemented three classifiers: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier and a Transformer block based classifier. The experimental results, compared with the baseline approach based on LSTM, demonstrate that the pre-trained model extracts effective semantic features, which greatly enhances the performance of polyphone disambiguation. In addition, we also explored the impact of contextual information on polyphone disambiguation.
[AI-89] LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management
Link: https://arxiv.org/abs/2501.00826
Authors: Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, Yang Liu
Subjects: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Cryptocurrency investment is inherently difficult due to its shorter history compared to traditional assets, the need to integrate vast amounts of data from various modalities, and the requirement for complex reasoning. While deep learning approaches have been applied to address these challenges, their black-box nature raises concerns about trust and explainability. Recently, large language models (LLMs) have shown promise in financial applications due to their ability to understand multi-modal data and generate explainable decisions. However, single LLM faces limitations in complex, comprehensive tasks such as asset investment. These limitations are even more pronounced in cryptocurrency investment, where LLMs have less domain-specific knowledge in their training corpora. To overcome these challenges, we propose an explainable, multi-modal, multi-agent framework for cryptocurrency investment. Our framework uses specialized agents that collaborate within and across teams to handle subtasks such as data analysis, literature integration, and investment decision-making for the top 30 cryptocurrencies by market capitalization. The expert training module fine-tunes agents using multi-modal historical data and professional investment literature, while the multi-agent investment module employs real-time data to make informed cryptocurrency investment decisions. Unique intrateam and interteam collaboration mechanisms enhance prediction accuracy by adjusting final predictions based on confidence levels within agent teams and facilitating information sharing between teams. Empirical evaluation using data from November 2023 to September 2024 demonstrates that our framework outperforms single-agent models and market benchmarks in classification, asset pricing, portfolio, and explainability performance. 
[AI-90] An AI-powered Bayesian generative modeling approach for causal inference in observational studies
Link: https://arxiv.org/abs/2501.00755
Authors: Qiao Liu, Wing Hung Wong
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Causal inference in observational studies with high-dimensional covariates presents significant challenges. We introduce CausalBGM, an AI-powered Bayesian generative modeling approach that captures the causal relationship among covariates, treatment, and outcome variables. The core innovation of CausalBGM lies in its ability to estimate the individual treatment effect (ITE) by learning individual-specific distributions of a low-dimensional latent feature set (e.g., latent confounders) that drives changes in both treatment and outcome. This approach not only effectively mitigates confounding effects but also provides comprehensive uncertainty quantification, offering reliable and interpretable causal effect estimates at the individual level. CausalBGM adopts a Bayesian model and uses a novel iterative algorithm to update the model parameters and the posterior distribution of latent features until convergence. This framework leverages the power of AI to capture complex dependencies among variables while adhering to the Bayesian principles. Extensive experiments demonstrate that CausalBGM consistently outperforms state-of-the-art methods, particularly in scenarios with high-dimensional covariates and large-scale datasets. Its Bayesian foundation ensures statistical rigor, providing robust and well-calibrated posterior intervals. By addressing key limitations of existing methods, CausalBGM emerges as a robust and promising framework for advancing causal inference in modern applications in fields such as genomics, healthcare, and social sciences. CausalBGM is maintained at the website this https URL.
[AI-91] Adventures in Demand Analysis Using AI
Link: https://arxiv.org/abs/2501.00382
Authors: Philipp Bach, Victor Chernozhukov, Sven Klaassen, Martin Spindler, Jan Teichert-Kluge, Suhas Vijaykumar
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 42 pages, 9 figures
Abstract:This paper advances empirical demand analysis by integrating multimodal product representations derived from artificial intelligence (AI). Using a detailed dataset of toy cars on this http URL, we combine text descriptions, images, and tabular covariates to represent each product using transformer-based embedding models. These embeddings capture nuanced attributes, such as quality, branding, and visual characteristics, that traditional methods often struggle to summarize. Moreover, we fine-tune these embeddings for causal inference tasks. We show that the resulting embeddings substantially improve the predictive accuracy of sales ranks and prices and that they lead to more credible causal estimates of price elasticity. Notably, we uncover strong heterogeneity in price elasticity driven by these product-specific features. Our findings illustrate that AI-driven representations can enrich and modernize empirical demand analysis. The insights generated may also prove valuable for applied causal inference more broadly.
[AI-92] Efficient Human-in-the-Loop Active Learning: A Novel Framework for Data Labeling in AI Systems
Link: https://arxiv.org/abs/2501.00277
Authors: Yiran Huang, Jian-Feng Yang, Haoda Fu
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Modern AI algorithms require labeled data. In the real world, the majority of data are unlabeled, and labeling the data is costly. This is particularly true for areas requiring special skills, such as the reading of radiology images by physicians. To use experts’ time for data labeling most efficiently, one promising approach is human-in-the-loop active learning. In this work, we propose a novel active learning framework with significant potential for application in modern AI systems. Unlike traditional active learning methods, which only focus on determining which data point should be labeled, our framework also introduces an innovative perspective on incorporating different query schemes. We propose a model to integrate the information from different types of queries. Based on this model, our active learning framework can automatically determine how the next question is queried. We further build a data-driven exploration and exploitation framework into our active learning method. This method can be embedded in numerous active learning algorithms. Through simulations on five real-world datasets, including a highly complex real image task, our proposed active learning framework exhibits higher accuracy and lower loss compared to other methods.
[AI-93] GroverGPT: A Large Language Model with 8 Billion Parameters for Quantum Searching
Link: https://arxiv.org/abs/2501.00135
Authors: Haoran Wang, Pingzhi Li, Min Chen, Jinglei Cheng, Junyu Liu, Tianlong Chen
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages including appendices
Abstract:Quantum computing is an exciting non-Von Neumann paradigm, offering provable speedups over classical computing for specific problems. However, the practical limits of classical simulatability for quantum circuits remain unclear, especially with current noisy quantum devices. In this work, we explore the potential of leveraging Large Language Models (LLMs) to simulate the output of a quantum Turing machine using Grover’s quantum circuits, known to provide quadratic speedups over classical counterparts. To this end, we developed GroverGPT, a specialized model based on LLaMA’s 8-billion-parameter architecture, trained on over 15 trillion tokens. Unlike brute-force state-vector simulations, which demand substantial computational resources, GroverGPT employs pattern recognition to approximate quantum search algorithms without explicitly representing quantum states. Analyzing 97K quantum search instances, GroverGPT consistently outperformed OpenAI’s GPT-4o (45% accuracy), achieving nearly 100% accuracy on 6- and 10-qubit datasets when trained on 4-qubit or larger datasets. It also demonstrated strong generalization, surpassing 95% accuracy for systems with over 20 qubits when trained on 3- to 6-qubit data. Analysis indicates GroverGPT captures quantum features of Grover’s search rather than classical patterns, supported by novel prompting strategies to enhance performance. Although accuracy declines with increasing system size, these findings offer insights into the practical boundaries of classical simulatability. This work suggests task-specific LLMs can surpass general-purpose models like GPT-4o in quantum algorithm learning and serve as powerful tools for advancing quantum research.
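For context, the brute-force state-vector simulation that the abstract contrasts GroverGPT against can be sketched in a few lines. This is a textbook simulation of Grover's search with real amplitudes, not the paper's model or data pipeline; the function name and the default iteration count (the floor of pi/4 times the square root of the search-space size) are choices made for this illustration.

```python
import math


def grover_success_probability(n_qubits, marked, iterations=None):
    """Simulate Grover's search and return the probability of measuring `marked`."""
    n = 2 ** n_qubits                      # state vector holds 2^n amplitudes
    if iterations is None:
        iterations = math.floor(math.pi / 4 * math.sqrt(n))
    amps = [1 / math.sqrt(n)] * n          # uniform superposition
    for _ in range(iterations):
        amps[marked] = -amps[marked]       # oracle: flip the sign of the marked state
        mean = sum(amps) / n
        amps = [2 * mean - a for a in amps]  # diffusion: inversion about the mean
    return amps[marked] ** 2


p_two_qubits = grover_success_probability(2, marked=3)
p_four_qubits = grover_success_probability(4, marked=5)
```

The list of amplitudes doubles with every added qubit, which is the resource cost the abstract refers to when it mentions brute-force state-vector simulations.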
[AI-94] Implementing Trust in Non-Small Cell Lung Cancer Diagnosis with a Conformalized Uncertainty-Aware AI Framework in Whole-Slide Images
Link: https://arxiv.org/abs/2501.00053
Authors: Xiaoge Zhang, Tao Wang, Chao Yan, Fedaa Najdawi, Kai Zhou, Yuan Ma, Yiu-ming Cheung, Bradley A. Malin
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Ensuring trustworthiness is fundamental to the development of artificial intelligence (AI) that is considered societally responsible, particularly in cancer diagnostics, where a misdiagnosis can have dire consequences. Current digital pathology AI models lack systematic solutions to address trustworthiness concerns arising from model limitations and data discrepancies between model deployment and development environments. To address this issue, we developed TRUECAM, a framework designed to ensure both data and model trustworthiness in non-small cell lung cancer subtyping with whole-slide images. TRUECAM integrates 1) a spectral-normalized neural Gaussian process for identifying out-of-scope inputs and 2) an ambiguity-guided elimination of tiles to filter out highly ambiguous regions, addressing data trustworthiness, as well as 3) conformal prediction to ensure controlled error rates. We systematically evaluated the framework across multiple large-scale cancer datasets, leveraging both task-specific and foundation models, and the results illustrate that an AI model wrapped with TRUECAM significantly outperforms models that lack such guidance in terms of classification accuracy, robustness, interpretability, and data efficiency, while also achieving improvements in fairness. These findings highlight TRUECAM as a versatile wrapper framework for digital pathology AI models with diverse architectural designs, promoting their responsible and effective applications in real-world settings.
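The conformal prediction component listed as item 3) can be illustrated with a generic split-conformal sketch for classification. This is not TRUECAM's implementation: the nonconformity score (one minus the predicted probability), the function names, and the calibration values are all assumptions made for this example.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores are nonconformity scores of held-out calibration samples,
    here assumed to be 1 - p(true class).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")    # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1 - p <= threshold]


# Hypothetical calibration scores from ten held-out samples.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
threshold = conformal_threshold(cal_scores, alpha=0.2)
confident = prediction_set([0.70, 0.25, 0.05], threshold)
```

Under the usual exchangeability assumption, sets built this way contain the true class with probability at least 1 - alpha, which is the sense in which the error rate is controlled.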
[AI-95] Time Series Feature Redundancy Paradox: An Empirical Study Based on Mortgage Default Prediction
Link: https://arxiv.org/abs/2501.00034
Authors: Chengyue Huang, Yahe Yang
Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:With the widespread application of machine learning in financial risk management, conventional wisdom suggests that longer training periods and more feature variables contribute to improved model performance. This paper, focusing on mortgage default prediction, empirically discovers a phenomenon that contradicts traditional knowledge: in time series prediction, increased training data timespan and additional non-critical features actually lead to significant deterioration in prediction effectiveness. Using Fannie Mae’s mortgage data, the study compares predictive performance across different time window lengths (2012-2022) and feature combinations, revealing that shorter time windows (such as single-year periods) paired with carefully selected key features yield superior prediction results. The experimental results indicate that extended time spans may introduce noise from historical data and outdated market patterns, while excessive non-critical features interfere with the model’s learning of core default factors. This research not only challenges the traditional “more is better” approach in data modeling but also provides new insights and practical guidance for feature selection and time window optimization in financial risk prediction.
[AI-96] Predicting Crack Nucleation and Propagation in Brittle Materials Using Deep Operator Networks with Diverse Trunk Architectures
Link: https://arxiv.org/abs/2501.00016
Authors: Elham Kiyani (1), Manav Manav (2), Nikhil Kadivar (3), Laura De Lorenzis (2), George Em Karniadakis (1) ((1) Division of Applied Mathematics, Brown University, Providence, RI, USA, (2) Department of Mechanical and Process Engineering, ETH Zurich, Zurich, Switzerland, (3) School of Engineering, Providence, RI, USA.)
Subjects: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
Comments: 25 pages, 21 figures
Abstract:Phase-field modeling reformulates fracture problems as energy minimization problems and enables a comprehensive characterization of the fracture process, including crack nucleation, propagation, merging, and branching, without relying on ad-hoc assumptions. However, the numerical solution of phase-field fracture problems is characterized by a high computational cost. To address this challenge, in this paper, we employ a deep neural operator (DeepONet) consisting of a branch network and a trunk network to solve brittle fracture problems. We explore three distinct approaches that vary in their trunk network configurations. In the first approach, we demonstrate the effectiveness of a two-step DeepONet, which results in a simplification of the learning task. In the second approach, we employ a physics-informed DeepONet, whereby the mathematical expression of the energy is integrated into the trunk network’s loss to enforce physical consistency. The integration of physics also results in a substantially smaller data size needed for training. In the third approach, we replace the neural network in the trunk with a Kolmogorov-Arnold Network and train it without the physics loss. Using these methods, we model crack nucleation in a one-dimensional homogeneous bar under prescribed end displacements, as well as crack propagation and branching in single edge-notched specimens with varying notch lengths subjected to tensile and shear loading. We show that the networks predict the solution fields accurately, and the error in the predicted fields is localized near the crack.
[AI-97] Relation-Aware Equivariant Graph Networks for Epitope-Unknown Antibody Design and Specificity Optimization
Link: https://arxiv.org/abs/2501.00013
Authors: Lirong Wu,Haitao Lin,Yufei Huang,Zhangyang Gao,Cheng Tan,Yunfan Liu,Tailin Wu,Stan Z. Li
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
Abstract:Antibodies are Y-shaped proteins that protect the host by binding to specific antigens, and their binding is mainly determined by the Complementarity Determining Regions (CDRs) in the antibody. Despite the great progress made in CDR design, existing computational methods still encounter several challenges: 1) poor capability of modeling complex CDRs with long sequences due to insufficient contextual information; 2) reliance on pre-given antigenic epitopes and their static interaction with the target antibody; 3) neglect of specificity during antibody optimization, which leads to non-specific antibodies. In this paper, we take into account a variety of node features, edge features, and edge relations to include more contextual and geometric information. We propose a novel Relation-Aware Antibody Design (RAAD) framework, which dynamically models antigen-antibody interactions for co-designing the sequences and structures of antigen-specific CDRs. Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies. Extensive experiments have demonstrated the superior capability of RAAD in terms of antibody modeling, generation, and optimization across different CDR types, sequence lengths, pre-training strategies, and input contexts.
[AI-98] Model-Driven Deep Neural Network for Enhanced AoA Estimation Using 5G gNB AAAI2024
Link: https://arxiv.org/abs/2501.00009
Authors: Shengheng Liu,Xingkang Li,Zihuan Mao,Peng Liu,Yongming Huang
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*Comments: Presented at AAAI 2024 (Main Technical Track)
Abstract:High-accuracy positioning has become a fundamental enabler for intelligent connected devices. Nevertheless, the present wireless networks still rely on model-driven approaches to achieve positioning functionality, which are susceptible to performance degradation in practical scenarios, primarily due to hardware impairments. Integrating artificial intelligence into the positioning framework presents a promising solution to revolutionize the accuracy and robustness of location-based services. In this study, we address this challenge by reformulating the problem of angle-of-arrival (AoA) estimation into image reconstruction of spatial spectrum. To this end, we design a model-driven deep neural network (MoD-DNN), which can automatically calibrate the angular-dependent phase error. The proposed MoD-DNN approach employs an iterative optimization scheme between a convolutional neural network and a sparse conjugate gradient algorithm. Simulation and experimental results are presented to demonstrate the effectiveness of the proposed method in enhancing spectrum calibration and AoA estimation.
Machine Learning
[LG-0] Best Transition Matrix Esitimation or Best Label Noise Robustness Classifier? Two Possible Methods to Enhance the Performance of T-revision
Link: https://arxiv.org/abs/2501.01402
Authors: Haixu Liu,Zerui Tao,Naihui Zhang,Sixing Liu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Label noise refers to incorrect labels in a dataset caused by human errors or collection defects, which is common in real-world applications and can significantly reduce the accuracy of models. This report explores how to estimate noise transition matrices and construct deep learning classifiers that are robust against label noise. In cases where the transition matrix is known, we apply forward correction and importance reweighting methods to correct the impact of label noise using the transition matrix. When the transition matrix is unknown or inaccurate, we use the anchor point assumption and T-Revision series methods to estimate or correct the noise matrix. In this study, we further improved the T-Revision method by developing T-Revision-Alpha and T-Revision-Softmax to enhance stability and robustness. Additionally, we designed and implemented two baseline classifiers, a Multi-Layer Perceptron (MLP) and ResNet-18, based on the cross-entropy loss function. We compared the performance of these methods on predicting clean labels and estimating transition matrices using the Fashion-MNIST dataset with known noise transition matrices. For the CIFAR-10 dataset, where the noise transition matrix is unknown, we estimated the noise matrix and evaluated the ability of the methods to predict clean labels.
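As an illustration of the forward correction mentioned in the [LG-0] abstract, here is a minimal sketch. It assumes the common convention that `T[i][j]` is the probability of observing noisy label `j` when the clean label is `i`. The function names and the toy matrix are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_corrected_nll(logits, noisy_label, T):
    # The model predicts clean-class probabilities; multiplying by the
    # transition matrix gives the probability of the observed noisy label.
    clean_probs = softmax(logits)
    noisy_prob = sum(clean_probs[i] * T[i][noisy_label]
                     for i in range(len(clean_probs)))
    return -math.log(noisy_prob)

# Toy 2-class transition matrix (hypothetical values): rows are clean labels.
T_toy = [[0.8, 0.2],
         [0.3, 0.7]]
loss = forward_corrected_nll([2.0, 0.0], noisy_label=1, T=T_toy)
```

With an identity transition matrix the corrected loss reduces to ordinary cross-entropy, which is a quick sanity check for any implementation of this idea.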
[LG-1] Machine Learning for Modeling Wireless Radio Metrics with Crowdsourced Data and Local Environment Features
Link: https://arxiv.org/abs/2501.01344
Authors: Yifeng Qiu,Alexis Bose
Subjects: Machine Learning (cs.LG)
*Comments: 6 pages, 12 figures
Abstract:This paper presents a suite of machine learning models, CRC-ML-Radio Metrics, designed for modeling RSRP, RSRQ, and RSSI wireless radio metrics in 4G environments. These models utilize crowdsourced data with local environmental features to enhance prediction accuracy across both indoor at elevation and outdoor urban settings. They achieve RMSE performance of 9.76 to 11.69 dB for RSRP, 2.90 to 3.23 dB for RSRQ, and 9.50 to 10.36 dB for RSSI, evaluated on over 300,000 data points in the Toronto, Montreal, and Vancouver areas. These results demonstrate the robustness and adaptability of the models, supporting precise network planning and quality of service optimization in complex Canadian urban environments.
[LG-2] Simultaneous Latent State Estimation and Latent Linear Dynamics Discovery from Image Observations
Link: https://arxiv.org/abs/2501.01339
Authors: Nikita Kostin
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:The problem of state estimation has a long history with many successful algorithms that allow analytical derivation or approximation of the posterior filtering distribution given noisy observations. This report summarizes previous work on latent state estimation from image-based observations and also suggests a new solution to this problem.
[LG-3] Optimized Relay Lens Design For High-Resolution Image Transmission In Military Target Detection Systems
Link: https://arxiv.org/abs/2501.01287
Authors: Burak Celik,Kivanc Dogan,Ezgi Taskin,Ayhan Akbal,Ahmet Orhan
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:This study presents the design and performance analysis of relay lenses that provide high-performance image transmission for target acquisition and tracking in military optical systems. Relay lenses are critical components for clear and lossless image transmission over long distances. In this study, the optical performance of a relay lens system designed and optimized using ZEMAX software is investigated in detail. The analysis focuses on important optical properties such as modulation transfer function (MTF), spot diagrams, Seidel diagram, field curvature and distortion. The results show that the lens has significant potential in military applications for target detection and tracking with high resolution and low aberration.
[LG-4] Bayesian Active Learning By Distribution Disagreement
Link: https://arxiv.org/abs/2501.01248
Authors: Thorben Werner,Lars Schmidt-Thieme
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows [berry2023normalizing, berry2023escaping] to real-world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.
[LG-5] High-Order Tensor Regression in Sparse Convolutional Neural Networks
Link: https://arxiv.org/abs/2501.01239
Authors: Roberto Dias Algarte
Subjects: Machine Learning (cs.LG)
*Comments: 14 pages, 1 algorithm
Abstract:This article presents a generic approach to convolution that significantly differs from conventional methodologies in the current Machine Learning literature. The approach, in its mathematical aspects, proved to be simple and advantageous, particularly when high-order tensors are involved. In this context, a rational theory of regression in neural networks is developed, as a framework for a generic view of sparse convolutional neural networks, the primary focus of this study. As a direct outcome, the classic Backpropagation Algorithm is redefined to align with this rational tensor-based approach and presented in its simplest, most generic form.
[LG-6] Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent
Link: https://arxiv.org/abs/2501.01230
Authors: Yongxian Wei,Anke Tang,Li Shen,Feng Xiong,Chun Yuan,Xiaochun Cao
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem (i.e., minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through data-free optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a shared subspace spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
[LG-7] Comparative Analysis of Topic Modeling Techniques on ATSB Text Narratives Using Natural Language Processing
Link: https://arxiv.org/abs/2501.01227
Authors: Aziida Nanyonga,Hassan Wasswa,Ugur Turhan,Keith Joiner,Graham Wild
Subjects: Machine Learning (cs.LG)
*Comments: conference paper
Abstract:Improvements in aviation safety analysis call for innovative techniques to extract valuable insights from the abundance of textual data available in accident reports. This paper explores the application of four prominent topic modelling techniques, namely Probabilistic Latent Semantic Analysis (pLSA), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), to dissect aviation incident narratives using the Australian Transport Safety Bureau (ATSB) dataset. The study examines each technique’s ability to unveil latent thematic structures within the data, providing safety professionals with a systematic approach to gain actionable insights. Through a comparative analysis, this research not only showcases the potential of these methods in aviation safety but also elucidates their distinct advantages and limitations.
[LG-8] Classification of Operational Records in Aviation Using Deep Learning Approaches
Link: https://arxiv.org/abs/2501.01222
Authors: Aziida Nanyonga,Graham Wild
Subjects: Machine Learning (cs.LG)
*Comments: conference paper; aviation safety, NLP, DL, operational record classification, Socrata
Abstract:Ensuring safety in the aviation industry is critical, as even minor anomalies can lead to severe consequences. This study evaluates the performance of four different deep learning (DL) models, namely Bidirectional Long Short-Term Memory (BLSTM), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Simple Recurrent Neural Networks (sRNN), on a multi-class classification task involving Commercial, Military, and Private categories using the Socrata aviation dataset of 4,864 records. The models were assessed using a classification report, confusion matrix analysis, accuracy metrics, and validation loss and accuracy curves. Among the models, BLSTM achieved the highest overall accuracy of 72%, demonstrating superior performance in stability and balanced classification, while LSTM followed closely with 71%, excelling in recall for the Commercial class. CNN and sRNN exhibited lower accuracies of 67% and 69%, with significant misclassifications in the Private class. While the results highlight the strengths of BLSTM and LSTM in handling sequential dependencies and complex classification tasks, all models faced challenges with class imbalance, particularly in predicting the Military and Private categories. Addressing these limitations through data augmentation, advanced feature engineering, and ensemble learning techniques could enhance classification accuracy and robustness. This study underscores the importance of selecting appropriate architectures for domain-specific tasks.
[LG-9] TabTreeFormer: Tree Augmented Tabular Data Generation using Transformers
Link: https://arxiv.org/abs/2501.01216
Authors: Jiayu Li,Bingyin Zhao,Zilong Zhao,Kevin Yee,Uzair Javaid,Yingjie Lao,Biplab Sikdar
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Transformers have achieved remarkable success in tabular data generation. However, they lack domain-specific inductive biases which are critical to preserving the intrinsic characteristics of tabular data. Meanwhile, they suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that incorporates a tree-based model that retains tabular-specific inductive biases of non-smooth and potentially low-correlated patterns due to its discreteness and non-rotational invariance, and hence enhances the fidelity and utility of synthetic data. In addition, we devise a dual-quantization tokenizer to capture the multimodal continuous distribution and further facilitate the learning of numerical value distribution. Moreover, our proposed tokenizer reduces the vocabulary size and sequence length due to the limited dimension-wise semantic meaning and training set size of tabular data, rendering a significant model size shrink without sacrificing the capability of the transformer model. We evaluate TabTreeFormer on 10 datasets against multiple generative models on various metrics; our experimental results show that TabTreeFormer achieves superior fidelity, utility, privacy, and efficiency. Our best model yields a 40% utility improvement with 1/16 of the baseline model size.
[LG-10] Empirical Analysis of Nature-Inspired Algorithms for Autism Spectrum Disorder Detection Using 3D Video Dataset
Link: https://arxiv.org/abs/2501.01202
Authors: Aneesh Panchal,Kainat Khan,Rahul Katarya
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments:
Abstract:Autism Spectrum Disorder (ASD) is a chronic neurodevelopmental disorder whose symptoms include repetitive behaviour and a lack of social and communication skills. Even though these symptoms can be seen very clearly in social settings, a large number of individuals with ASD remain undiagnosed. In this paper, we worked on a methodology for the detection of ASD from a 3-dimensional walking video dataset, utilizing supervised machine learning (ML) classification algorithms and nature-inspired optimization algorithms for feature extraction from the dataset. The proposed methodology involves the classification of ASD using a supervised ML classification algorithm and extracting important and relevant features from the dataset using nature-inspired optimization algorithms. We also included the ranking coefficients to find the initial leading particle. This particle selection significantly reduces the computation time and hence improves the total efficiency and accuracy for ASD detection. To evaluate the efficiency of the proposed methodology, we deployed various combinations of classification algorithms and nature-inspired algorithms, resulting in an outstanding classification accuracy of 100% using the random forest classification algorithm and gravitational search algorithm for feature selection. The application of the proposed methodology with different datasets would enhance the robustness and generalizability of the proposed methodology. Due to high accuracy and lower total computation time, the proposed methodology will offer a significant contribution to the medical and academic fields, providing a foundation for future research and advancements in ASD diagnosis.
[LG-11] Machine Learning-Based Prediction of ICU Readmissions in Intracerebral Hemorrhage Patients: Insights from the MIMIC Databases
Link: https://arxiv.org/abs/2501.01183
Authors: Shuheng Chen,Junyi Fan,Armin Abdollahi,Negin Ashrafi,Kamiar Alaei,Greg Placencia,Maryam Pishgar
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Intracerebral hemorrhage (ICH) is a life-threatening condition characterized by bleeding within the brain parenchyma. ICU readmission in ICH patients is a critical outcome, reflecting both clinical severity and resource utilization. Accurate prediction of ICU readmission risk is crucial for guiding clinical decision-making and optimizing healthcare resources. This study utilized the Medical Information Mart for Intensive Care (MIMIC-III and MIMIC-IV) databases, which contain comprehensive clinical and demographic data on ICU patients. Patients with ICH were identified from both databases. Various clinical, laboratory, and demographic features were extracted for analysis based on both a literature review and experts’ opinions. Preprocessing methods like imputing and sampling were applied to improve the performance of our models. Machine learning techniques, such as Artificial Neural Network (ANN), XGBoost, and Random Forest, were employed to develop predictive models for ICU readmission risk. Model performance was evaluated using metrics such as AUROC, accuracy, sensitivity, and specificity. The developed models demonstrated robust predictive accuracy for ICU readmission in ICH patients, with key predictors including demographic information, clinical parameters, and laboratory measurements. Our study provides a predictive framework for ICU readmission risk in ICH patients, which can aid in clinical decision-making and improve resource allocation in intensive care settings.
[LG-12] RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer
Link: https://arxiv.org/abs/2501.01182
Authors: Seongho Hong,Yong-Hoon Choi
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:
Abstract:While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.
[LG-13] An Inclusive Theoretical Framework of Robust Supervised Contrastive Loss against Label Noise
Link: https://arxiv.org/abs/2501.01130
Authors: Jingyi Cui,Yi-Ge Zhang,Hengyu Liu,Yisen Wang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Learning from noisy labels is a critical challenge in machine learning, with vast implications for numerous real-world scenarios. While supervised contrastive learning has recently emerged as a powerful tool for navigating label noise, many existing solutions remain heuristic, often devoid of a systematic theoretical foundation for crafting robust supervised contrastive losses. To address the gap, in this paper, we propose a unified theoretical framework for robust losses under the pairwise contrastive paradigm. In particular, we for the first time derive a general robust condition for arbitrary contrastive losses, which serves as a criterion to verify the theoretical robustness of a supervised contrastive loss against label noise. The theory indicates that the popular InfoNCE loss is in fact non-robust, and accordingly inspires us to develop a robust version of InfoNCE, termed Symmetric InfoNCE (SymNCE). Moreover, we highlight that our theory is an inclusive framework that provides explanations to prior robust techniques such as nearest-neighbor (NN) sample selection and robust contrastive loss. Validation experiments on benchmark datasets demonstrate the superiority of SymNCE against label noise.
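For readers unfamiliar with the InfoNCE loss that the [LG-13] abstract analyzes, the sketch below shows the standard form for one anchor with one positive and several negatives. It does not implement SymNCE, whose definition is not given in the abstract. The function name and the default temperature are illustrative assumptions.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    # Standard InfoNCE for a single anchor:
    #   -log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) )
    # computed with a log-sum-exp shift for numerical stability.
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

# A well-separated positive gives a smaller loss than a poorly separated one.
easy = info_nce(0.9, [0.1, 0.0])
hard = info_nce(0.1, [0.9, 0.8])
```

In the supervised setting, positives are pairs that share a label, so a mislabeled sample can enter the numerator as a false positive; this is the label-noise situation the abstract is concerned with.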
[LG-14] Graph2text or Graph2token: A Perspective of Large Language Models for Graph Learning
Link: https://arxiv.org/abs/2501.01124
Authors: Shuo Yu,Yingbo Wang,Ruolin Li,Guchun Liu,Yanming Shen,Shaoxiong Ji,Bowen Li,Fengling Han,Xiuzhen Zhang,Feng Xia
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Graphs are data structures used to represent irregular networks and are prevalent in numerous real-world applications. Previous methods directly model graph structures and achieve significant success. However, these methods encounter bottlenecks due to the inherent irregularity of graphs. An innovative solution is converting graphs into textual representations, thereby harnessing the powerful capabilities of Large Language Models (LLMs) to process and comprehend graphs. In this paper, we present a comprehensive review of methodologies for applying LLMs to graphs, termed LLM4graph. The core of LLM4graph lies in transforming graphs into texts for LLMs to understand and analyze. Thus, we propose a novel taxonomy of LLM4graph methods in the view of the transformation. Specifically, existing methods can be divided into two paradigms: Graph2text and Graph2token, which transform graphs into texts or tokens as the input of LLMs, respectively. We point out four challenges during the transformation to systematically present existing methods in a problem-oriented perspective. For practical concerns, we provide a guideline for researchers on selecting appropriate models and LLMs for different graphs and hardware constraints. We also identify five future research directions for LLM4graph.
[LG-15] Regularized Proportional Fairness Mechanism for Resource Allocation Without Money
Link: https://arxiv.org/abs/2501.01111
Authors: Sihan Zeng,Sujay Bhatt,Alec Koppel,Sumitra Ganesh
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*Comments:
Abstract:Mechanism design in resource allocation studies dividing limited resources among self-interested agents whose satisfaction with the allocation depends on privately held utilities. We consider the problem in a payment-free setting, with the aim of maximizing social welfare while enforcing incentive compatibility (IC), i.e., agents cannot inflate allocations by misreporting their utilities. The well-known proportional fairness (PF) mechanism achieves the maximum possible social welfare but incurs an undesirably high exploitability (the maximum unilateral inflation in utility from misreport and a measure of deviation from IC). In fact, it is known that no mechanism can achieve the maximum social welfare and exact incentive compatibility (IC) simultaneously without the use of monetary incentives (Cole et al., 2013). Motivated by this fact, we propose learning an approximate mechanism that desirably trades off the competing objectives. Our main contribution is to design an innovative neural network architecture tailored to the resource allocation problem, which we name Regularized Proportional Fairness Network (RPF-Net). RPF-Net regularizes the output of the PF mechanism by a learned function approximator of the most exploitable allocation, with the aim of reducing the incentive for any agent to misreport. We derive generalization bounds that guarantee the mechanism performance when trained under finite and out-of-distribution samples and experimentally demonstrate the merits of the proposed mechanism compared to the state-of-the-art.
[LG-16] Long-range Brain Graph Transformer
Link: https://arxiv.org/abs/2501.01100
Authors: Shuo Yu,Shan Jin,Ming Li,Tabinda Sarwar,Feng Xia
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Understanding communication and information processing among brain regions of interest (ROIs) is highly dependent on long-range connectivity, which plays a crucial role in facilitating diverse functional neural integration across the entire brain. However, previous studies generally focused on the short-range dependencies within brain networks while neglecting the long-range dependencies, limiting an integrated understanding of brain-wide communication. To address this limitation, we propose Adaptive Long-range aware TransformER (ALTER), a brain graph transformer to capture long-range dependencies between brain ROIs utilizing biased random walk. Specifically, we present a novel long-range aware strategy to explicitly capture long-range dependencies between brain ROIs. By guiding the walker towards the next hop with higher correlation value, our strategy simulates the real-world brain-wide communication. Furthermore, by employing the transformer framework, ALTER adaptively integrates both short- and long-range dependencies between brain ROIs, enabling an integrated understanding of multi-level communication across the entire brain. Extensive experiments on ABIDE and ADNI datasets demonstrate that ALTER consistently outperforms generalized state-of-the-art graph learning methods (including SAN, Graphormer, GraphTrans, and LRGNN) and other graph learning based brain network analysis methods (including FBNETGEN, BrainNetGNN, BrainGNN, and BrainNETTF) in neurological disease diagnosis. Cases of long-range dependencies are also presented to further illustrate the effectiveness of ALTER. The implementation is available at this https URL.
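The biased random walk described in the [LG-16] abstract, where the walker prefers next hops with higher correlation, can be sketched roughly as follows. This is an assumption-laden illustration rather than the ALTER implementation: it picks the next ROI with probability proportional to a non-negative correlation value, and the function name, the clipping of negative correlations to zero, and the toy matrix are all hypothetical.

```python
import random

def biased_random_walk(corr, start, length, seed=0):
    # corr[i][j] is the correlation between ROI i and ROI j.
    # Negative values are clipped to zero and self-loops are excluded,
    # both of which are simplifying assumptions of this sketch.
    rng = random.Random(seed)
    nodes = range(len(corr))
    path = [start]
    current = start
    for _ in range(length):
        weights = [max(corr[current][j], 0.0) if j != current else 0.0
                   for j in nodes]
        if sum(weights) == 0.0:
            break  # isolated node: the walk cannot continue
        # Sample the next hop with probability proportional to correlation.
        current = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(current)
    return path

# Toy 4-ROI correlation matrix (hypothetical values).
toy_corr = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.6, 0.1],
    [0.1, 0.6, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
walk = biased_random_walk(toy_corr, start=0, length=5)
```

Because strongly correlated neighbours are visited more often, multi-hop paths produced this way tend to follow chains of high correlation, which is one plausible way to surface long-range dependencies.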
[LG-17] Noise-Resilient Symbolic Regression with Dynamic Gating Reinforcement Learning AAAI2025
Link: https://arxiv.org/abs/2501.01085
Authors: Chenglu Sun,Shuo Shen,Wenzhi Tao,Deyi Xue,Zixia Zhou
Subjects: Machine Learning (cs.LG)
*Comments: 15 pages, 2 figures, accepted by AAAI 2025
Abstract:Symbolic regression (SR) has emerged as a pivotal technique for uncovering the intrinsic information within data and enhancing the interpretability of AI models. However, current state-of-the-art (SOTA) SR methods struggle to perform correct recovery of symbolic expressions from high-noise data. To address this issue, we introduce a novel noise-resilient SR (NRSR) method capable of recovering expressions from high-noise data. Our method leverages a novel reinforcement learning (RL) approach in conjunction with a designed noise-resilient gating module (NGM) to learn symbolic selection policies. The gating module can dynamically filter the meaningless information from high-noise data, thereby demonstrating a high noise-resilient capability for the SR process. We also design a mixed path entropy (MPE) bonus term in the RL process to increase the exploration capabilities of the policy. Experimental results demonstrate that our method significantly outperforms several popular baselines on benchmarks with high-noise data. Furthermore, our method can also achieve SOTA performance on benchmarks with clean data, showcasing its robustness and efficacy in SR tasks.
[LG-18] Enhancing Precision of Automated Teller Machines Network Quality Assessment: Machine Learning and Multi Classifier Fusion Approaches
Link: https://arxiv.org/abs/2501.01067
Authors: Alireza Safarzadeh,Mohammad Reza Jamali,Behzad Moshiri
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Ensuring reliable ATM services is essential for modern banking, directly impacting customer satisfaction and the operational efficiency of financial institutions. This study introduces a data fusion approach that utilizes multi-classifier fusion techniques, with a special focus on the Stacking Classifier, to enhance the reliability of ATM networks. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied, enabling balanced learning for both frequent and rare events. The proposed framework integrates diverse classification models - Random Forest, LightGBM, and CatBoost - within a Stacking Classifier, achieving a dramatic reduction in false alarms from 3.56 percent to just 0.71 percent, along with an outstanding overall accuracy of 99.29 percent. This multi-classifier fusion method synthesizes the strengths of individual models, leading to significant cost savings and improved operational decision-making. By demonstrating the power of machine learning and data fusion in optimizing ATM status detection, this research provides practical and scalable solutions for financial institutions aiming to enhance their ATM network performance and customer satisfaction.
[LG-19] HPC Application Parameter Autotuning on Edge Devices: A Bandit Learning Approach
Link: https://arxiv.org/abs/2501.01057
Authors: Abrar Hossain,Abdel-Hameed A. Badawy,Mohammad A. Islam,Tapasya Patki,Kishwar Ahmed
Subjects: Performance (cs.PF); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:
Abstract:The growing necessity for enhanced processing capabilities in edge devices with limited resources has led us to develop effective methods for improving high-performance computing (HPC) applications. In this paper, we introduce LASP (Lightweight Autotuning of Scientific Application Parameters), a novel strategy designed to address the parameter search space challenge in edge devices. Our strategy employs a multi-armed bandit (MAB) technique focused on online exploration and exploitation. Notably, LASP takes a dynamic approach, adapting seamlessly to changing environments. We tested LASP with four HPC applications: Lulesh, Kripke, Clomp, and Hypre. Its lightweight nature makes it particularly well-suited for resource-constrained edge devices. By employing the MAB framework to efficiently navigate the search space, we achieved significant performance improvements while adhering to the stringent computational limits of edge devices. Our experimental results demonstrate the effectiveness of LASP in optimizing parameter search on edge devices.
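To make the multi-armed bandit idea in the [LG-19] abstract concrete, the sketch below treats each candidate parameter configuration as an arm and balances exploration and exploitation with UCB1. The abstract does not say which bandit algorithm LASP uses, so UCB1 is a stand-in chosen for illustration. The function name, the reward callback, and the toy reward table are hypothetical, and rewards are assumed to be scaled to the range 0 to 1.

```python
import math
import random

def ucb1_autotune(configs, measure, rounds, seed=0):
    # configs: candidate parameter settings (the bandit arms).
    # measure(config, rng): returns a reward in [0, 1], e.g. a normalized
    # inverse runtime of the application under that setting.
    rng = random.Random(seed)
    counts = [0] * len(configs)
    means = [0.0] * len(configs)
    for t in range(1, rounds + 1):
        if t <= len(configs):
            arm = t - 1  # try every configuration once first
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(len(configs)),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = measure(configs[arm], rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    best = max(range(len(configs)), key=lambda a: means[a])
    return configs[best], counts

# Toy example: thread counts with made-up mean rewards plus small noise.
toy_means = {1: 0.2, 2: 0.5, 4: 0.9, 8: 0.4}

def toy_measure(config, rng):
    noisy = toy_means[config] + rng.uniform(-0.05, 0.05)
    return min(1.0, max(0.0, noisy))

best_config, pulls = ucb1_autotune([1, 2, 4, 8], toy_measure, rounds=200)
```

The bookkeeping is two small arrays and one pass per measurement, so the tuner's own overhead stays low, which matters on resource-constrained edge devices.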
[LG-20] State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects
链接: https://arxiv.org/abs/2501.01029
作者: Harshika Goyal,Mohammad Saif Wajid,Mohd Anas Wajid,Akib Mohi Ud Din Khanday,Mehdi Neshat,Amir Gandomi
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid advancement of deepfake technologies, specifically designed to create incredibly lifelike facial imagery and video content, has ignited a remarkable level of interest and curiosity across many fields, including forensic analysis, cybersecurity and the innovative creation of digital characters. By harnessing the latest breakthroughs in deep learning methods, such as Generative Adversarial Networks, Variational Autoencoders, Few-Shot Learning Strategies, and Transformers, the outcomes achieved in generating deepfakes have been nothing short of astounding and transformative. Also, the ongoing evolution of detection technologies is being developed to counteract the potential for misuse associated with deepfakes, effectively addressing critical concerns that range from political manipulation to the dissemination of fake news and the ever-growing issue of cyberbullying. This comprehensive review paper meticulously investigates the most recent developments in deepfake generation and detection, including around 400 publications, providing an in-depth analysis of the cutting-edge innovations shaping this rapidly evolving landscape. Starting with a thorough examination of systematic literature review methodologies, we embark on a journey that delves into the complex technical intricacies inherent in the various techniques used for deepfake generation, comprehensively addressing the challenges faced, potential solutions available, and the nuanced details surrounding manipulation formulations. Subsequently, the paper is dedicated to accurately benchmarking leading approaches against prominent datasets, offering thorough assessments of the contributions that have significantly impacted these vital domains. Ultimately, we engage in a thoughtful discussion of the existing challenges, paving the way for continuous advancements in this critical and ever-dynamic study area.
[LG-21] Prediction of Geoeffective CMEs Using SOHO Images and Deep Learning
链接: https://arxiv.org/abs/2501.01011
作者: Khalid A. Alobaid,Jason T. L. Wang,Haimin Wang,Ju Jing,Yasser Abduallah,Zhenduo Wang,Hameedullah Farooki,Huseyin Cavus,Vasyl Yurchyshyn
类目: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR); Space Physics (physics.space-ph)
*备注: 21 pages, 13 figures
Abstract:The application of machine learning to the study of coronal mass ejections (CMEs) and their impacts on Earth has seen significant growth recently. Understanding and forecasting CME geoeffectiveness is crucial for protecting infrastructure in space and ensuring the resilience of technological systems on Earth. Here we present GeoCME, a deep-learning framework designed to predict, deterministically or probabilistically, whether a CME event that arrives at Earth will cause a geomagnetic storm. A geomagnetic storm is defined as a disturbance of the Earth’s magnetosphere during which the minimum Dst index value is less than -50 nT. GeoCME is trained on observations from instruments including LASCO C2, EIT and MDI on board the Solar and Heliospheric Observatory (SOHO), focusing on a dataset that includes 136 halo/partial halo CMEs in Solar Cycle 23. Using ensemble and transfer learning techniques, GeoCME is capable of extracting features hidden in the SOHO observations and making predictions based on the learned features. Our experimental results demonstrate the good performance of GeoCME, achieving a Matthews correlation coefficient of 0.807 and a true skill statistic score of 0.714 when the tool is used as a deterministic prediction model. When the tool is used as a probabilistic forecasting model, it achieves a Brier score of 0.094 and a Brier skill score of 0.493. These results are promising, showing that the proposed GeoCME can help enhance our understanding of CME-triggered solar-terrestrial interactions.
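The abstract above reports four scores and a storm definition (minimum Dst below -50 nT). The sketch below shows how these quantities are commonly computed, using the standard textbook formulas. The function names are hypothetical, and the choice of climatology (the observed event rate) as the Brier skill score reference is an assumption, since the abstract does not state the reference forecast.

```python
import math


def storm_label(min_dst_nt, threshold=-50.0):
    """Return 1 if the minimum Dst index is below the storm threshold, else 0."""
    return 1 if min_dst_nt < threshold else 0


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def true_skill_statistic(tp, tn, fp, fn):
    """TSS = hit rate minus false-alarm rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


def brier_skill_score(probs, outcomes):
    """BSS relative to a climatological forecast (assumed reference)."""
    climatology = sum(outcomes) / len(outcomes)
    reference = brier_score([climatology] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference
```

MCC and TSS are both insensitive to the trivial strategy of always predicting the majority class, which matters for a small dataset such as the 136 events mentioned above.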
[LG-22] Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning
链接: https://arxiv.org/abs/2501.01002
作者: Yusi Wei,Hande Y. Benson,Joseph K. Agor,Muge Capan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Data is essential for secondary use, but ensuring its privacy while allowing such use is a critical challenge. Various techniques have been proposed to address privacy concerns in data sharing and publishing. However, these methods often degrade data utility, impacting the performance of machine learning (ML) models. Our research identifies key limitations in existing optimization models for privacy preservation, particularly in handling categorical variables, assessing data utility, and evaluating effectiveness across diverse datasets. We propose a novel multi-objective optimization model that simultaneously minimizes information loss and maximizes protection against attacks. This model is empirically validated using diverse datasets and compared with two existing algorithms. We assess information loss, the number of individuals subject to linkage or homogeneity attacks, and ML performance after anonymization. The results indicate that our model achieves lower information loss and more effectively mitigates the risk of attacks, reducing the number of individuals susceptible to these attacks compared to alternative algorithms in some cases. Additionally, our model maintains comparative ML performance relative to the original data or data anonymized by other methods. Our findings highlight significant improvements in privacy protection and ML model performance, offering a comprehensive framework for balancing privacy and utility in data sharing.
[LG-23] Physics-informed Gaussian Processes for Safe Envelope Expansion
链接: https://arxiv.org/abs/2501.01000
作者: D. Isaiah Harp,Joshua Ott,Dylan M. Asmar,John Alora,Mykel J. Kochenderfer
类目: Machine Learning (cs.LG)
*备注:
Abstract:Flight test analysis often requires predefined test points with arbitrarily tight tolerances, leading to extensive and resource-intensive experimental campaigns. To address this challenge, we propose a novel approach to flight test analysis using Gaussian processes (GPs) with physics-informed mean functions to estimate aerodynamic quantities from arbitrary flight test data, validated using real T-38 aircraft data collected in collaboration with the United States Air Force Test Pilot School. We demonstrate our method by estimating the pitching moment coefficient without requiring predefined or repeated flight test points, significantly reducing the need for extensive experimental campaigns. Our approach incorporates aerodynamic models as priors within the GP framework, enhancing predictive accuracy across diverse flight conditions and providing robust uncertainty quantification. Key contributions include the integration of physics-based priors in a probabilistic model, which allows for precise computation from arbitrary flight test maneuvers, and the demonstration of our method capturing relevant dynamic characteristics such as short-period mode behavior. The proposed framework offers a scalable and generalizable solution for efficient data-driven flight test analysis and is able to accurately predict the short period frequency and damping for the T-38 across several Mach and dynamic pressure profiles.
[LG-24] Is It Still Fair? Investigating Gender Fairness in Cross-Corpus Speech Emotion Recognition
链接: https://arxiv.org/abs/2501.00995
作者: Shreya G. Upadhyay,Woan-Shiuan Chien,Chi-Chun Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Speech emotion recognition (SER) is a vital component in various everyday applications. Cross-corpus SER models are increasingly recognized for their ability to generalize performance. However, concerns arise regarding fairness across demographics in diverse corpora. Existing fairness research often focuses solely on corpus-specific fairness, neglecting its generalizability in cross-corpus scenarios. Our study focuses on this underexplored area, examining the gender fairness generalizability in cross-corpus SER scenarios. We emphasize that the performance of cross-corpus SER models and their fairness are two distinct considerations. Moreover, we propose the approach of a combined fairness adaptation mechanism to enhance gender fairness in the SER transfer learning tasks by addressing both source and target genders. Our findings bring one of the first insights into the generalizability of gender fairness in cross-corpus SER systems.
[LG-25] Optimizing Noise Schedules of Generative Models in High Dimensions
链接: https://arxiv.org/abs/2501.00988
作者: Santiago Aranguri,Giulio Biroli,Marc Mezard,Eric Vanden-Eijnden
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent works have shown that diffusion models can undergo phase transitions, the resolution of which is needed for accurately generating samples. This has motivated the use of different noise schedules, the two most common choices being referred to as variance preserving (VP) and variance exploding (VE). Here we revisit these schedules within the framework of stochastic interpolants. Using the Gaussian Mixture (GM) and Curie-Weiss (CW) data distributions as test case models, we first investigate the effect of the variance of the initial noise distribution and show that VP recovers the low-level feature (the distribution of each mode) but misses the high-level feature (the asymmetry between modes), whereas VE performs oppositely. We also show that this dichotomy, which happens when denoising by a constant amount in each step, can be avoided by using noise schedules specific to VP and VE that allow for the recovery of both high- and low-level features. Finally we show that these schedules yield generative models for the GM and CW model whose probability flow ODE can be discretized using $\Theta_d(1)$ steps in dimension $d$ instead of the $\Theta_d(\sqrt{d})$ steps required by constant denoising.
[LG-26] On the Low-Complexity of Fair Learning for Combinatorial Multi-Armed Bandit
链接: https://arxiv.org/abs/2501.00924
作者: Xiaoyi Wu,Bo Ji,Bin Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Combinatorial Multi-Armed Bandit with fairness constraints is a framework where multiple arms form a super arm and can be pulled in each round under uncertainty to maximize cumulative rewards while ensuring the minimum average reward required by each arm. The existing pessimistic-optimistic algorithm linearly combines virtual queue-lengths (tracking the fairness violations) and Upper Confidence Bound estimates as a weight for each arm and selects a super arm with the maximum total weight. The number of super arms could be exponential to the number of arms in many scenarios. In wireless networks, interference constraints can cause the number of super arms to grow exponentially with the number of arms. Evaluating all the feasible super arms to find the one with the maximum total weight can incur extremely high computational complexity in the pessimistic-optimistic algorithm. To avoid this, we develop a low-complexity fair learning algorithm based on the so-called pick-and-compare approach that involves randomly picking M feasible super arms to evaluate. By setting M to a constant, the number of comparison steps in the pessimistic-optimistic algorithm can be reduced to a constant, thereby significantly reducing the computational complexity. Our theoretical proof shows this low-complexity design incurs only a slight sacrifice in fairness and regret performance. Finally, we validate the theoretical result by extensive simulations.
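The pick-and-compare step described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm. The weight form `queue_len + eta * ucb`, the retention of the previous round's super arm as a candidate (borrowed from the classic pick-and-compare idea in scheduling), and the "any k arms" feasibility rule are all assumptions, as are the function names.

```python
import random


def super_arm_weight(super_arm, queue_len, ucb, eta):
    """Total weight of a super arm: virtual queue length plus scaled UCB estimate."""
    return sum(queue_len[i] + eta * ucb[i] for i in super_arm)


def pick_and_compare(prev, sample_feasible, m, queue_len, ucb, eta, rng):
    """Evaluate only M random feasible super arms (plus the previous choice).

    This replaces the exhaustive search over all feasible super arms with a
    constant number of weight evaluations per round.
    """
    candidates = [sample_feasible(rng) for _ in range(m)]
    if prev is not None:
        candidates.append(prev)  # assumption: keep the incumbent in the comparison
    return max(candidates, key=lambda s: super_arm_weight(s, queue_len, ucb, eta))


def make_k_subset_sampler(num_arms, k):
    """Hypothetical feasibility rule: any set of exactly k arms is feasible."""
    def sample(rng):
        return tuple(sorted(rng.sample(range(num_arms), k)))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    queue_len = [0.0, 3.0, 1.0, 2.0, 0.5]
    ucb = [0.9, 0.2, 0.5, 0.7, 0.1]
    sampler = make_k_subset_sampler(num_arms=5, k=2)
    chosen = pick_and_compare((0, 1), sampler, 3, queue_len, ucb, 1.0, rng)
    print("chosen super arm:", chosen)
```

Because the incumbent stays in the candidate set, the selected weight never drops below the previous super arm's weight under the current estimates, while each round costs only M + 1 evaluations.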
[LG-27] Exploring Geometric Representational Alignment through Ollivier-Ricci Curvature and Ricci Flow NEURIPS2024
链接: https://arxiv.org/abs/2501.00919
作者: Nahid Torbati,Michael Gaebler,Simon M. Hofmann,Nico Scherf
类目: Machine Learning (cs.LG)
*备注: Presented at NeuReps workshop, NeurIPS 2024
Abstract:Representational analysis explores how input data of a neural system are encoded in high dimensional spaces of its distributed neural activations, and how we can compare different systems, for instance, artificial neural networks and brains, on those grounds. While existing methods offer important insights, they typically do not account for local intrinsic geometrical properties within the high-dimensional representation spaces. To go beyond these limitations, we explore Ollivier-Ricci curvature and Ricci flow as tools to study the alignment of representations between humans and artificial neural systems on a geometric level. As a proof-of-principle study, we compared the representations of face stimuli between VGG-Face, a human-aligned version of VGG-Face, and corresponding human similarity judgments from a large online study. Using this discrete geometric framework, we were able to identify local structural similarities and differences by examining the distributions of node and edge curvature and higher-level properties by detecting and comparing community structure in the representational graphs.
[LG-28] Diffusion Policies for Generative Modeling of Spacecraft Trajectories
链接: https://arxiv.org/abs/2501.00915
作者: Julia Briden,Breanna Johnson,Richard Linares,Abhishek Cauligi
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: AIAA SCITECH 2025 Forum
Abstract:Machine learning has demonstrated remarkable promise for solving the trajectory generation problem and in paving the way for online use of trajectory optimization for resource-constrained spacecraft. However, a key shortcoming in current machine learning-based methods for trajectory generation is that they require large datasets and even small changes to the original trajectory design requirements necessitate retraining new models to learn the parameter-to-solution mapping. In this work, we leverage compositional diffusion modeling to efficiently adapt out-of-distribution data and problem variations in a few-shot framework for 6 degree-of-freedom (DoF) powered descent trajectory generation. Unlike traditional deep learning methods that can only learn the underlying structure of one specific trajectory optimization problem, diffusion models are a powerful generative modeling framework that represents the solution as a probability density function (PDF) and this allows for the composition of PDFs encompassing a variety of trajectory design specifications and constraints. We demonstrate the capability of compositional diffusion models for inference-time 6 DoF minimum-fuel landing site selection and composable constraint representations. Using these samples as initial guesses for 6 DoF powered descent guidance enables dynamically feasible and computationally efficient trajectory generation.
[LG-29] Aligning LLMs with Domain Invariant Reward Models
链接: https://arxiv.org/abs/2501.00911
作者: David Wu,Sanjiban Choudhury
类目: Machine Learning (cs.LG)
*备注:
Abstract:Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between source and target distribution, and a source loss that optimizes preferences on the source domain. We show that this framework is a general approach that we evaluate and analyze across 4 distinct settings: (1) Cross-lingual transfer (accuracy: 0.621 → 0.661), (2) Clean-to-noisy (accuracy: 0.671 → 0.703), (3) Few-shot-to-full transfer (accuracy: 0.845 → 0.920), and (4) Simple-to-complex tasks transfer (correlation: 0.508 → 0.556). Our code, models and data are available at this https URL.
[LG-30] Spatial Temporal Attention based Target Vehicle Trajectory Prediction for Internet of Vehicles
链接: https://arxiv.org/abs/2501.00890
作者: Ouhan Huang,Huanle Rao,Xiaowen Cai,Tianyun Wang,Aolong Sun,Sizhe Xing,Yifan Sun,Gangyong Jia
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Forecasting vehicle behavior within complex traffic environments is pivotal within Intelligent Transportation Systems (ITS). Though this technology plays a significant role in alleviating the prevalent operational difficulties in logistics and transportation systems, the precise prediction of vehicle trajectories still poses a substantial challenge. To address this, our study introduces the Spatio Temporal Attention-based methodology for Target Vehicle Trajectory Prediction (STATVTPred). This approach integrates Global Positioning System (GPS) localization technology to track target movement and dynamically predict the vehicle’s future path using comprehensive spatio-temporal trajectory data. We map the vehicle trajectory onto a directed graph, after which spatial attributes are extracted via Graph Attention Networks (GATs). The Transformer technology is employed to yield temporal features from the sequence. These elements are then amalgamated with local road network structure maps to filter and deliver a smooth trajectory sequence, resulting in precise vehicle trajectory prediction. This study validates our proposed STATVTPred method on T-Drive and Chengdu taxi-trajectory datasets. The experimental results demonstrate that STATVTPred achieves 6.38% and 10.55% higher Average Match Rate (AMR) than the Transformer model on the Beijing and Chengdu datasets, respectively. Compared to the LSTM Encoder-Decoder model, STATVTPred boosts AMR by 37.45% and 36.06% on the same datasets. This is expected to establish STATVTPred as a new approach for handling trajectory prediction of targets in logistics and transportation scenarios, thereby enhancing prediction accuracy.
[LG-31] Evaluating Time Series Foundation Models on Noisy Periodic Time Series
链接: https://arxiv.org/abs/2501.00889
作者: Syamantak Datta Gupta
类目: Machine Learning (cs.LG)
*备注:
Abstract:While recent advancements in foundation models have significantly impacted machine learning, rigorous tests on the performance of time series foundation models (TSFMs) remain largely underexplored. This paper presents an empirical study evaluating the zero-shot, long-horizon forecasting abilities of several leading TSFMs over two synthetic datasets constituting noisy periodic time series. We assess model efficacy across different noise levels, underlying frequencies, and sampling rates. As benchmarks for comparison, we choose two statistical techniques: a Fourier transform (FFT)-based approach and a linear autoregressive (AR) model. Our findings demonstrate that while for time series with bounded periods and higher sampling rates, TSFMs can match or outperform the statistical approaches, their forecasting abilities deteriorate with longer periods, higher noise levels, lower sampling rates and more complex shapes of the time series.
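The abstract above names a Fourier-transform-based approach as one of its statistical baselines without giving details. One simple way such a baseline could work is sketched below: estimate the dominant frequency components from the history and extend them past the end of the window. The function names, the `keep` parameter, and the use of a direct O(n²) DFT (chosen to keep the example dependency-free) are assumptions, not the paper's implementation.

```python
import cmath
import math
import random


def dft(values):
    """Direct discrete Fourier transform (O(n^2), adequate for short series)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fourier_forecast(history, horizon, keep=1):
    """Forecast by extending the `keep` strongest frequency components.

    The series mean is removed first and added back to every forecast value.
    """
    n = len(history)
    mean = sum(history) / n
    spectrum = dft([v - mean for v in history])
    # Rank positive-frequency bins by magnitude; bin 0 is empty after de-meaning.
    top = sorted(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True)[:keep]
    forecast = []
    for t in range(n, n + horizon):
        value = mean
        for k in top:
            # Each bin k < n/2 has a conjugate partner at n - k, hence the factor 2.
            scale = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
            value += scale * (spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
        forecast.append(value)
    return forecast


if __name__ == "__main__":
    rng = random.Random(0)
    history = [math.sin(2 * math.pi * t / 16) + rng.gauss(0, 0.05) for t in range(64)]
    print(fourier_forecast(history, horizon=4, keep=1))
```

This sketch works best when the history spans a whole number of periods. Otherwise spectral leakage spreads energy across bins, which is consistent with the abstract's observation that longer periods and lower sampling rates make forecasting harder.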
[LG-32] Hybridising Reinforcement Learning and Heuristics for Hierarchical Directed Arc Routing Problems
链接: https://arxiv.org/abs/2501.00852
作者: Van Quang Nguyen,Quoc Chuong Nguyen,Thu Huong Dang,Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Hierarchical Directed Capacitated Arc Routing Problem (HDCARP) is an extension of the Capacitated Arc Routing Problem (CARP), where the arcs of a graph are divided into classes based on their priority. The traversal of these classes is determined by either precedence constraints or a hierarchical objective, resulting in two distinct HDCARP variants. To the best of our knowledge, only one matheuristic has been proposed for these variants, but it performs relatively slowly, particularly for large-scale instances (Ha et al., 2024). In this paper, we propose a fast heuristic to efficiently address the computational challenges of HDCARP. Furthermore, we incorporate Reinforcement Learning (RL) into our heuristic to effectively guide the selection of local search operators, resulting in a hybrid algorithm. We name this hybrid algorithm as the Hybrid Reinforcement Learning and Heuristic Algorithm for Directed Arc Routing (HRDA). The hybrid algorithm adapts to changes in the problem dynamically, using real-time feedback to improve routing strategies and solution’s quality by integrating heuristic methods. Extensive computational experiments on artificial instances demonstrate that this hybrid approach significantly improves the speed of the heuristic without deteriorating the solution quality. Our source code is publicly available at: this https URL
[LG-33] Hardness of Learning Fixed Parities with Neural Networks
链接: https://arxiv.org/abs/2501.00817
作者: Itamar Shoshani,Ohad Shamir
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Learning parity functions is a canonical problem in learning theory, which although computationally tractable, is not amenable to standard learning algorithms such as gradient-based methods. This hardness is usually explained via statistical query lower bounds [Kearns, 1998]. However, these bounds only imply that for any given algorithm, there is some worst-case parity function that will be hard to learn. Thus, they do not explain why fixed parities - say, the full parity function over all coordinates - are difficult to learn in practice, at least with standard predictors and gradient-based methods [Abbe and Boix-Adsera, 2022]. In this paper, we address this open problem, by showing that for any fixed parity of some minimal size, using it as a target function to train one-hidden-layer ReLU networks with perturbed gradient descent will fail to produce anything meaningful. To establish this, we prove a new result about the decay of the Fourier coefficients of linear threshold (or weighted majority) functions, which may be of independent interest.
[LG-34] Follow The Sparse Approximate Leader for No-Regret Online Sparse Linear Approximation
链接: https://arxiv.org/abs/2501.00799
作者: Samrat Mukhopadhyay,Debasmita Mukherjee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 12 pages, 5 figures
Abstract:We consider the problem of online sparse linear approximation, where one predicts the best sparse approximation of a sequence of measurements in terms of linear combination of columns of a given measurement matrix. Such online prediction problems are ubiquitous, ranging from medical trials to web caching to resource allocation. The inherent difficulty of offline recovery also makes the online problem challenging. In this letter, we propose Follow-The-Approximate-Sparse-Leader, an efficient online meta-policy to address this online problem. Through a detailed theoretical analysis, we prove that under certain assumptions on the measurement sequence, the proposed policy enjoys a data-dependent sublinear upper bound on the static regret, which can range from logarithmic to square-root. Numerical simulations are performed to corroborate the theoretical findings and demonstrate the efficacy of the proposed online policy.
[LG-35] Avoiding Oversmoothing in Deep Graph Neural Networks: A Multiplicative Ergodic Analysis
链接: https://arxiv.org/abs/2501.00762
作者: Ziang Chen,Zhengjiang Lin,Shi Chen,Yury Polyanskiy,Philippe Rigollet
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as “oversmoothing” persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoothing rates of deep GNNs with and without residual connections by deriving explicit convergence rates for a normalized vertex similarity measure. Our analytical framework is grounded in the multiplicative ergodic theorem. Furthermore, we demonstrate that adding residual connections effectively mitigates or prevents oversmoothing across several broad families of parameter distributions. The theoretical findings are strongly supported by numerical experiments.
[LG-36] Beyond Static Datasets: A Behavior-Driven Entity-Specific Simulation to Overcome Data Scarcity and Train Effective Crypto Anti-Money Laundering Models
链接: https://arxiv.org/abs/2501.00757
作者: Dinesh Srivasthav P,Manoj Apte
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:For different factors/reasons, ranging from inherent characteristics and features providing decentralization, enhanced privacy, ease of transactions, etc., to implied external hardships in enforcing regulations, contradictions in data sharing policies, etc., cryptocurrencies have been severely abused for carrying out numerous malicious and illicit activities including money laundering, darknet transactions, scams, terrorism financing, arm trades. However, money laundering is a key crime to be mitigated to also suspend the movement of funds from other illicit activities. Billions of dollars are annually being laundered. It is getting extremely difficult to identify money laundering in crypto transactions owing to many layering strategies available today, and rapidly evolving tactics, and patterns the launderers use to obfuscate the illicit funds. Many detection methods have been proposed ranging from naive approaches involving complete manual investigation to machine learning models. However, there are very limited datasets available for effectively training machine learning models. Also, the existing datasets are static and class-imbalanced, posing challenges for scalability and suitability to specific scenarios, due to lack of customization to varying requirements. This has been a persistent challenge in literature. In this paper, we propose behavior embedded entity-specific money laundering-like transaction simulation that helps in generating various transaction types and models the transactions embedding the behavior of several entities observed in this space. The paper discusses the design and architecture of the simulator, a custom dataset we generated using the simulator, and the performance of models trained on this synthetic data in detecting real addresses involved in money laundering.
[LG-37] FasterSTS: A Faster Spatio-Temporal Synchronous Graph Convolutional Networks for Traffic flow Forecasting
链接: https://arxiv.org/abs/2501.00756
作者: Ben-Ao Dai,Nengchao Lyu,Yongchao Miao
类目: Machine Learning (cs.LG)
*备注: 13 pages, 3 figures
Abstract:Accurate traffic flow prediction heavily relies on the spatio-temporal correlation of traffic flow data. Most current studies separately capture correlations in spatial and temporal dimensions, making it difficult to capture complex spatio-temporal heterogeneity, and often at the expense of increasing model complexity to improve prediction accuracy. Although there have been groundbreaking attempts in the field of spatio-temporal synchronous modeling, significant limitations remain in terms of performance and complexity. This study proposes a quicker and more effective spatio-temporal synchronous traffic flow forecast model to address these issues.
[LG-38] Experimental Demonstration of an Optical Neural PDE Solver via On-Chip PINN Training
链接: https://arxiv.org/abs/2501.00742
作者: Yequan Zhao,Xian Xiao,Antoine Descos,Yuan Yuan,Xinling Yu,Geza Kurczveil,Marco Fiorentino,Zheng Zhang,Raymond G. Beausoleil
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Optics (physics.optics)
*备注:
Abstract:Partial differential equation (PDE) is an important math tool in science and engineering. This paper experimentally demonstrates an optical neural PDE solver by leveraging the back-propagation-free on-photonic-chip training of physics-informed neural networks.
[LG-39] KAN KAN Buff Signed Graph Neural Networks?
链接: https://arxiv.org/abs/2501.00709
作者: Muhieddine Shebaro,Jelena Tešić
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph Representation Learning aims to create embeddings for nodes and edges, capturing their features and interconnections. Graph Neural Networks (GNNs) have excelled in this task, leveraging neural networks to model complex graph relationships. Recently, the Kolmogorov-Arnold Neural Network (KAN) emerged as an alternative to Multi-Layer Perceptron (MLP), showing improved accuracy and interpretability with fewer parameters. While KANs have been integrated into unsigned GNNs, their application in signed GNNs remains unexplored. This paper integrates KAN into Signed Graph Convolutional Networks (SGCNs) to evaluate its performance on signed graphs where edges have positive or negative signs. We empirically assess KAN-enhanced SGCNs (KASGCN) on downstream tasks such as signed community detection and link sign prediction to enhance the embedding quality in signed networks. Considering the variability in the results indicated by the relatively large standard deviation, KASGCN demonstrates competitive performance with, or similar to, the vanilla SGCN in the evaluated downstream tasks, and its effectiveness is context-dependent (signed graph and parameters…etc.).
[LG-40] Kolmogorov GAM Networks are all you need!
链接: https://arxiv.org/abs/2501.00704
作者: Sarah Polson,Vadim Sokolov
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:Kolmogorov GAM (K-GAM) networks are shown to be an efficient architecture for training and inference. They are an additive model with an embedding that is independent of the function of interest. They provide an alternative to the transformer architecture. They are the machine learning version of Kolmogorov’s Superposition Theorem (KST), which provides an efficient representation of a multivariate function. Such representations have use in machine learning for encoding dictionaries (a.k.a. “look-up” tables). KST theory also provides a representation based on translates of the Köppen function. The goal of our paper is to interpret this representation in a machine learning context for applications in Artificial Intelligence (AI). Our architecture is equivalent to a topological embedding which is independent of the function together with an additive layer that uses a Generalized Additive Model (GAM). This provides a class of learning procedures with far fewer parameters than current deep learning algorithms. Implementation can be parallelized, which makes our algorithms computationally attractive. To illustrate our methodology, we use the Iris data from statistical learning. We also show that our additive model with non-linear embedding provides an alternative to transformer architectures which from a statistical viewpoint are kernel smoothers. Additive KAN models therefore provide a natural alternative to transformers. Finally, we conclude with directions for future research.
[LG-41] NN-ResDMD: Learning Koopman Representations for Complex Dynamics with Spectral Residuals
链接: https://arxiv.org/abs/2501.00701
作者: Yuanchao Xu,Kaidi Shao,Nikos Logothetis,Zhongwei Shen
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:Analyzing long-term behaviors in high-dimensional nonlinear dynamical systems remains a significant challenge. The Koopman operator framework has emerged as a powerful tool to address this issue by providing a globally linear perspective on nonlinear dynamics. However, existing methods for approximating the Koopman operator and its spectral components, particularly in large-scale systems, often lack robust theoretical guarantees. Residual Dynamic Mode Decomposition (ResDMD) introduces a spectral residual measure to assess the convergence of the estimated Koopman spectrum, which helps filter out spurious spectral components. Nevertheless, it depends on pre-computed spectra, thereby inheriting their inaccuracies. To overcome its limitations, we introduce the Neural Network-ResDMD (NN-ResDMD), a method that directly estimates Koopman spectral components by minimizing the spectral residual. By leveraging neural networks, NN-ResDMD automatically identifies the optimal basis functions of the Koopman invariant subspace, eliminating the need for manual selection and improving the reliability of the analysis. Experiments on physical and biological systems demonstrate that NN-ResDMD significantly improves both accuracy and scalability, making it an effective tool for analyzing complex dynamical systems.
[LG-42] Cost and Reward Infused Metric Elicitation
Link: https://arxiv.org/abs/2501.00696
Authors: Chethan Bhateja,Joseph O’Brien,Afnaan Hashmi,Eva Prakash
Subjects: Machine Learning (cs.LG)
Comments: Accompanying code at this https URL
Abstract:In machine learning, metric elicitation refers to the selection of performance metrics that best reflect an individual’s implicit preferences for a given application. Currently, metric elicitation methods only consider metrics that depend on the accuracy values encoded within a given model’s confusion matrix. However, focusing solely on confusion matrices does not account for other model feasibility considerations such as varied monetary costs or latencies. In our work, we build upon the multiclass metric elicitation framework of Hiranandani et al., extrapolating their proposed Diagonal Linear Performance Metric Elicitation (DLPME) algorithm to account for additional bounded costs and rewards. Our experimental results with synthetic data demonstrate our approach’s ability to quickly converge to the true metric.
[LG-43] Beyond Model Scale Limits: End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration
Link: https://arxiv.org/abs/2501.00693
Authors: Zhiyuan Wu,Sheng Sun,Yuwei Wang,Min Liu,Ke Xu,Quyang Pan,Bo Gao,Tian Wen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 16 pages, 7 tables, 6 figures. arXiv admin note: text overlap with arXiv:2312.11489
Abstract:The rise of End-Edge-Cloud Collaboration (EECC) offers a promising paradigm for Artificial Intelligence (AI) model training across end devices, edge servers, and cloud data centers, providing enhanced reliability and reduced latency. Hierarchical Federated Learning (HFL) can benefit from this paradigm by enabling multi-tier model aggregation across distributed computing nodes. However, the potential of HFL is significantly constrained by the inherent heterogeneity and dynamic characteristics of EECC environments. Specifically, the uniform model structure bounded by the least powerful end device across all computing nodes imposes a performance bottleneck. Meanwhile, coupled heterogeneity in data distributions and resource capabilities across tiers disrupts hierarchical knowledge transfer, leading to biased updates and degraded performance. Furthermore, the mobility and fluctuating connectivity of computing nodes in EECC environments introduce complexities in dynamic node migration, further compromising the robustness of the training process. To address multiple challenges within a unified framework, we propose End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration (FedEEC), which is a novel EECC-empowered FL framework that allows the trained models from end, edge, to cloud to grow larger in size and stronger in generalization ability. FedEEC introduces two key innovations: (1) Bridge Sample Based Online Distillation Protocol (BSBODP), which enables knowledge transfer between neighboring nodes through generated bridge samples, and (2) Self-Knowledge Rectification (SKR), which refines the transferred knowledge to prevent suboptimal cloud model optimization. The proposed framework effectively handles both cross-tier resource heterogeneity and effective knowledge transfer between neighboring nodes, while satisfying the migration-resilient requirements of EECC.
[LG-44] Controlled Causal Hallucinations Can Estimate Phantom Nodes in Multiexpert Mixtures of Fuzzy Cognitive Maps
Link: https://arxiv.org/abs/2501.00673
Authors: Akash Kumar Panda,Bart Kosko
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 9 figures, The Ninth International Conference on Data Mining and Big Data 2024 (DMBD 2024), 13 December 2024
Abstract:An adaptive multiexpert mixture of feedback causal models can approximate missing or phantom nodes in large-scale causal models. The result gives a scalable form of big knowledge. The mixed model approximates a sampled dynamical system by approximating its main limit-cycle equilibria. Each expert first draws a fuzzy cognitive map (FCM) with at least one missing causal node or variable. FCMs are directed signed partial-causality cyclic graphs. They mix naturally through convex combination to produce a new causal feedback FCM. Supervised learning helps each expert FCM estimate its phantom node by comparing the FCM’s partial equilibrium with the complete multi-node equilibrium. Such phantom-node estimation allows partial control over these causal hallucinations and helps approximate the future trajectory of the dynamical system. But the approximation can be computationally heavy. Mixing the tuned expert FCMs gives a practical way to find several phantom nodes and thereby better approximate the feedback system’s true equilibrium behavior.
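The convex-combination mixing mentioned in this abstract can be illustrated with a minimal sketch. It assumes that every expert draws its FCM over the same set of nodes and that node states are updated with a simple binary threshold; the function names `mix_fcms` and `step` and the two example edge matrices are hypothetical and chosen only for illustration, so this is not the authors' implementation.

```python
def mix_fcms(matrices, weights):
    """Convex combination of equally sized FCM edge matrices.

    matrices[k][i][j] is expert k's signed causal edge from node i to node j.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


def step(state, edges):
    """One synchronous update: node j switches on if its summed causal input is positive."""
    n = len(state)
    return [
        1 if sum(state[i] * edges[i][j] for i in range(n)) > 0 else 0
        for j in range(n)
    ]


# Two hypothetical experts who agree on a 3-node causal cycle but disagree
# on whether node 0 inhibits node 2.
expert_a = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
expert_b = [[0, 1, -1], [0, 0, 1], [1, 0, 0]]
mixed = mix_fcms([expert_a, expert_b], [0.5, 0.5])

state = [1, 0, 0]
trajectory = [state]
for _ in range(3):
    state = step(state, mixed)
    trajectory.append(state)
```

In this toy case the mixed edge from node 0 to node 2 is the average of the two experts' opinions, and the state started at `[1, 0, 0]` returns to itself after three updates, which is a small example of the limit-cycle behavior the abstract refers to.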
[LG-45] Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
Link: https://arxiv.org/abs/2501.00658
Authors: Peihao Wang,Ruisi Cai,Yuehao Wang,Jiajun Zhu,Pragya Srivastava,Zhangyang Wang,Pan Li
Subjects: Machine Learning (cs.LG)
Comments: 29 pages, 10 figures, 5 tables
Abstract:Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models’ ability to recall distant information and introduces robustness issues. Our scaling experiments then discovered that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, e.g., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source codes are released at this https URL.
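To make the notion of recency bias concrete, the following toy sketch uses a scalar linear recurrence `h_t = a * h_{t-1} + x_t` as a stand-in for one channel of a state space model. This simplification, along with the names `final_state` and `influence`, is an assumption made for illustration and is not taken from the paper.

```python
def final_state(a, inputs):
    """Run the toy recurrence h_t = a * h_{t-1} + x_t from h = 0 and return the last state."""
    h = 0.0
    for x in inputs:
        h = a * h + x
    return h


def influence(a, gap):
    """Weight that an input `gap` steps in the past carries in the current state."""
    return a ** gap


# A single unit impulse followed by 50 empty steps.
impulse_then_silence = [1.0] + [0.0] * 50

decayed = final_state(0.9, impulse_then_silence)    # shrinks like 0.9 ** 50
preserved = final_state(1.0, impulse_then_silence)  # transition value 1: nothing is forgotten
forgotten = final_state(0.0, impulse_then_silence)  # transition value 0: only the current input counts
```

With a transition value strictly between 0 and 1, the contribution of an input decays exponentially with its distance from the current position. The two extreme values 0 and 1 behave very differently in this toy model, which may help in reading the abstract's proposal to fix two channels at zero and one.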
[LG-46] Finding Missed Code Size Optimizations in Compilers using LLMs
Link: https://arxiv.org/abs/2501.00655
Authors: Davide Italiano,Chris Cummins
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Accepted to appear in The International Conference on Compiler Construction (CC) 2025
Abstract:Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less attention has been paid to ensuring that the compiler produces performant code. In this work we adapt differential testing to the task of identifying missed optimization opportunities in compilers. We develop a novel testing approach which combines large language models (LLMs) with a series of differential testing strategies and use them to find missing code size optimizations in C / C++ compilers. The advantage of our approach is its simplicity. We offload the complex task of generating random code to an off-the-shelf LLM, and use heuristics and analyses to identify anomalous compiler behavior. Our approach requires fewer than 150 lines of code to implement. This simplicity makes it extensible. By simply changing the target compiler and initial LLM prompt we port the approach from C / C++ to Rust and Swift, finding bugs in both. To date we have reported 24 confirmed bugs in production compilers, and conclude that LLM-assisted testing is a promising avenue for detecting optimization bugs in real world compilers.
[LG-47] Matrix factorization and prediction for high dimensional co-occurrence count data via shared parameter alternating zero inflated Gamma model
Link: https://arxiv.org/abs/2501.00628
Authors: Taejoon Kim,Haiyan Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 39 pages, 5 figures
Abstract:High-dimensional sparse matrix data frequently arise in various applications. A notable example is the weighted word-word co-occurrence count data, which summarizes the weighted frequency of word pairs appearing within the same context window. This type of data typically contains highly skewed non-negative values with an abundance of zeros. Another example is the co-occurrence of item-item or user-item pairs in e-commerce, which also generates high-dimensional data. The objective is to utilize this data to predict the relevance between items or users. In this paper, we assume that items or users can be represented by unknown dense vectors. The model treats the co-occurrence counts as arising from zero-inflated Gamma random variables and employs cosine similarity between the unknown vectors to summarize item-item relevance. The unknown values are estimated using the shared parameter alternating zero-inflated Gamma regression models (SA-ZIG). Both canonical link and log link models are considered. Two parameter updating schemes are proposed, along with an algorithm to estimate the unknown parameters. Convergence analysis is presented analytically. Numerical studies demonstrate that the SA-ZIG using Fisher scoring without learning rate adjustment may fail to find the maximum likelihood estimate. However, the SA-ZIG with learning rate adjustment performs satisfactorily in our simulation studies.
[LG-48] Global dense vector representations for words or items using shared parameter alternating Tweedie model
Link: https://arxiv.org/abs/2501.00623
Authors: Taejoon Kim,Haiyan Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 12 figures
Abstract:In this article, we present a model for analyzing cooccurrence count data derived from practical fields, such as user-item or item-item data from online shopping platforms and cooccurring word-word pairs in sequences of text. Such data contain important information for developing recommender systems or studying the relevance of items or words from non-numerical sources. Unlike in traditional regression models, there are no observations for covariates. Additionally, the cooccurrence matrix is typically of such high dimension that it does not fit into a computer’s memory for modeling. We extract numerical data by defining windows of cooccurrence using weighted counts on the continuous scale. Positive probability mass is allowed for zero observations. We present the Shared parameter Alternating Tweedie (SA-Tweedie) model and an algorithm to estimate the parameters. We introduce a learning rate adjustment used along with the Fisher scoring method in the inner loop to help the algorithm stay on track in the optimizing direction. Gradient descent with the Adam update was also considered as an alternative method for the estimation. Simulation studies and an application showed that our algorithm with Fisher scoring and learning rate adjustment outperforms the other two methods. A pseudo-likelihood approach with alternating parameter update was also studied. Numerical studies showed that the pseudo-likelihood approach is not suitable in our shared parameter alternating regression models with unobserved covariates.
[LG-49] Predicting Barge Presence and Quantity on Inland Waterways using Vessel Tracking Data: A Machine Learning Approach
Link: https://arxiv.org/abs/2501.00615
Authors: Geoffery Agorkua,Sarah Hernandez,Maria Falquez,Subhadipto Poddar,Shihao Pang
Subjects: Machine Learning (cs.LG)
Abstract:This study presents a machine learning approach to predict the number of barges transported by vessels on inland waterways using tracking data from the Automatic Identification System (AIS). While AIS tracks the location of tug and tow vessels, it does not monitor the presence or number of barges transported by those vessels. Understanding the number and types of barges conveyed along river segments, between ports, and at ports is crucial for estimating the quantities of freight transported on the nation’s waterways. This insight is also valuable for waterway management and infrastructure operations impacting areas such as targeted dredging operations, and data-driven resource allocation. Labeled sample data was generated using observations from traffic cameras located along key river segments and matched to AIS data records. A sample of 164 vessels representing up to 42 barge convoys per vessel was used for model development. The methodology involved first predicting barge presence and then predicting barge quantity. Features derived from the AIS data included speed measures, vessel characteristics, turning measures, and interaction terms. For predicting barge presence, the AdaBoost model achieved an F1 score of 0.932. For predicting barge quantity, the Random Forest combined with an AdaBoost ensemble model achieved an F1 score of 0.886. Bayesian optimization was used for hyperparameter tuning. By advancing predictive modeling for inland waterways, this study offers valuable insights for transportation planners and organizations, which require detailed knowledge of traffic volumes, including the flow of commodities, their destinations, and the tonnage moving in and out of ports.
[LG-50] Time-Varying Graph Learning for Data with Heavy-Tailed Distribution
Link: https://arxiv.org/abs/2501.00606
Authors: Amirhossein Javaheri,Jiaxi Ying,Daniel P. Palomar,Farokh Marvasti
Subjects: Machine Learning (cs.LG)
Abstract:Graph models provide efficient tools to capture the underlying structure of data defined over networks. Many real-world network topologies are subject to change over time. Learning to model the dynamic interactions between entities in such networks is known as time-varying graph learning. Current methodology for learning such models often lacks robustness to outliers in the data and fails to handle heavy-tailed distributions, a common feature in many real-world datasets (e.g., financial data). This paper addresses the problem of learning time-varying graph models capable of efficiently representing heavy-tailed data. Unlike traditional approaches, we incorporate graph structures with specific spectral properties to enhance data clustering in our model. Our proposed method, which can also deal with noise and missing values in the data, is based on a stochastic approach, where a non-negative vector auto-regressive (VAR) model captures the variations in the graph and a Student-t distribution models the signal originating from this underlying time-varying graph. We propose an iterative method to learn time-varying graph topologies within a semi-online framework where only a mini-batch of data is used to update the graph. Simulations with both synthetic and real datasets demonstrate the efficacy of our model in analyzing heavy-tailed data, particularly those found in financial markets.
[LG-51] Per Subject Complexity in Eye Movement Prediction
Link: https://arxiv.org/abs/2501.00597
Authors: Kateryna Melnyk,Dmytro Katrychuk,Lee Friedman,Oleg Komogortsev
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 14 pages
Abstract:Eye movement prediction is a promising area of research to compensate for the latency introduced by eye-tracking systems in virtual reality devices. In this study, we comprehensively analyze the complexity of the eye movement prediction task associated with subjects. We use three fundamentally different models within the analysis: the lightweight Long Short-Term Memory network (LSTM), the transformer-based network for multivariate time series representation learning (TST), and the Oculomotor Plant Mathematical Model wrapped in the Kalman Filter framework (OPKF). Each solution is assessed following a sample-to-event evaluation strategy and employing the new event-to-subject metrics. Our results show that the different models maintained similar prediction performance trends pertaining to subjects. We refer to these outcomes as per-subject complexity since some subjects’ data pose a more significant challenge for models. Along with the detailed correlation analysis, this report investigates the source of the per-subject complexity and discusses potential solutions to overcome it.
[LG-52] Adaptive Tabu Dropout for Regularization of Deep Neural Network
Link: https://arxiv.org/abs/2501.00538
Authors: Md. Tarek Hasan,Arifa Akter,Mohammad Nazmush Shamael,Md Al Emran Hossain,H. M. Mutasim Billah,Sumayra Islam,Swakkhar Shatabda
Subjects: Machine Learning (cs.LG)
Abstract:Dropout is an effective strategy for the regularization of deep neural networks. Applying tabu to the units that have been dropped in the recent epoch and retaining them for training ensures diversification in dropout. In this paper, we improve the Tabu Dropout mechanism for training deep neural networks in two ways. Firstly, we propose to use tabu tenure, or the number of epochs a particular unit will not be dropped. Different tabu tenures provide diversification to boost the training of deep neural networks based on the search landscape. Secondly, we propose an adaptive tabu algorithm that automatically selects the tabu tenure based on the training performances through epochs. On several standard benchmark datasets, the experimental results show that the adaptive tabu dropout and tabu tenure dropout diversify and perform significantly better compared to the standard dropout and basic tabu dropout mechanisms.
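The idea of a tabu tenure can be sketched as a mask generator. The sketch below assumes one dropout mask per epoch and an independent drop decision per unit; the function name `tabu_dropout_masks` and its parameters are hypothetical, and the paper's actual mechanism (including the adaptive choice of tenure) may differ.

```python
import random


def tabu_dropout_masks(n_units, epochs, drop_prob, tenure, seed=0):
    """Generate one keep/drop mask per epoch (1 = keep, 0 = drop).

    A unit that is dropped becomes tabu: it is kept for the next `tenure`
    epochs before it can be dropped again.
    """
    rng = random.Random(seed)
    remaining = [0] * n_units  # epochs for which each unit is still tabu
    masks = []
    for _ in range(epochs):
        mask = []
        for u in range(n_units):
            if remaining[u] > 0:
                # Unit is tabu: keep it and count down its tenure.
                remaining[u] -= 1
                mask.append(1)
            elif rng.random() < drop_prob:
                # Unit is dropped now and becomes tabu for `tenure` epochs.
                remaining[u] = tenure
                mask.append(0)
            else:
                mask.append(1)
        masks.append(mask)
    return masks


masks = tabu_dropout_masks(n_units=8, epochs=10, drop_prob=0.5, tenure=2)
```

Setting `tenure=0` reduces this sketch to ordinary independent dropout, while larger tenures force recently dropped units back into training for longer.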
[LG-53] Rapid Learning in Constrained Minimax Games with Negative Momentum
Link: https://arxiv.org/abs/2501.00533
Authors: Zijian Fang,Zongkai Liu,Chao Yu,Chaohao Hu
Subjects: Machine Learning (cs.LG)
Abstract:In this paper, we delve into the utilization of the negative momentum technique in constrained minimax games. From an intuitive mechanical standpoint, we introduce a novel framework for momentum buffer updating, which extends the findings of negative momentum from the unconstrained setting to the constrained setting and provides a universal enhancement to classic game-solver algorithms. Additionally, we provide a theoretical guarantee of convergence for our momentum-augmented algorithms with an entropy regularizer. We then extend these algorithms to their extensive-form counterparts. Experimental results on both Normal Form Games (NFGs) and Extensive Form Games (EFGs) demonstrate that our momentum techniques can significantly improve algorithm performance, surpassing both their original versions and the SOTA baselines by a large margin.
[LG-54] Stochastic Extragradient with Flip-Flop Shuffling Anchoring: Provable Improvements NEURIPS2024
Link: https://arxiv.org/abs/2501.00511
Authors: Jiseok Chae,Chulhee Yun,Donghwan Kim
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 73+7 pages, 4 figures. Published in NeurIPS 2024
Abstract:In minimax optimization, the extragradient (EG) method has been extensively studied because it outperforms the gradient descent-ascent method in convex-concave (C-C) problems. Yet, stochastic EG (SEG) has seen limited success in C-C problems, especially for unconstrained cases. Motivated by the recent progress of shuffling-based stochastic methods, we investigate the convergence of shuffling-based SEG in unconstrained finite-sum minimax problems, in search of convergent shuffling-based SEG. Our analysis reveals that both random reshuffling and the recently proposed flip-flop shuffling alone can suffer divergence in C-C problems. However, with an additional simple trick called anchoring, we develop the SEG with flip-flop anchoring (SEG-FFA) method which successfully converges in C-C problems. We also show upper and lower bounds in the strongly-convex-strongly-concave setting, demonstrating that SEG-FFA has a provably faster convergence rate compared to other shuffling-based methods.
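The claim that extragradient outperforms gradient descent-ascent on convex-concave problems can be seen on the standard bilinear example f(x, y) = x * y, whose saddle point is the origin. The sketch below is deterministic and does not implement shuffling, anchoring, or SEG-FFA; the function names, step size, and starting point are illustrative assumptions.

```python
import math


def gda_step(x, y, lr):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    # grad_x f = y (descend in x), grad_y f = x (ascend in y)
    return x - lr * y, y + lr * x


def eg_step(x, y, lr):
    """Extragradient: take a look-ahead step, then update with the look-ahead gradients."""
    x_half, y_half = x - lr * y, y + lr * x
    return x - lr * y_half, y + lr * x_half


def distance_after(step, steps=200, lr=0.1, start=(1.0, 1.0)):
    """Distance from the saddle point (0, 0) after running `step` repeatedly."""
    x, y = start
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)


gda_distance = distance_after(gda_step)
eg_distance = distance_after(eg_step)
```

On this problem each GDA step multiplies the squared distance to the saddle point by 1 + lr^2, so the iterates spiral outward, whereas each EG step multiplies it by 1 - lr^2 + lr^4, which is below 1 for any step size between 0 and 1.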
[LG-55] Active Learning of General Halfspaces: Label Queries vs Membership Queries NEURIPS2024
Link: https://arxiv.org/abs/2501.00508
Authors: Ilias Diakonikolas,Daniel M. Kane,Mingchen Ma
Subjects: Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2024
Abstract:We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces under the Gaussian distribution on $\mathbb{R}^d$ in the presence of some form of query access. In the classical pool-based active learning model, where the algorithm is allowed to make adaptive label queries to previously sampled points, we establish a strong information-theoretic lower bound ruling out non-trivial improvements over the passive setting. Specifically, we show that any active learner requires label complexity of $\tilde{\Omega}(d/(\log(m)\epsilon))$, where $m$ is the number of unlabeled examples. Specifically, to beat the passive label complexity of $\tilde{O}(d/\epsilon)$, an active learner requires a pool of $2^{\mathrm{poly}(d)}$ unlabeled samples. On the positive side, we show that this lower bound can be circumvented with membership query access, even in the agnostic model. Specifically, we give a computationally efficient learner with query complexity of $\tilde{O}(\min\{1/p, 1/\epsilon\} + d\cdot \mathrm{polylog}(1/\epsilon))$ achieving error guarantee of $O(\mathrm{opt})+\epsilon$. Here $p \in [0, 1/2]$ is the bias and $\mathrm{opt}$ is the 0-1 loss of the optimal halfspace. As a corollary, we obtain a strong separation between the active and membership query models. Taken together, our results characterize the complexity of learning general halfspaces under Gaussian marginals in these models.
[LG-56] Score-Based Metropolis-Hastings Algorithms
Link: https://arxiv.org/abs/2501.00467
Authors: Ahmed Aloui,Ali Hasan,Juncheng Dong,Zihao Wu,Vahid Tarokh
Subjects: Machine Learning (cs.LG); Computation (stat.CO)
Abstract:In this paper, we introduce a new approach for integrating score-based models with the Metropolis-Hastings algorithm. While traditional score-based diffusion models excel in accurately learning the score function from data points, they lack an energy function, making the Metropolis-Hastings adjustment step inaccessible. Consequently, the unadjusted Langevin algorithm is often used for sampling using estimated score functions. The lack of an energy function then prevents the application of the Metropolis-adjusted Langevin algorithm and other Metropolis-Hastings methods, limiting the wealth of other algorithms developed that use acceptance functions. We address this limitation by introducing a new loss function based on the detailed balance condition, allowing the estimation of the Metropolis-Hastings acceptance probabilities given a learned score function. We demonstrate the effectiveness of the proposed method for various scenarios, including sampling from heavy-tail distributions.
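To show why the Metropolis-Hastings adjustment needs more than a score function, here is a minimal sketch of the standard Metropolis-adjusted Langevin algorithm (MALA) on a one-dimensional standard Gaussian target, where both the energy and the score are known in closed form. It is not the paper's learned-acceptance method; the target, the step size, and the function names are assumptions for illustration.

```python
import math
import random


def energy(x):
    """Energy of a standard Gaussian target: pi(x) is proportional to exp(-energy(x))."""
    return 0.5 * x * x


def score(x):
    """Score of the target, i.e. the gradient of log pi(x)."""
    return -x


def log_q(to, frm, step):
    """Log density (up to a constant) of the Langevin proposal from `frm` to `to`."""
    return -((to - frm - step * score(frm)) ** 2) / (4.0 * step)


def mala_samples(n, step, seed=0):
    """Langevin proposals corrected by a Metropolis-Hastings accept/reject step."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n):
        proposal = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        # The acceptance ratio uses the energy difference, which a model that
        # only provides the score cannot evaluate directly.
        log_alpha = (energy(x) - energy(proposal)
                     + log_q(x, proposal, step) - log_q(proposal, x, step))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = proposal
        samples.append(x)
    return samples


samples = mala_samples(20000, 0.5)
```

Dropping the accept/reject step from this loop gives the unadjusted Langevin algorithm, which needs only `score` but no longer targets the distribution exactly for a finite step size.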
[LG-57] Dementia Detection using Multi-modal Methods on Audio Data
Link: https://arxiv.org/abs/2501.00465
Authors: Saugat Kannojia,Anirudh Praveen,Danish Vasdev,Saket Nandedkar,Divyansh Mittal,Sarthak Kalankar,Shaurya Johari,Vipul Arora
Subjects: Machine Learning (cs.LG)
Comments: 4 pages
Abstract:Dementia is a neurodegenerative disease that causes gradual cognitive impairment. It is very common worldwide and is the subject of extensive research each year into prevention and cures. It severely impacts the patient’s ability to remember events and communicate clearly. Most variations of it have no known cure, but early detection can help alleviate symptoms before they become worse. One of the main symptoms of dementia is difficulty in expressing ideas through speech. This paper describes a model developed to predict the onset of the disease using audio recordings from patients. An ASR-based model was developed that generates transcripts from the audio files using the Whisper model and then applies a RoBERTa regression model to generate an MMSE score for the patient. This score can be used to predict the extent to which the cognitive ability of a patient has been affected. We use the PROCESS_V1 dataset for this task, which is introduced through the PROCESS Grand Challenge 2025. The model achieved an RMSE score of 2.6911, which is around 10 percent lower than the described baseline.
[LG-58] Addressing Challenges in Data Quality and Model Generalization for Malaria Detection
Link: https://arxiv.org/abs/2501.00464
Authors: Kiswendsida Kisito Kabore,Desire Guel
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 22 pages, 17 figures, 17 tables, In: Journal of Sensor Networks and Data Communications (JSNDC). ISSN: 2994-6433. DOI: https://doi.org/10.33140/JSNDC.04.03.09
Abstract:Malaria remains a significant global health burden, particularly in resource-limited regions where timely and accurate diagnosis is critical to effective treatment and control. Deep Learning (DL) has emerged as a transformative tool for automating malaria detection and it offers high accuracy and scalability. However, the effectiveness of these models is constrained by challenges in data quality and model generalization including imbalanced datasets, limited diversity and annotation variability. These issues reduce diagnostic reliability and hinder real-world applicability. This article provides a comprehensive analysis of these challenges and their implications for malaria detection performance. Key findings highlight the impact of data imbalances which can lead to a 20% drop in F1-score and regional biases which significantly hinder model generalization. Proposed solutions, such as GAN-based augmentation, improved accuracy by 15-20% by generating synthetic data to balance classes and enhance dataset diversity. Domain adaptation techniques, including transfer learning, further improved cross-domain robustness by up to 25% in sensitivity. Additionally, the development of diverse global datasets and collaborative data-sharing frameworks is emphasized as a cornerstone for equitable and reliable malaria diagnostics. The role of explainable AI techniques in improving clinical adoption and trustworthiness is also underscored. By addressing these challenges, this work advances the field of AI-driven malaria detection and provides actionable insights for researchers and practitioners. The proposed solutions aim to support the development of accessible and accurate diagnostic tools, particularly for resource-constrained populations.
[LG-59] Unrolled Creative Adversarial Network For Generating Novel Musical Pieces
Link: https://arxiv.org/abs/2501.00452
Authors: Pratik Nag
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Abstract:Music generation has been established as a prominent topic in artificial intelligence and machine learning over recent years. In most recent works, RNN-based neural network methods have been applied for sequence generation. In contrast, generative adversarial networks (GANs) and their counterparts have been explored by very few researchers for music generation. In this paper, a classical system was employed alongside a new system to generate creative music. Both systems were designed based on adversarial networks to generate music by learning from examples. The classical system was trained to learn a set of music pieces without differentiating between classes, whereas the new system was trained to learn the different composers and their styles to generate a creative music piece by deviating from the learned composers’ styles. The base structure utilized was generative adversarial networks (GANs), which are capable of generating novel outputs given a set of inputs to learn from and mimic their distribution. It has been shown in previous work that GANs are limited in their original design with respect to creative outputs. Building on the Creative Adversarial Networks (CAN), this work applied them in the music domain rather than the visual art domain. Additionally, unrolled CAN was introduced to prevent mode collapse. Experiments were conducted on both GAN and CAN for generating music, and their capabilities were measured in terms of deviation from the input set.
[LG-60] Intuitive Analysis of the Quantization-based Optimization: From Stochastic and Quantum Mechanical Perspective NEURIPS2024
Link: https://arxiv.org/abs/2501.00436
Authors: Jinwuk Seok,Changsik Cho
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: published in NeurIPS 2024 workshop OPT2024
Abstract:In this paper, we present an intuitive analysis of the optimization technique based on the quantization of an objective function. Quantization of an objective function is an effective optimization methodology that decreases the measure of a level set containing several saddle points and local minima and finds the optimal point at the limit level set. To investigate the dynamics of quantization-based optimization, we derive an overdamped Langevin dynamics model from an intuitive analysis to minimize the level set by iterative quantization. We claim that quantization-based optimization involves the quantities of thermodynamical and quantum mechanical optimization as the core methodologies of global optimization. Furthermore, on the basis of the proposed SDE, we provide thermodynamic and quantum mechanical analysis with Witten-Laplacian. The simulation results with the benchmark functions, which compare the performance of the nonlinear optimization, demonstrate the validity of the quantization-based optimization.
[LG-61] Outlier-Robust Linear System Identification Under Heavy-tailed Noise
Link: https://arxiv.org/abs/2501.00421
Authors: Vinay Kanakeri,Aritra Mitra
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:We consider the problem of estimating the state transition matrix of a linear time-invariant (LTI) system, given access to multiple independent trajectories sampled from the system. Several recent papers have conducted a non-asymptotic analysis of this problem, relying crucially on the assumption that the process noise is either Gaussian or sub-Gaussian, i.e., “light-tailed”. In sharp contrast, we work under a significantly weaker noise model, assuming nothing more than the existence of the fourth moment of the noise distribution. For this setting, we provide the first set of results demonstrating that one can obtain sample-complexity bounds for linear system identification that are nearly of the same order as under sub-Gaussian noise. To achieve such results, we develop a novel robust system identification algorithm that relies on constructing multiple weakly-concentrated estimators, and then boosting their performance using suitable tools from high-dimensional robust statistics. Interestingly, our analysis reveals how the kurtosis of the noise distribution, a measure of heavy-tailedness, affects the number of trajectories needed to achieve desired estimation error bounds. Finally, we show that our algorithm and analysis technique can be easily extended to account for scenarios where an adversary can arbitrarily corrupt a small fraction of the collected trajectory data. Our work takes the first steps towards building a robust statistical learning theory for control under non-ideal assumptions on the data-generating process.
[LG-62] KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning
链接: https://arxiv.org/abs/2501.00420
作者: Fangchen Yu,Ruilizhen Hu,Yidong Lin,Yuqi Ma,Zhenghao Huang,Wenye Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:The Kolmogorov-Arnold Network (KAN) has recently gained attention as an alternative to traditional multi-layer perceptrons (MLPs), offering improved accuracy and interpretability by employing learnable activation functions on edges. In this paper, we introduce the Kolmogorov-Arnold Auto-Encoder (KAE), which integrates KAN with autoencoders (AEs) to enhance representation learning for retrieval, classification, and denoising tasks. Leveraging the flexible polynomial functions in KAN layers, KAE captures complex data patterns and non-linear relationships. Experiments on benchmark datasets demonstrate that KAE improves latent representation quality, reduces reconstruction errors, and achieves superior performance in downstream tasks such as retrieval, classification, and denoising, compared to standard autoencoders and other KAN variants. These results suggest KAE’s potential as a useful tool for representation learning. Our code is available at this https URL.
[LG-63] Toward Information Theoretic Active Inverse Reinforcement Learning NEURIPS2024
链接: https://arxiv.org/abs/2501.00381
作者: Ondrej Bajgar,Sid William Gould,Rohan Narayan Langford Mitta,Jonathon Liu,Oliver Newcombe,Jack Golden
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty
Abstract:As AI systems become increasingly autonomous, aligning their decision-making to human preferences is essential. In domains like autonomous driving or robotics, it is impossible to write down the reward function representing these preferences by hand. Inverse reinforcement learning (IRL) offers a promising approach to infer the unknown reward from demonstrations. However, obtaining human demonstrations can be costly. Active IRL addresses this challenge by strategically selecting the most informative scenarios for human demonstration, reducing the amount of required human effort. Where most prior work allowed querying the human for an action at one state at a time, we motivate and analyse scenarios where we collect longer trajectories. We provide an information-theoretic acquisition function, propose an efficient approximation scheme, and illustrate its performance through a set of gridworld experiments as groundwork for future work expanding to more general settings.
[LG-64] Federated Dropout: Convergence Analysis and Resource Allocation
链接: https://arxiv.org/abs/2501.00379
作者: Sijing Xie,Dingzhu Wen,Xiaonan Liu,Changsheng You,Tharmalingam Ratnarajah,Kaibin Huang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Federated Dropout is an efficient technique to overcome both communication and computation bottlenecks for deploying federated learning at the network edge. In each training round, an edge device only needs to update and transmit a sub-model, which is generated by the typical method of dropout in deep learning, and thus effectively reduces the per-round latency. However, the theoretical convergence analysis for Federated Dropout is still lacking in the literature, particularly regarding the quantitative influence of dropout rate on convergence. To address this issue, by using the Taylor expansion method, we mathematically show that the gradient variance increases with a scaling factor of $\gamma/(1-\gamma)$, with $\gamma \in [0, \theta)$ denoting the dropout rate and $\theta$ being the maximum dropout rate ensuring the loss function reduction. Based on the above approximation, we provide the convergence analysis for Federated Dropout. Specifically, it is shown that a larger dropout rate of each device leads to a slower convergence rate. This provides a theoretical foundation for reducing the convergence latency by making a tradeoff between the per-round latency and the overall rounds till convergence. Moreover, a low-complexity algorithm is proposed to jointly optimize the dropout rate and the bandwidth allocation for minimizing the loss function in all rounds under a given per-round latency and limited network resources. Finally, numerical results are provided to verify the effectiveness of the proposed algorithm.
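To see where a factor of the form gamma/(1-gamma) can come from, the sketch below works through the textbook case of inverted dropout on a single gradient coordinate. It is an illustration only, not the paper's Taylor-expansion derivation: the coordinate g is kept with probability 1-gamma and rescaled by 1/(1-gamma), so the estimate stays unbiased while its variance equals g² · gamma/(1-gamma). The function names are hypothetical.

```python
import random


def variance_factor(gamma: float) -> float:
    """Variance multiplier gamma / (1 - gamma) for inverted dropout with rate gamma."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return gamma / (1.0 - gamma)


def empirical_variance(g: float, gamma: float, trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo variance of one gradient coordinate under inverted dropout."""
    rng = random.Random(seed)
    keep = 1.0 - gamma
    # Keep the coordinate with probability `keep` and rescale it, else zero it out.
    samples = [g / keep if rng.random() < keep else 0.0 for _ in range(trials)]
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


g, gamma = 2.0, 0.25
print("analytic :", g ** 2 * variance_factor(gamma))
print("empirical:", empirical_variance(g, gamma))
```

The factor grows without bound as gamma approaches 1, which is consistent with the abstract's statement that a larger dropout rate slows convergence.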
[LG-65] A New Dataset and Methodology for Malicious URL Classification
链接: https://arxiv.org/abs/2501.00356
作者: Ilan Schvartzman,Roei Sarussi,Maor Ashkenazi,Ido kringel,Yaniv Tocker,Tal Furman Shohet
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning’s promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. In order to address these gaps, we introduce a novel, multi-class dataset for malicious URL classification, distinguishing between benign, phishing and malicious URLs, named DeepURLBench. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models, as compared to a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers, applying these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model’s capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.
[LG-66] diffIRM: A Diffusion-Augmented Invariant Risk Minimization Framework for Spatiotemporal Prediction over Graphs
链接: https://arxiv.org/abs/2501.00305
作者: Zhaobin Mo,Haotian Xiang,Xuan Di
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spatiotemporal prediction over graphs (STPG) is challenging, because real-world data suffers from the Out-of-Distribution (OOD) generalization problem, where test data follow different distributions from training ones. To address this issue, Invariant Risk Minimization (IRM) has emerged as a promising approach for learning invariant representations across different environments. However, IRM and its variants are originally designed for Euclidean data like images, and may not generalize well to graph-structure data such as spatiotemporal graphs due to spatial correlations in graphs. To overcome the challenge posed by graph-structure data, the existing graph OOD methods adhere to the principles of invariance existence, or environment diversity. However, there is little research that combines both principles in the STPG problem. A combination of the two is crucial for efficiently distinguishing between invariant features and spurious ones. In this study, we fill in this research gap and propose a diffusion-augmented invariant risk minimization (diffIRM) framework that combines these two principles for the STPG problem. Our diffIRM contains two processes: i) data augmentation and ii) invariant learning. In the data augmentation process, a causal mask generator identifies causal features and a graph-based diffusion model acts as an environment augmentor to generate augmented spatiotemporal graph data. In the invariant learning process, an invariance penalty is designed using the augmented data, and then serves as a regularizer for training the spatiotemporal prediction model. The real-world experiment uses three human mobility datasets, i.e. SafeGraph, PeMS04, and PeMS08. Our proposed diffIRM outperforms baselines.
[LG-67] Solving Partial Differential Equations with Random Feature Models
链接: https://arxiv.org/abs/2501.00288
作者: Chunyang Liao
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: Submitted
Abstract:Machine learning based partial differential equation (PDE) solvers have received great attention in recent years. Most progress in this area has been driven by deep neural networks such as physics-informed neural networks (PINNs) and by kernel methods. In this paper, we introduce a random feature based framework toward efficiently solving PDEs. The random feature method was originally proposed to approximate large-scale kernel machines and can be viewed as a shallow neural network as well. We provide an error analysis for our proposed method along with comprehensive numerical results on several PDE benchmarks. In contrast to the state-of-the-art solvers that face challenges with a large number of collocation points, our proposed method reduces the computational complexity. Moreover, the implementation of our method is simple and does not require additional computational resources. Due to the theoretical guarantee and advantages in computation, our approach is proven to be efficient for solving PDEs.
[LG-68] ReFormer: Generating Radio Fakes for Data Augmentation
链接: https://arxiv.org/abs/2501.00282
作者: Yagna Kaasaragadda,Silvija Kokalj-Filipovic
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:We present ReFormer, a generative AI (GAI) model that can efficiently generate synthetic radio-frequency (RF) data, or RF fakes, statistically similar to the data it was trained on, or with modified statistics, in order to augment datasets collected in real-world experiments. For applications like this, adaptability and scalability are important issues. This is why ReFormer leverages transformer-based autoregressive generation, trained on learned discrete representations of RF signals. By using prompts, such GAI can be made to generate the data which complies with specific constraints or conditions, particularly useful for training channel estimation and modeling. It may also leverage the data from a source system to generate training data for a target system. We show how different transformer architectures and other design choices affect the quality of generated RF fakes, evaluated using metrics such as precision and recall, classification accuracy and signal constellation diagrams.
[LG-69] Towards Pattern-aware Data Augmentation for Temporal Knowledge Graph Completion
链接: https://arxiv.org/abs/2501.00252
作者: Jiasheng Zhang,Deqiang Ouyang,Shuang Liang,Jie Shao
类目: Machine Learning (cs.LG); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:
Abstract:Predicting missing facts for temporal knowledge graphs (TKGs) is a fundamental task, called temporal knowledge graph completion (TKGC). One key challenge in this task is the imbalance in data distribution, where facts are unevenly spread across entities and timestamps. This imbalance can lead to poor completion performance for long-tail entities and timestamps, and unstable training due to the introduction of false negative samples. Unfortunately, few previous studies have investigated how to mitigate these effects. Moreover, for the first time, we found that existing methods suffer from model preferences, revealing that entities with specific properties (e.g., recently active) are favored by different models. Such preferences will lead to error accumulation and further exacerbate the effects of imbalanced data distribution, but are overlooked by previous studies. To alleviate the impacts of imbalanced data and model preferences, we introduce Booster, the first data augmentation strategy for TKGs. The unique requirements here lie in generating new samples that fit the complex semantic and temporal patterns within TKGs, and identifying hard-learning samples specific to models. Therefore, we propose a hierarchical scoring algorithm based on triadic closures within TKGs. By incorporating both global semantic patterns and local time-aware structures, the algorithm enables pattern-aware validation for new samples. Meanwhile, we propose a two-stage training approach to identify samples that deviate from the model’s preferred patterns. With a well-designed frequency-based filtering strategy, this approach also helps to avoid being misled by false negatives. Experiments show that Booster can seamlessly adapt to existing TKGC models and achieve up to an 8.7% performance improvement.
[LG-70] Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes NEURIPS2024
链接: https://arxiv.org/abs/2501.00200
作者: Duo Zhou,Christopher Brix,Grani A Hanasusanto,Huan Zhang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
*备注: Accepted by NeurIPS 2024. BICCOS is part of the alpha-beta-CROWN verifier, the VNN-COMP 2024 winner
Abstract:Recently, cutting-plane methods such as GCP-CROWN have been explored to enhance neural network verifiers and made significant advances. However, GCP-CROWN currently relies on generic cutting planes (cuts) generated from external mixed integer programming (MIP) solvers. Due to the poor scalability of MIP solvers, large neural networks cannot benefit from these cutting planes. In this paper, we exploit the structure of the neural network verification problem to generate efficient and scalable cutting planes specific for this problem setting. We propose a novel approach, Branch-and-bound Inferred Cuts with COnstraint Strengthening (BICCOS), which leverages the logical relationships of neurons within verified subproblems in the branch-and-bound search tree, and we introduce cuts that preclude these relationships in other subproblems. We develop a mechanism that assigns influence scores to neurons in each path to allow the strengthening of these cuts. Furthermore, we design a multi-tree search technique to identify more cuts, effectively narrowing the search space and accelerating the BaB algorithm. Our results demonstrate that BICCOS can generate hundreds of useful cuts during the branch-and-bound process and consistently increase the number of verifiable instances compared to other state-of-the-art neural network verifiers on a wide range of benchmarks, including large networks that previous cutting plane methods could not scale to. BICCOS is part of the $\alpha,\beta$-CROWN verifier, the VNN-COMP 2024 winner. The code is available at this http URL.
[LG-71] Urban Water Consumption Forecasting Using Deep Learning and Correlated District Metered Areas
链接: https://arxiv.org/abs/2501.00158
作者: Kleanthis Malialis,Nefeli Mavri,Stelios G. Vrachimis,Marios S. Kyriakou,Demetrios G. Eliades,Marios M. Polycarpou
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Keywords: urban water management, water consumption, time series forecasting
Abstract:Accurate water consumption forecasting is a crucial tool for water utilities and policymakers, as it helps ensure a reliable supply, optimize operations, and support infrastructure planning. Urban Water Distribution Networks (WDNs) are divided into District Metered Areas (DMAs), where water flow is monitored to efficiently manage resources. This work focuses on short-term forecasting of DMA consumption using deep learning and aims to address two key challenging issues. First, forecasting based solely on a DMA’s historical data may lack broader context and provide limited insights. Second, DMAs may experience sensor malfunctions providing incorrect data, or some DMAs may not be monitored at all due to computational costs, complicating accurate forecasting. We propose a novel method that first identifies DMAs with correlated consumption patterns and then uses these patterns, along with the DMA’s local data, as input to a deep learning model for forecasting. In a real-world study with data from five DMAs, we show that: i) the deep learning model outperforms a classical statistical model; ii) accurate forecasting can be carried out using only correlated DMAs’ consumption patterns; and iii) even when a DMA’s local data is available, including correlated DMAs’ data improves accuracy.
[LG-72] Dynamic Optimization of Storage Systems Using Reinforcement Learning Techniques
链接: https://arxiv.org/abs/2501.00068
作者: Chiyu Cheng,Chang Zhou,Yang Zhao,Jin Cao
类目: Operating Systems (cs.OS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The exponential growth of data-intensive applications has placed unprecedented demands on modern storage systems, necessitating dynamic and efficient optimization strategies. Traditional heuristics employed for storage performance optimization often fail to adapt to the variability and complexity of contemporary workloads, leading to significant performance bottlenecks and resource inefficiencies. To address these challenges, this paper introduces RL-Storage, a novel reinforcement learning (RL)-based framework designed to dynamically optimize storage system configurations. RL-Storage leverages deep Q-learning algorithms to continuously learn from real-time I/O patterns and predict optimal storage parameters, such as cache size, queue depths, and readahead settings[1]. The proposed framework operates within the storage kernel, ensuring minimal latency and low computational overhead. Through an adaptive feedback mechanism, RL-Storage dynamically adjusts critical parameters, achieving efficient resource utilization across a wide range of workloads. Experimental evaluations conducted on a range of benchmarks, including RocksDB and PostgreSQL, demonstrate significant improvements, with throughput gains of up to 2.6x and latency reductions of 43% compared to baseline heuristics. Additionally, RL-Storage achieves these performance enhancements with a negligible CPU overhead of 0.11% and a memory footprint of only 5 KB, making it suitable for seamless deployment in production environments. This work underscores the transformative potential of reinforcement learning techniques in addressing the dynamic nature of modern storage systems. By autonomously adapting to workload variations in real time, RL-Storage provides a robust and scalable solution for optimizing storage performance, paving the way for next-generation intelligent storage infrastructures.
[LG-73] Lungmix: A Mixup-Based Strategy for Generalization in Respiratory Sound Classification
链接: https://arxiv.org/abs/2501.00064
作者: Shijia Ge,Weixiang Zhang,Shuzhao Xie,Baixu Yan,Zhi Wang
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 4 pages, 3 figures, conference paper
Abstract:Respiratory sound classification plays a pivotal role in diagnosing respiratory diseases. While deep learning models have shown success with various respiratory sound datasets, our experiments indicate that models trained on one dataset often fail to generalize effectively to others, mainly due to data collection and annotation inconsistencies. To address this limitation, we introduce Lungmix, a novel data augmentation technique inspired by Mixup. Lungmix generates augmented data by blending waveforms using loudness and random masks while interpolating labels based on their semantic meaning, helping the model learn more generalized representations. Comprehensive evaluations across three datasets, namely ICBHI, SPR, and HF, demonstrate that Lungmix significantly enhances model generalization to unseen data. In particular, Lungmix boosts the 4-class classification score by up to 3.55%, achieving performance comparable to models trained directly on the target dataset.
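For readers unfamiliar with the Mixup baseline that Lungmix builds on, a minimal sketch follows. It shows plain Mixup only: a convex combination of two waveforms and of their one-hot labels with a mixing weight `lam`. The loudness-based blending, random masks, and semantic label interpolation described in the abstract are not reproduced here, and the function name `mixup` is hypothetical.

```python
def mixup(x1, y1, x2, y2, lam):
    """Plain Mixup: blend two samples and their label vectors with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(y1) != len(y2):
        raise ValueError("label vectors must have the same length")
    # zip truncates to the shorter waveform, which is an assumption of this sketch.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * p + (1.0 - lam) * q for p, q in zip(y1, y2)]
    return x, y


# Two toy "waveforms" with one-hot labels for a 2-class problem.
wave_a, label_a = [1.0, 0.0], [1.0, 0.0]
wave_b, label_b = [0.0, 1.0], [0.0, 1.0]
mixed_wave, mixed_label = mixup(wave_a, label_a, wave_b, label_b, lam=0.25)
print(mixed_wave, mixed_label)
```

In standard Mixup the weight `lam` is usually drawn from a Beta distribution for each pair; it is passed explicitly here to keep the example deterministic.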
[LG-74] Efficient and Scalable Deep Reinforcement Learning for Mean Field Control Games
链接: https://arxiv.org/abs/2501.00052
作者: Nianli Peng,Yilin Wang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:
Abstract:Mean Field Control Games (MFCGs) provide a powerful theoretical framework for analyzing systems of infinitely many interacting agents, blending elements from Mean Field Games (MFGs) and Mean Field Control (MFC). However, solving the coupled Hamilton-Jacobi-Bellman and Fokker-Planck equations that characterize MFCG equilibria remains a significant computational challenge, particularly in high-dimensional or complex environments. This paper presents a scalable deep Reinforcement Learning (RL) approach to approximate equilibrium solutions of MFCGs. Building on previous works, we reformulate the infinite-agent stochastic control problem as a Markov Decision Process, where each representative agent interacts with the evolving mean field distribution. We use the actor-critic based algorithm from a previous paper (Angiuli et al., 2024) as the baseline and propose several versions of more scalable and efficient algorithms, utilizing techniques including parallel sample collection (batching); mini-batching; target network; proximal policy optimization (PPO); generalized advantage estimation (GAE); and entropy regularization. By leveraging these techniques, we effectively improved the efficiency, scalability, and training stability of the baseline algorithm. We evaluate our method on a linear-quadratic benchmark problem, where an analytical solution to the MFCG equilibrium is available. Our results show that some versions of our proposed approach achieve faster convergence and closely approximate the theoretical optimum, outperforming the baseline algorithm by an order of magnitude in sample efficiency. Our work lays the foundation for adapting deep RL to solve more complicated MFCGs closely related to real life, such as large-scale autonomous transportation systems, multi-firm economic competition, and inter-bank borrowing problems.
[LG-75] Learning in Multiple Spaces: Few-Shot Network Attack Detection with Metric-Fused Prototypical Networks AAAI-25
链接: https://arxiv.org/abs/2501.00050
作者: Fernando Martinez-Lopez,Lesther Santana,Mohamed Rahouti
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: The AAAI-25 Workshop on Artificial Intelligence for Cyber Security (AICS)
Abstract:Network intrusion detection systems face significant challenges in identifying emerging attack patterns, especially when limited data samples are available. To address this, we propose a novel Multi-Space Prototypical Learning (MSPL) framework tailored for few-shot attack detection. The framework operates across multiple metric spaces (Euclidean, Cosine, Chebyshev, and Wasserstein distances), integrated through a constrained weighting scheme to enhance embedding robustness and improve pattern recognition. By leveraging Polyak-averaged prototype generation, the framework stabilizes the learning process and effectively adapts to rare and zero-day attacks. Additionally, an episodic training paradigm ensures balanced representation across diverse attack classes, enabling robust generalization. Experimental results on benchmark datasets demonstrate that MSPL outperforms traditional approaches in detecting low-profile and novel attack types, establishing it as a robust solution for zero-day attack detection.
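The two ingredients named in the abstract, a weighted fusion of several distances and Polyak-averaged prototypes, can be sketched as follows. This is an illustrative reading, not the MSPL implementation: the "constrained weighting scheme" is assumed here to be a softmax over free logits (so the weights are positive and sum to one), the Wasserstein term is omitted, and all function names are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_distance(u, v, logits):
    """Convex combination of the three distances; weights come from a softmax (assumption)."""
    weights = softmax(logits)
    metrics = (euclidean, cosine_distance, chebyshev)
    return sum(w * metric(u, v) for w, metric in zip(weights, metrics))


def polyak_update(prototype, class_mean, beta=0.9):
    """Exponential moving average of a class prototype across episodes."""
    return [beta * p + (1.0 - beta) * m for p, m in zip(prototype, class_mean)]


def classify(query, prototypes, logits):
    """Assign the label of the prototype with the smallest fused distance."""
    return min(prototypes, key=lambda label: fused_distance(query, prototypes[label], logits))


prototypes = {"benign": [1.0, 0.0], "attack": [0.0, 1.0]}
print(classify([0.9, 0.1], prototypes, logits=[0.0, 0.0, 0.0]))
```

In a trained model the logits would be learned together with the embedding network; equal logits simply give each metric the same weight.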
[LG-76] Numerical solutions of fixed points in two-dimensional Kuramoto-Sivashinsky equation expedited by reinforcement learning
链接: https://arxiv.org/abs/2501.00046
作者: Juncheng Jiang,Dongdong Wan,Mengqi Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a combined approach to enhancing the effectiveness of the Jacobian-Free Newton-Krylov (JFNK) method by deep reinforcement learning (DRL) in identifying fixed points within the 2D Kuramoto-Sivashinsky Equation (KSE). The JFNK approach requires a good initial guess for improved convergence when searching for fixed points. With a properly defined reward function, we utilise DRL as a preliminary step to enhance the initial guess in the converging process. We report new fixed points in the 2D KSE which have not been reported in the literature. Additionally, we explored control optimization for the 2D KSE to navigate the system trajectories between known fixed points, based on parallel reinforcement learning techniques. This combined method underscores the improved JFNK approach to finding new fixed-point solutions within the context of 2D KSE, which may be instructive for other high-dimensional dynamical systems.
[LG-77] Deep Discrete Encoders: Identifiable Deep Generative Models for Rich Data with Discrete Latent Layers
链接: https://arxiv.org/abs/2501.01414
作者: Seunghyun Lee,Yuqi Gu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:In the era of generative AI, deep generative models (DGMs) with latent representations have gained tremendous popularity. Despite their impressive empirical performance, the statistical properties of these models remain underexplored. DGMs are often overparametrized, non-identifiable, and uninterpretable black boxes, raising serious concerns when deploying them in high-stakes applications. Motivated by this, we propose an interpretable deep generative modeling framework for rich data types with discrete latent layers, called Deep Discrete Encoders (DDEs). A DDE is a directed graphical model with multiple binary latent layers. Theoretically, we propose transparent identifiability conditions for DDEs, which imply progressively smaller sizes of the latent layers as they go deeper. Identifiability ensures consistent parameter estimation and inspires an interpretable design of the deep architecture. Computationally, we propose a scalable estimation pipeline of a layerwise nonlinear spectral initialization followed by a penalized stochastic approximation EM algorithm. This procedure can efficiently estimate models with exponentially many latent components. Extensive simulation studies validate our theoretical results and demonstrate the proposed algorithms’ excellent performance. We apply DDEs to three diverse real datasets for hierarchical topic modeling, image representation learning, and response time modeling in educational testing, and obtain interpretable findings.
[LG-78] Learning Spectral Methods by Transformers
链接: https://arxiv.org/abs/2501.01312
作者: Yihan He,Yuan Cao,Hong-Yu Chen,Dennis Wu,Jianqing Fan,Han Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 77 pages, 12 figures
Abstract:Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently large set of pre-training instances, are able to learn the algorithms themselves and perform statistical estimation tasks given new instances. This learning paradigm is distinct from the in-context learning setup and is similar to the learning procedure of human brains where skills are learned through past experience. Theoretically, we prove that pre-trained Transformers can learn the spectral methods and use the classification of bi-class Gaussian mixture model as an example. Our proof is constructive using algorithmic design techniques. Our results are built upon the similarities of multi-layered Transformer architecture with the iterative recovery algorithms used in practice. Empirically, we verify the strong capacity of the multi-layered (pre-trained) Transformer on unsupervised learning through the lens of both the PCA and the Clustering tasks performed on the synthetic and real-world datasets.
[LG-79] Marketing Mix Modeling in Lemonade
链接: https://arxiv.org/abs/2501.01276
作者: Roy Ravid
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Marketing mix modeling (MMM) is a widely used method to assess the effectiveness of marketing campaigns and optimize marketing strategies. Bayesian MMM is an advanced approach that allows for the incorporation of prior information, uncertainty quantification, and probabilistic predictions (1). In this paper, we describe the process of building a Bayesian MMM model for the online insurance company Lemonade. We first collected data on Lemonade’s marketing activities, such as online advertising, social media, and brand marketing, as well as performance data. We then used a Bayesian framework to estimate the contribution of each marketing channel on total performance, while accounting for various factors such as seasonality, market trends, and macroeconomic indicators. To validate the model, we compared its predictions with the actual performance data from A/B-testing and sliding window holdout data (2). The results showed that the predicted contribution of each marketing channel is aligned with A/B test performance and is actionable. Furthermore, we conducted several scenario analyses using convex optimization to test the sensitivity of the model to different assumptions and to evaluate the impact of changes in the marketing mix on sales. The insights gained from the model allowed Lemonade to adjust their marketing strategy and allocate their budget more effectively. Our case study demonstrates the benefits of using Bayesian MMM for marketing attribution and optimization in a data-driven company like Lemonade. The approach is flexible, interpretable, and can provide valuable insights for decision-making.
[LG-80] Ultrasound Lung Aeration Map via Physics-Aware Neural Operators
链接: https://arxiv.org/abs/2501.01157
作者: Jiayun Wang,Oleksii Ostras,Masashi Sode,Bahareh Tolooshams,Zongyi Li,Kamyar Azizzadenesheli,Gianmarco Pinton,Anima Anandkumar
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:
Abstract:Lung ultrasound is a growing modality in clinics for diagnosing and monitoring acute and chronic lung diseases due to its low cost and accessibility. Lung ultrasound works by emitting diagnostic pulses, receiving pressure waves and converting them into radio frequency (RF) data, which are then processed into B-mode images with beamformers for radiologists to interpret. However, unlike conventional ultrasound for soft tissue anatomical imaging, lung ultrasound interpretation is complicated by complex reverberations from the pleural interface caused by the inability of ultrasound to penetrate air. The indirect B-mode images make interpretation highly dependent on reader expertise, requiring years of training, which limits its widespread use despite its potential for high accuracy in skilled hands. To address these challenges and democratize ultrasound lung imaging as a reliable diagnostic tool, we propose LUNA, an AI model that directly reconstructs lung aeration maps from RF data, bypassing the need for traditional beamformers and indirect interpretation of B-mode images. LUNA uses a Fourier neural operator, which processes RF data efficiently in Fourier space, enabling accurate reconstruction of lung aeration maps. LUNA offers a quantitative, reader-independent alternative to traditional semi-quantitative lung ultrasound scoring methods. The development of LUNA involves synthetic and real data: We simulate synthetic data with an experimentally validated approach and scan ex vivo swine lungs as real data. Trained on abundant simulated data and fine-tuned with a small amount of real-world data, LUNA achieves robust performance, demonstrated by an aeration estimation error of 9% in ex-vivo lung scans. We demonstrate the potential of reconstructing lung aeration maps from RF data, providing a foundation for improving lung ultrasound reproducibility and diagnostic utility. 
[LG-81] An Efficient Outlier Detection Algorithm for Data Streaming
Link: https://arxiv.org/abs/2501.01061
Authors: Rui Hu, Luc (Zhilu) Chen, Yiwei Wang
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 12 pages, 10 figures
Abstract:The nature of modern data is increasingly real-time, making outlier detection crucial in any data-related field, such as finance for fraud detection and healthcare for monitoring patient vitals. Traditional outlier detection methods, such as the Local Outlier Factor (LOF) algorithm, struggle with real-time data due to the need for extensive recalculations with each new data point, limiting their application in real-time environments. While the Incremental LOF (ILOF) algorithm has been developed to tackle the challenges of online anomaly detection, it remains computationally expensive when processing large streams of data points, and its detection performance may degrade after a certain threshold of points have streamed in. In this paper, we propose a novel approach to enhance the efficiency of LOF algorithms for online anomaly detection, named the Efficient Incremental LOF (EILOF) algorithm. The EILOF algorithm only computes the LOF scores of new points without altering the LOF scores of existing data points. Although exact LOF scores have not yet been computed for the existing points in the new algorithm, datasets often contain noise, and minor deviations in LOF score calculations do not necessarily degrade detection performance. In fact, such deviations can sometimes enhance outlier detection. We systematically tested this approach on both simulated and real-world datasets, demonstrating that EILOF outperforms ILOF as the volume of streaming data increases across various scenarios. The EILOF algorithm not only significantly reduces computational costs, but also systematically improves detection accuracy when the number of additional points increases compared to the ILOF algorithm.
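The idea of scoring only the new point can be illustrated with a small sketch. This is not the authors' EILOF implementation: it assumes a frozen reference batch whose k-distances and local reachability densities are computed once, and it scores each incoming point against those cached values without updating them. The function names (`fit_reference`, `score_new_point`) and the choice not to add new points to the reference set are assumptions made for illustration.

```python
import numpy as np

def k_nearest(points, query, k):
    """Return indices and distances of the k rows of `points` closest to `query`."""
    dists = np.linalg.norm(points - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

def fit_reference(points, k):
    """Compute k-distance and local reachability density (lrd) once for the reference batch."""
    n = len(points)
    kdist = np.empty(n)
    neighbours = []
    for i in range(n):
        others = np.delete(np.arange(n), i)          # exclude the point itself
        idx, d = k_nearest(points[others], points[i], k)
        neighbours.append((others[idx], d))
        kdist[i] = d[-1]                             # distance to the k-th neighbour
    lrd = np.empty(n)
    for i, (idx, d) in enumerate(neighbours):
        reach = np.maximum(d, kdist[idx])            # reachability distances
        lrd[i] = 1.0 / (reach.mean() + 1e-12)
    return kdist, lrd

def score_new_point(points, kdist, lrd, query, k):
    """LOF-style score for a new point; cached kdist and lrd are read, never modified."""
    idx, d = k_nearest(points, query, k)
    reach = np.maximum(d, kdist[idx])
    lrd_query = 1.0 / (reach.mean() + 1e-12)
    return lrd[idx].mean() / lrd_query               # about 1 for inliers, larger for outliers
```

Because the cached arrays are never rewritten, the cost per arriving point is one neighbour search, which is the kind of saving the abstract describes; the scores of older points become approximate as the stream grows.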
[LG-82] On the Implementation of a Bayesian Optimization Framework for Interconnected Systems
Link: https://arxiv.org/abs/2501.00967
Authors: Leonardo D. González, Victor M. Zavala
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 32 pages, 12 figures
Abstract:Bayesian optimization (BO) is an effective paradigm for the optimization of expensive-to-sample systems. Standard BO learns the performance of a system f(x) by using a Gaussian Process (GP) model; this treats the system as a black-box and limits its ability to exploit available structural knowledge (e.g., physics and sparse interconnections in a complex system). Grey-box modeling, wherein the performance function is treated as a composition of known and unknown intermediate functions f(x, y(x)) (where y(x) is a GP model) offers a solution to this limitation; however, generating an analytical probability density for f from the Gaussian density of y(x) is often an intractable problem (e.g., when f is nonlinear). Previous work has handled this issue by using sampling techniques or by solving an auxiliary problem over an augmented space where the values of y(x) are constrained by confidence intervals derived from the GP models; such solutions are computationally intensive. In this work, we provide a detailed implementation of a recently proposed grey-box BO paradigm, BOIS, that uses adaptive linearizations of f to obtain analytical expressions for the statistical moments of the composite function. We show that the BOIS approach enables the exploitation of structural knowledge, such as that arising in interconnected systems as well as systems that embed multiple GP models and combinations of physics and GP models. We benchmark the effectiveness of BOIS against standard BO and existing grey-box BO algorithms using a pair of case studies focused on chemical process optimization and design. Our results indicate that BOIS performs as well as or better than existing grey-box methods, while also being less computationally intensive.
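To make the linearization idea concrete, here is a generic first-order (delta-method) sketch. It is not taken from the paper: the reference point $\hat{y}$, the Jacobian $J$, and the GP posterior moments $\mu_y(x)$, $\Sigma_y(x)$ are notation assumed here, and the paper's adaptive choice of linearization point may differ.

```latex
% First-order expansion of the composite function around a reference point \hat{y}
f(x, y) \approx f(x, \hat{y}) + J\,(y - \hat{y}),
\qquad J = \left.\frac{\partial f}{\partial y}\right|_{y = \hat{y}}

% With y(x) \sim \mathcal{N}(\mu_y(x), \Sigma_y(x)), the linearized f is Gaussian with
\mathbb{E}[f] \approx f(x, \hat{y}) + J\,(\mu_y(x) - \hat{y}),
\qquad
\operatorname{Var}[f] \approx J\,\Sigma_y(x)\,J^{\top}
```

Since an affine map of a Gaussian is Gaussian, both moments are available in closed form, which avoids the sampling or auxiliary optimization that the abstract describes as computationally intensive.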
[LG-83] Tight Constraint Prediction of Six-Degree-of-Freedom Transformer-based Powered Descent Guidance
Link: https://arxiv.org/abs/2501.00930
Authors: Julia Briden, Trey Gurga, Breanna Johnson, Abhishek Cauligi, Richard Linares
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: AIAA SCITECH 2025 Forum
Abstract:This work introduces Transformer-based Successive Convexification (T-SCvx), an extension of Transformer-based Powered Descent Guidance (T-PDG), generalizable for efficient six-degree-of-freedom (DoF) fuel-optimal powered descent trajectory generation. Our approach significantly enhances the sample efficiency and solution quality for nonconvex powered descent guidance by employing a rotation-invariant transformation of the sampled dataset. T-PDG was previously applied to the 3-DoF minimum-fuel powered descent guidance problem, improving solution times by up to an order of magnitude compared to lossless convexification (LCvx). By learning to predict the set of tight or active constraints at the optimal control problem's solution, T-SCvx creates the minimal reduced-size problem initialized with only the tight constraints, then uses the solution of this reduced problem to warm-start the direct optimization solver. 6-DoF powered descent guidance is known to be challenging to solve quickly and reliably due to the nonlinear and non-convex nature of the problem, the discretization scheme heavily influencing solution validity, and reference trajectory initialization determining algorithm convergence or divergence. Our contributions in this work address these challenges by extending T-PDG to learn the set of tight constraints for the successive convexification (SCvx) formulation of the 6-DoF powered descent guidance problem. In addition to reducing the problem size, feasible and locally optimal reference trajectories are also learned to facilitate convergence from the initial guess. T-SCvx enables onboard computation of real-time guidance trajectories, demonstrated by a 6-DoF Mars powered landing application problem.
[LG-84] A Graphical Approach to State Variable Selection in Off-policy Learning
Link: https://arxiv.org/abs/2501.00854
Authors: Joakim Blach Andersen, Qingyuan Zhao
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 25 pages (not including appendix and references), 10 figures, 2 tables
Abstract:Sequential decision problems are widely studied across many areas of science. A key challenge when learning policies from historical data - a practice commonly referred to as off-policy learning - is how to "identify" the impact of a policy of interest when the observed data are not randomized. Off-policy learning has mainly been studied in two settings: dynamic treatment regimes (DTRs), where the focus is on controlling confounding in medical problems with short decision horizons, and offline reinforcement learning (RL), where the focus is on dimension reduction in closed systems such as games. The gap between these two well-studied settings has limited the wider application of off-policy learning to many real-world problems. Using the theory for causal inference based on acyclic directed mixed graphs (ADMGs), we provide a set of graphical identification criteria in general decision processes that encompass both DTRs and MDPs. We discuss how our results relate to the often implicit causal assumptions made in the DTR and RL literatures and further clarify several common misconceptions. Finally, we present a realistic simulation study for the dynamic pricing problem encountered in container logistics, and demonstrate how violations of our graphical criteria can lead to suboptimal policies.
[LG-85] A Distributional Evaluation of Generative Image Models
Link: https://arxiv.org/abs/2501.00744
Authors: Edric Tam, Barbara E Engelhardt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Generative models are ubiquitous in modern artificial intelligence (AI) applications. Recent advances have led to a variety of generative modeling approaches that are capable of synthesizing highly realistic samples. Despite these developments, evaluating the distributional match between the synthetic samples and the target distribution in a statistically principled way remains a core challenge. We focus on evaluating image generative models, where studies often treat human evaluation as the gold standard. Commonly adopted metrics, such as the Fréchet Inception Distance (FID), do not sufficiently capture the differences between the learned and target distributions, because the assumption of normality ignores differences in the tails. We propose the Embedded Characteristic Score (ECS), a comprehensive metric for evaluating the distributional match between the learned and target sample distributions, and explore its connection with moments and tail behavior. We derive natural properties of ECS and show its practical use via simulations and an empirical study.
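For context on the normality remark, the standard definition of FID is the Fréchet distance between two Gaussians fitted to Inception feature embeddings of real and generated images. The symbols below ($\mu_r, \Sigma_r$ for real features, $\mu_g, \Sigma_g$ for generated features) are notation chosen here, not taken from the abstract.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Only the means and covariances enter the formula, so two feature distributions that agree in their first two moments but differ in their tails receive an FID of zero. That is the gap the abstract attributes to the normality assumption.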
[LG-86] Learning Weather Models from Data with WSINDy
Link: https://arxiv.org/abs/2501.00738
Authors: Seth Minor, Daniel A. Messenger, Vanja Dukic, David M. Bortz
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:The multiscale and turbulent nature of Earth’s atmosphere has historically rendered accurate weather modeling a hard problem. Recently, there has been an explosion of interest surrounding data-driven approaches to weather modeling, which in many cases show improved forecasting accuracy and computational efficiency when compared to traditional methods. However, many of the current data-driven approaches employ highly parameterized neural networks, often resulting in uninterpretable models and limited gains in scientific understanding. In this work, we address the interpretability problem by explicitly discovering partial differential equations governing various weather phenomena, identifying symbolic mathematical models with direct physical interpretations. The purpose of this paper is to demonstrate that, in particular, the Weak form Sparse Identification of Nonlinear Dynamics (WSINDy) algorithm can learn effective weather models from both simulated and assimilated data. Our approach adapts the standard WSINDy algorithm to work with high-dimensional fluid data of arbitrary spatial dimension. Moreover, we develop an approach for handling terms that are not integrable-by-parts, such as advection operators.
[LG-87] Enhancing Unsupervised Feature Selection via Double Sparsity Constrained Optimization
Link: https://arxiv.org/abs/2501.00726
Authors: Xianchao Xiu, Anning Yang, Chenyi Huang, Xinrong Li, Wanquan Liu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Unsupervised feature selection (UFS) is widely applied in machine learning and pattern recognition. However, most of the existing methods only consider a single sparsity, which makes it difficult to select valuable and discriminative feature subsets from the original high-dimensional feature set. In this paper, we propose a new UFS method called DSCOFS via embedding double sparsity constrained optimization into the classical principal component analysis (PCA) framework. Double sparsity refers to using the $\ell_{2,0}$-norm and the $\ell_0$-norm to simultaneously constrain variables; combining these two types of sparsity improves the accuracy of identifying differential features. The core idea is that the $\ell_{2,0}$-norm can remove irrelevant and redundant features, while the $\ell_0$-norm can filter out irregular noisy features, thereby complementing the $\ell_{2,0}$-norm to improve discrimination. An effective proximal alternating minimization method is proposed to solve the resulting nonconvex nonsmooth model. Theoretically, we rigorously prove that the sequence generated by our method globally converges to a stationary point. Numerical experiments on three synthetic datasets and eight real-world datasets demonstrate the effectiveness, stability, and convergence of the proposed method. In particular, the average clustering accuracy (ACC) and normalized mutual information (NMI) are improved by at least 3.34% and 3.02%, respectively, compared with the state-of-the-art methods. More importantly, two common statistical tests and a new feature similarity metric verify the advantages of double sparsity. All results suggest that our proposed DSCOFS provides a new perspective for feature selection.
[LG-88] Different thresholding methods on Nearest Shrunken Centroid algorithm
Link: https://arxiv.org/abs/2501.00632
Authors: Mohammad Omar Sahtout, Haiyan Wang, Santosh Ghimire
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 30 pages, 9 figures
Abstract:This article considers the impact of different thresholding methods on the Nearest Shrunken Centroid algorithm, which is popularly referred to as the Prediction Analysis of Microarrays (PAM) for high-dimensional classification. PAM uses soft thresholding to achieve high computational efficiency and high classification accuracy, but at the price of retaining too many features. When applied to microarray data on human cancers, PAM selected 2611 features on average from 10 multi-class datasets. Such a large number of features makes it difficult to perform follow-up studies. One reason behind this problem is the soft thresholding, which is known to produce biased parameter estimates in regression analysis. In this article, we extend the PAM algorithm with two other thresholding methods, hard and order thresholding, and a deep search algorithm to achieve better thresholding parameter estimates. The modified algorithms are extensively tested and compared to the original one based on real data and Monte Carlo studies. In general, the modification not only gave better cancer status prediction accuracy, but also resulted in more parsimonious models with a significantly smaller number of features.
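The difference between soft and hard thresholding can be seen in a few lines. This sketch uses the standard textbook definitions applied to a vector `d` of standardized centroid differences; the variable names are illustrative, and the paper's order thresholding and deep search procedure are not shown.

```python
import numpy as np

def soft_threshold(d, delta):
    """Shrink every entry toward zero by delta; entries within delta of zero become zero."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def hard_threshold(d, delta):
    """Zero out entries whose magnitude is at most delta; leave the others unchanged."""
    return np.where(np.abs(d) > delta, d, 0.0)

d = np.array([-3.0, -0.5, 0.2, 1.5])
soft = soft_threshold(d, 1.0)   # surviving entries are shrunk by 1.0
hard = hard_threshold(d, 1.0)   # surviving entries keep their original value
```

For the same threshold both rules keep the same entries, but soft thresholding also shrinks the survivors, which is the source of the bias mentioned in the abstract.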
[LG-89] Polynomial time sampling from log-smooth distributions in fixed dimension under semi-log-concavity of the forward diffusion with application to strongly dissipative distributions
Link: https://arxiv.org/abs/2501.00565
Authors: Adrien Vacher, Omar Chehab, Anna Korba
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:In this article we provide a stochastic sampling algorithm with polynomial complexity in fixed dimension that leverages the recent advances on diffusion models where it is shown that under mild conditions, sampling can be achieved via an accurate estimation of intermediate scores across the marginals $(p_t)_{t \ge 0}$ of the standard Ornstein-Uhlenbeck process started at the density we wish to sample from. The heart of our method consists in approximating these scores via a computationally cheap estimator and relating the variance of this estimator to the smoothness properties of the forward process. Under the assumption that the density to sample from is $L$-log-smooth and that the forward process is semi-log-concave: $-\nabla^2 \log(p_t) \succeq -\beta I_d$ for some $\beta \geq 0$, we prove that our algorithm achieves an expected $\epsilon$ error in $\mathrm{KL}$ divergence in $O(d^7 L^{d+2} \epsilon^{-2(d+3)} (L+\beta)^2 d^{2(d+1)})$ time. In particular, our result allows us to fully transfer the problem of sampling from a log-smooth distribution into a regularity estimate problem. As an application, we derive an exponential complexity improvement for the problem of sampling from an $L$-log-smooth distribution that is $\alpha$-strongly log-concave outside some ball of radius $R$: after proving that such distributions verify the semi-log-concavity assumption, a result which might be of independent interest, we recover a $\mathrm{poly}(R, L, \alpha^{-1}, \epsilon^{-1})$ complexity in fixed dimension which exponentially improves upon the previously known $\mathrm{poly}(e^{RL^2}, L, \alpha^{-1}, \log(\epsilon^{-1}))$ complexity in the low precision regime.
[LG-90] Finding the Underlying Viscoelastic Constitutive Equation via Universal Differential Equations and Differentiable Physics
Link: https://arxiv.org/abs/2501.00556
Authors: Elias C. Rodrigues, Roney L. Thompson, Dário A.B. Oliveira, Roberto F. Ausas
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:This research employs Universal Differential Equations (UDEs) alongside differentiable physics to model viscoelastic fluids, merging conventional differential equations, neural networks and numerical methods to reconstruct missing terms in constitutive models. This study focuses on analyzing four viscoelastic models: Upper Convected Maxwell (UCM), Johnson-Segalman, Giesekus, and Exponential Phan-Thien-Tanner (ePTT), through the use of synthetic datasets. The methodology was tested across different experimental conditions, including oscillatory and startup flows. While the UDE framework effectively predicts shear and normal stresses for most models, it demonstrates some limitations when applied to the ePTT model. The findings underscore the potential of UDEs in fluid mechanics while identifying critical areas for methodological improvement. Also, a model distillation approach was employed to extract simplified models from complex ones, emphasizing the versatility and robustness of UDEs in rheological modeling.
[LG-91] LASSE: Learning Active Sampling for Storm Tide Extremes in Non-Stationary Climate Regimes
Link: https://arxiv.org/abs/2501.00149
Authors: Grace Jiang, Jiangchao Qiu, Sai Ravela
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:
Abstract:Identifying tropical cyclones (TCs) that generate destructive storm tides for risk assessment, such as from large downscaled storm catalogs for climate studies, is often intractable because it entails many expensive Monte Carlo hydrodynamic simulations. Here, we show that surrogate models are promising from accuracy, recall, and precision perspectives, and they "generalize" to novel climate scenarios. We then present an informative online learning approach to rapidly search for extreme storm tide-producing cyclones using only a few hydrodynamic simulations. Starting from a minimal subset of TCs with detailed storm tide hydrodynamic simulations, a surrogate model selects informative data to retrain online and iteratively improves its predictions of damaging TCs. Results on an extensive catalog of downscaled TCs indicate 100% precision in retrieving the rare destructive storms while using less than 20% of the simulations for training. The informative sampling approach is efficient, scalable to large storm catalogs, and generalizable to climate scenarios.
[LG-92] Post Launch Evaluation of Policies in a High-Dimensional Setting
Link: https://arxiv.org/abs/2501.00119
Authors: Shima Nassiri, Mohsen Bayati, Joe Cooprider
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
Comments: 15 pages, 2 figures, 6 tables
Abstract:A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in terms of time and resources, potentially exposing users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by “synthetic control” as an alternative to traditional A/B testing in settings with very large numbers of units, involving up to hundreds of millions of units, which is common in modern applications such as e-commerce and ride-sharing platforms. This method is particularly valuable in settings where the treatment affects only a subset of units, leaving many units unaffected. In these scenarios, synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units. After the treatment is implemented, these estimates can be compared to actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first using nearest neighbor matching based on unit covariates to select similar control units, then applying supervised learning methods suitable for high-dimensional data to estimate counterfactual outcomes. Testing using six large-scale experiments demonstrates that this approach successfully improves estimate accuracy. However, our analysis reveals that machine learning bias – which arises from methods that trade off bias for variance reduction – can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.
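The two-phase approach can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: Euclidean nearest-neighbour matching on covariates stands in for phase one, ridge regression on pre-treatment outcomes stands in for the unspecified "supervised learning methods" of phase two, and the de-biasing step is omitted. All function and argument names are hypothetical.

```python
import numpy as np

def match_controls(x_treated, X_controls, m):
    """Phase 1: indices of the m control units closest to the treated unit in covariate space."""
    dists = np.linalg.norm(X_controls - x_treated, axis=1)
    return np.argsort(dists)[:m]

def ridge_weights(Y_pre_controls, y_pre_treated, lam):
    """Phase 2: ridge fit of the treated unit's pre-treatment outcomes on the matched controls.

    Y_pre_controls has shape (T_pre, m); solves (A^T A + lam * I) w = A^T y.
    """
    A = Y_pre_controls
    gram = A.T @ A + lam * np.eye(A.shape[1])
    return np.linalg.solve(gram, A.T @ y_pre_treated)

def estimate_effect(x_t, y_t_pre, y_t_post, X_c, Y_c_pre, Y_c_post, m=5, lam=1.0):
    """Per-period treatment effect: observed post-treatment outcome minus counterfactual."""
    idx = match_controls(x_t, X_c, m)
    w = ridge_weights(Y_c_pre[:, idx], y_t_pre, lam)
    counterfactual = Y_c_post[:, idx] @ w
    return y_t_post - counterfactual
```

Matching first keeps the regression restricted to controls that resemble the treated unit, which is how the abstract proposes to limit interpolation bias. The ridge penalty `lam` is one example of the bias-for-variance trade-off that the abstract warns can itself distort effect estimates.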
[LG-93] Machine Learning Gravity Compactifications on Negatively Curved Manifolds
Link: https://arxiv.org/abs/2501.00093
Authors: G. Bruno De Luca
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Comments: 42 pages, code available at this http URL
Abstract:Constructing the landscape of vacua of higher-dimensional theories of gravity by directly solving the low-energy (semi-)classical equations of motion is notoriously difficult. In this work, we investigate the feasibility of Machine Learning techniques as tools for solving the equations of motion for general warped gravity compactifications. As a proof-of-concept we use Neural Networks to solve the Einstein PDEs on non-trivial three manifolds obtained by filling one or more cusps of hyperbolic manifolds. While in three dimensions an Einstein metric is also locally hyperbolic, the generality and scalability of Machine Learning methods, the availability of explicit families of hyperbolic manifolds in higher dimensions, and the universality of the filling procedure strongly suggest that the methods and code developed in this work can be of broader applicability. Specifically, they can be used to tackle both the geometric problem of numerically constructing novel higher-dimensional negatively curved Einstein metrics, as well as the physical problem of constructing four-dimensional de Sitter compactifications of M-theory on the same manifolds.
[LG-94] Insights on Galaxy Evolution from Interpretable Sparse Feature Networks
Link: https://arxiv.org/abs/2501.00089
Authors: John F. Wu
Subjects: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: Submitted to AAS Journals. 10 pages, 4 figures, 2 tables
Abstract:Galaxy appearances reveal the physics of how they formed and evolved. Machine learning models can now exploit galaxies’ information-rich morphologies to predict physical properties directly from image cutouts. Learning the relationship between pixel-level features and galaxy properties is essential for building a physical understanding of galaxy evolution, but we are still unable to explicate the details of how deep neural networks represent image features. To address this lack of interpretability, we present a novel neural network architecture called a Sparse Feature Network (SFNet). SFNets produce interpretable features that can be linearly combined in order to estimate galaxy properties like optical emission line ratios or gas-phase metallicity. We find that SFNets do not sacrifice accuracy in order to gain interpretability, and that they perform comparably well to cutting-edge models on astronomical machine learning tasks. Our novel approach is valuable for finding physical patterns in large datasets and helping astronomers interpret machine learning results.
[LG-95] High-Dimensional Markov-switching Ordinary Differential Processes
Link: https://arxiv.org/abs/2501.00087
Authors: Katherine Tsai, Mladen Kolar, Sanmi Koyejo
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
Comments:
Abstract:We investigate the parameter recovery of Markov-switching ordinary differential processes from discrete observations, where the differential equations are nonlinear additive models. This framework has been widely applied in biological systems, control systems, and other domains; however, limited research has been conducted on reconstructing the generating processes from observations. In contrast, many physical systems, such as human brains, cannot be directly experimented upon and rely on observations to infer the underlying systems. To address this gap, this manuscript presents a comprehensive study of the model, encompassing algorithm design, optimization guarantees, and quantification of statistical errors. Specifically, we develop a two-stage algorithm that first recovers the continuous sample path from discrete samples and then estimates the parameters of the processes. We provide novel theoretical insights into the statistical error and linear convergence guarantee when the processes are \beta -mixing. Our analysis is based on the truncation of the latent posterior processes and demonstrates that the truncated processes approximate the true processes under mixing conditions. We apply this model to investigate the differences in resting-state brain networks between the ADHD group and normal controls, revealing differences in the transition rate matrices of the two groups.
[LG-96] Magnetic Field Data Calibration with Transformer Model Using Physical Constraints: A Scalable Method for Satellite Missions Illustrated by Tianwen-1
Link: https://arxiv.org/abs/2501.00020
Authors: Beibei Li (Deep Space Exploration Laboratory), Yutian Chi (Deep Space Exploration Laboratory), Yuming Wang (Deep Space Exploration Laboratory and School of Earth and Space Sciences, University of Science and Technology of China)
Subjects: Space Physics (physics.space-ph); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:
Abstract:This study introduces a novel approach that integrates the magnetic field data correction from the Tianwen-1 Mars mission with a neural network architecture constrained by physical principles derived from Maxwell's equations. By employing a Transformer-based model capable of efficiently handling sequential data, the method corrects measurement anomalies caused by satellite dynamics, instrument interference, and environmental noise. As a result, it significantly improves both the accuracy and the physical consistency of the calibrated data. Compared to traditional methods, which require long data segments and manual intervention and often take weeks or even months to complete, this new approach can finish calibration in just minutes to hours, and predictions are made within seconds. This innovation not only accelerates the process of space weather modeling and planetary magnetospheric studies but also provides a robust framework for future planetary exploration and solar wind interaction research.
[LG-97] Energy-Efficient Sampling Using Stochastic Magnetic Tunnel Junctions
Link: https://arxiv.org/abs/2501.00015
Authors: Nicolas Alder, Shivam Nitin Kajale, Milin Tunsiricharoengul, Deblina Sarkar, Ralf Herbrich
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 10 pages, 7 figures, preprint
Abstract:(Pseudo)random sampling, a costly yet widely used method in (probabilistic) machine learning and Markov Chain Monte Carlo algorithms, remains unfeasible on a truly large scale due to unmet computational requirements. We introduce an energy-efficient algorithm for uniform Float16 sampling, utilizing a room-temperature stochastic magnetic tunnel junction device to generate truly random floating-point numbers. By avoiding expensive symbolic computation and mapping physical phenomena directly to the statistical properties of the floating-point format and uniform distribution, our approach achieves a higher level of energy efficiency than the state-of-the-art Mersenne-Twister algorithm by a minimum factor of 9721 and an improvement factor of 5649 compared to the more energy-efficient PCG algorithm. Building on this sampling technique and hardware framework, we decompose arbitrary distributions into many non-overlapping approximative uniform distributions along with convolution and prior-likelihood operations, which allows us to sample from any 1D distribution without closed-form solutions. We provide measurements of the potential accumulated approximation errors, demonstrating the effectiveness of our method.
[LG-98] Machine learning models for Si nanoparticle growth in nonthermal plasma
Link: https://arxiv.org/abs/2501.00003
Authors: Matt Raymond, Paolo Elvati, Jacob C. Saldinger, Jonathan Lin, Xuetao Shi, Angela Violi
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments: 9 pages, 8 figures, 10 SI figures, 8 SI tables, 2 SI algorithms
Abstract:Nanoparticles (NPs) formed in nonthermal plasmas (NTPs) can have unique properties and applications. However, modeling their growth in these environments presents significant challenges due to the non-equilibrium nature of NTPs, making them computationally expensive to describe. In this work, we address the challenges associated with accelerating the estimation of parameters needed for these models. Specifically, we explore how different machine learning models can be tailored to improve prediction outcomes. We apply these methods to reactive classical molecular dynamics data, which capture the processes associated with colliding silane fragments in NTPs. These reactions exemplify processes where qualitative trends are clear, but their quantification is challenging, hard to generalize, and requires time-consuming simulations. Our results demonstrate that good prediction performance can be achieved when appropriate loss functions are implemented and correct invariances are imposed. While the diversity of molecules used in the training set is critical for accurate prediction, our findings indicate that only a fraction (15-25%) of the energy and temperature sampling is required to achieve high levels of accuracy. This suggests a substantial reduction in computational effort is possible for similar systems.
Information Retrieval
[IR-0] On the Robustness of Cover Version Identification Models: A Study Using Cover Versions from YouTube
Link: https://arxiv.org/abs/2501.01333
Authors: Simon Hachmeier, Robert Jäschke
Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: accepted for presentation at iConference 2025
Abstract:Recent advances in cover song identification have shown great success. However, models are usually tested on a fixed set of datasets that rely on the online cover song database SecondHandSongs. It is unclear how well models perform on cover songs on online video platforms, which might exhibit unexpected alterations. In this paper, we annotate a subset of songs from YouTube sampled by a multi-modal uncertainty sampling approach and evaluate state-of-the-art models. We find that existing models achieve significantly lower ranking performance on our dataset compared to a community dataset. We additionally measure the performance of different types of versions (e.g., instrumental versions) and find several types that are particularly hard to rank. Lastly, we provide a taxonomy of alterations in cover versions on the web.
[IR-1] Search Plurality
Link: https://arxiv.org/abs/2501.00987
Authors: Shiran Dudy
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:In light of Phillips’ contention regarding the impracticality of Search Neutrality, asserting that non-epistemic factors presently dictate result prioritization, our objective in this study is to confront this constraint by questioning prevailing design practices in search engines. We posit that the concept of prioritization warrants scrutiny, along with the consistent hierarchical ordering that underlies this lack of neutrality. We introduce the term Search Plurality to encapsulate the idea of emphasizing the various means a query can be approached. This is demonstrated in a design that prioritizes the display of categories over specific search items, helping users grasp the breadth of their search. Whether a query allows for multiple interpretations or invites diverse opinions, the presentation of categories highlights the significance of organizing data based on relevance, importance, and relative significance, akin to traditional methods. However, unlike previous approaches, this method enriches our comprehension of the overall information landscape, countering the potential bias introduced by ranked lists.
[IR-2] S-Diff: An Anisotropic Diffusion Model for Collaborative Filtering in Spectral Domain WSDM2025
Link: https://arxiv.org/abs/2501.00384
Authors: Rui Xia,Yanhua Cheng,Yongxiang Tang,Xiaocheng Liu,Xialong Liu,Lisong Wang,Peng Jiang
Categories: Information Retrieval (cs.IR)
Comments: Accepted by WSDM 2025
Abstract:Recovering user preferences from user-item interaction matrices is a key challenge in recommender systems. While diffusion models can sample and reconstruct preferences from latent distributions, they often fail to capture similar users’ collective preferences effectively. Additionally, latent variables degrade into pure Gaussian noise during the forward process, lowering the signal-to-noise ratio, which in turn degrades performance. To address this, we propose S-Diff, inspired by graph-based collaborative filtering, to better utilize low-frequency components in the graph spectral domain. S-Diff maps user interaction vectors into the spectral domain and parameterizes diffusion noise to align with graph frequency. This anisotropic diffusion retains significant low-frequency components, preserving a high signal-to-noise ratio. S-Diff further employs a conditional denoising network to encode user interactions, recovering true preferences from noisy data. This method achieves strong results across multiple datasets.
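The idea of mapping interaction vectors into a graph spectral domain and adding frequency-dependent noise can be illustrated with a small sketch. This is not the S-Diff implementation: the item-item co-occurrence graph, the normalized Laplacian, the linear noise schedule, and the names `item_graph_basis`, `anisotropic_noise`, `base`, and `slope` are all assumptions chosen for illustration. It requires NumPy.

```python
import numpy as np

def item_graph_basis(interactions):
    """Eigen-decompose the normalized Laplacian of an item-item graph.

    Assumption: the graph is built from co-occurrence counts; the paper
    may construct its graph differently.
    """
    r = np.asarray(interactions, dtype=float)
    adj = r.T @ r                       # item-item co-occurrence counts
    np.fill_diagonal(adj, 0.0)          # drop self-loops
    deg = adj.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order: low graph frequencies first
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals, eigvecs

def anisotropic_noise(x_spec, eigvals, t, rng, base=0.05, slope=1.0):
    """Add Gaussian noise whose scale grows with graph frequency.

    Hypothetical schedule: sigma_k = t * (base + slope * lambda_k / lambda_max),
    so low-frequency components receive the least noise.
    """
    lam_max = max(float(eigvals.max()), 1e-12)
    sigma = t * (base + slope * eigvals / lam_max)
    noisy = x_spec + sigma * rng.standard_normal(x_spec.shape)
    return noisy, sigma

# Toy data: 4 users x 5 items, binary interactions (made up for the demo)
interactions = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
], dtype=float)

eigvals, eigvecs = item_graph_basis(interactions)
x_spec = eigvecs.T @ interactions[0]            # user 0 in the spectral domain
noisy_spec, sigma = anisotropic_noise(x_spec, eigvals, t=0.5,
                                      rng=np.random.default_rng(0))
x_noisy = eigvecs @ noisy_spec                  # back to item space
```

Because the eigenvector basis is orthonormal, projecting a vector and mapping it back is lossless before noise is added; the only information loss comes from the noise itself, which in this sketch is concentrated in the high-frequency components.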
[IR-3] Who Gets Recommended? Investigating Gender, Race, and Country Disparities in Paper Recommendations from Large Language Models
Link: https://arxiv.org/abs/2501.00367
Authors: Yifan Tian,Yixin Liu,Yi Bu,Jiqun Liu
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Abstract:This paper investigates the performance of several representative large models in the tasks of literature recommendation and explores potential biases in research exposure. The results indicate not only that LLMs’ overall recommendation accuracy remains limited but also that the models tend to recommend literature with greater citation counts, later publication dates, and larger author teams. Yet, in scholar recommendation tasks, there is no evidence that LLMs disproportionately recommend male, white, or developed-country authors, contrasting with patterns of known human biases.
[IR-4] Crime Hotspot Analysis and Mapping Using Geospatial Technology in Dessie City, Ethiopia
Link: https://arxiv.org/abs/2501.00036
Authors: H.A.Kebede,M.M.Assen,M.A.Sharew
Categories: Physics and Society (physics.soc-ph); Information Retrieval (cs.IR)
Abstract:Over the past few decades, crime and delinquency rates have increased drastically in many countries; nevertheless, it is important to note that crime trends can differ significantly by geographic region. This study’s primary goal was to use geographic technology to map and analyze Dessie City’s crime patterns. To investigate the geographic clustering of crime, the researchers used semivariogram modeling and spatial autocorrelation analysis with Moran’s I. The neighborhoods of Hote, Arada, and Segno in Dessie’s central city were found to be crime-prone “hot spot” locations, as evidenced by statistically significant high Z-scores ranging from 0.037 to 4.608. On the other hand, low negative Z-scores ranging from -3.231 to -0.116 indicated “cold spot” concentrations of crime in the city’s north-central sub-cities of Menafesha and Bounbouwha. With an index of 0.027492 and a Z-score of 3.297616 (p < 0.01), the analysis overall showed a substantial positive spatial autocorrelation, suggesting a clustered pattern of crime in Dessie. The majority of crimes showed a north-south directionality, except for murder, which trended from northeast to southwest. The mean center of all crime types was found in the central Hote area. To address the complicated problem of rising crime rates in Dessie and other developing metropolitan areas, more focused and efficient enforcement techniques and resource deployment can be informed by the knowledge acquired from the geospatial analysis.
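For readers unfamiliar with the statistic, the global Moran’s I index and its Z-score can be computed as in the sketch below. It is a minimal illustration, not the authors’ workflow: the toy values, the chain-shaped neighbourhood graph, and the helper names `morans_i` and `chain_weights` are hypothetical, and the variance uses the analytic formula under the normality assumption, whereas GIS packages often use a randomization-based variance instead.

```python
import math

def chain_weights(n):
    """Binary contiguity weights for n areas arranged in a line (toy layout)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1.0
        w[i + 1][i] = 1.0
    return w

def morans_i(values, weights):
    """Return (I, E[I], z) for global Moran's I under the normality assumption."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)               # total weight
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))    # spatial cross-products
    i_stat = (n / s0) * (cross / sum(d * d for d in dev))
    expected = -1.0 / (n - 1)                           # E[I] with no autocorrelation
    s1 = 0.5 * sum((weights[i][j] + weights[j][i]) ** 2
                   for i in range(n) for j in range(n))
    s2 = sum((sum(weights[i]) + sum(weights[j][i] for j in range(n))) ** 2
             for i in range(n))
    variance = ((n * n * s1 - n * s2 + 3 * s0 * s0)
                / ((n * n - 1) * s0 * s0)) - expected ** 2
    z = (i_stat - expected) / math.sqrt(variance)
    return i_stat, expected, z

# Made-up counts: high values cluster on one side, low values on the other
counts = [10, 9, 8, 2, 1, 0]
i_stat, expected, z = morans_i(counts, chain_weights(len(counts)))
print(f"I = {i_stat:.3f}, E[I] = {expected:.3f}, z = {z:.2f}")
```

On this toy input, I is 0.66 against an expected value of -0.2, and z is about 2.40. A positive I with a Z-score above roughly 1.96 points to clustering at the 5% level, which is the same reading the abstract applies to its reported index and Z-score.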