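This post lists the latest papers retrieved from arXiv.org on 2024-09-26. It updates automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 11:00 each morning.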

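Overview (2024-09-26)

466 papers were updated today, including:

  • Natural language processing: 74 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 135 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 120 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 140 papers (Machine Learning (cs.LG))

Natural Language Processing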

[NLP-0] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

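[Quick read]: This paper addresses the fact that the strongest open-weight multimodal models rely heavily on synthetic data from proprietary VLMs, which leaves the community without foundational knowledge of how to build performant vision-language models (VLMs) from scratch. The key to the solution is the Molmo family of models, whose core innovation is a highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. The paper also introduces a diverse fine-tuning data mixture that includes in-the-wild QA and innovative 2D pointing data. Its success rests on careful model architecture choices, a well-tuned training pipeline, and the quality of the newly collected datasets, all of which will be released. The 72B model in the Molmo family not only leads among open-weight, open-data models but also compares favorably with proprietary systems such as GPT-4o, Claude 3.5, and Gemini 1.5.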

Link: https://arxiv.org/abs/2409.17146
Authors: Matt Deitke,Christopher Clark,Sangho Lee,Rohun Tripathi,Yue Yang,Jae Sung Park,Mohammadreza Salehi,Niklas Muennighoff,Kyle Lo,Luca Soldaini,Jiasen Lu,Taira Anderson,Erin Bransom,Kiana Ehsani,Huong Ngo,YenSung Chen,Ajay Patel,Mark Yatskar,Chris Callison-Burch,Andrew Head,Rose Hendrix,Favyen Bastani,Eli VanderBilt,Nathan Lambert,Yvonne Chou,Arnavi Chheda,Jenna Sparks,Sam Skjonsberg,Michael Schmitz,Aaron Sarnat,Byron Bischoff,Pete Walsh,Chris Newell,Piper Wolters,Tanmay Gupta,Kuo-Hao Zeng,Jon Borchardt,Dirk Groeneveld,Jen Dumas,Crystal Nam,Sophie Lebrecht,Caitlin Wittlif,Carissa Schoenick,Oscar Michel,Ranjay Krishna,Luca Weihs,Noah A. Smith,Hannaneh Hajishirzi,Ross Girshick,Ali Farhadi,Aniruddha Kembhavi
Keywords (EN): Today most advanced, advanced multimodal models, multimodal models remain, advanced multimodal, models remain proprietary
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Today’s most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild QA and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at this https URL.

[NLP-1] FineZip : Pushing the Limits of Large Language Models for Practical Lossless Text Compression

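[Quick read]: This paper addresses why modern large language models (LLMs) are impractical for real-world text compression systems. The key to the solution is FineZip, a new LLM-based text compression system that combines online memorization and dynamic context to cut compression time substantially. Concretely, FineZip compresses 10 MB of text in about 4 hours instead of the 9.5 days LLMZip needs, a 54-fold speedup, and improves compression ratios by roughly 50% over traditional algorithms. The work is a first step toward practical lossless text compression with LLMs.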

Link: https://arxiv.org/abs/2409.17141
Authors: Fazal Mittu,Yihuan Bu,Akshat Gupta,Ashok Devireddy,Alp Eren Ozdarendeli,Anant Singh,Gopala Anumanchipalli
Keywords (EN): language modeling objective, text compression, compression, text compression systems, practical text compression
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While the language modeling objective has been shown to be deeply connected with compression, it is surprising that modern LLMs are not employed in practical text compression systems. In this paper, we provide an in-depth analysis of neural network and transformer-based compression techniques to answer this question. We compare traditional text compression systems with neural network and LLM-based text compression methods. Although LLM-based systems significantly outperform conventional compression methods, they are highly impractical. Specifically, LLMZip, a recent text compression system using Llama3-8B requires 9.5 days to compress just 10 MB of text, although with huge improvements in compression ratios. To overcome this, we present FineZip - a novel LLM-based text compression system that combines ideas of online memorization and dynamic context to reduce the compression time immensely. FineZip can compress the above corpus in approximately 4 hours compared to 9.5 days, a 54 times improvement over LLMZip and comparable performance. FineZip outperforms traditional algorithmic compression methods with a large margin, improving compression ratios by approximately 50%. With this work, we take the first step towards making lossless text compression with LLMs a reality. While FineZip presents a significant step in that direction, LLMs are still not a viable solution for large-scale text compression. We hope our work paves the way for future research and innovation to solve this problem.
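To make the link between language modeling and compression concrete, here is a minimal sketch of rank-based coding, one common way to turn a predictive model into a lossless compressor. It is not FineZip's implementation: a toy adaptive byte-level bigram model (`BigramModel`, a hypothetical name) stands in for the LLM, and `zlib` stands in for the final entropy coder. Both substitutions are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


class BigramModel:
    """Toy adaptive byte-level bigram model (an assumed stand-in for an LLM)."""

    def __init__(self):
        # counts[prev][nxt] = how often byte `nxt` has followed byte `prev` so far
        self.counts = defaultdict(lambda: [0] * 256)

    def ranking(self, prev):
        # Most frequent successor first; ties are broken by byte value.
        c = self.counts[prev]
        return sorted(range(256), key=lambda b: (-c[b], b))

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1


def compress(data: bytes) -> bytes:
    model, prev, ranks = BigramModel(), 0, bytearray()
    for b in data:
        ranks.append(model.ranking(prev).index(b))  # rank of the true byte
        model.update(prev, b)
        prev = b
    # Well-predicted text yields mostly small ranks, which zlib packs tightly.
    return zlib.compress(bytes(ranks), 9)


def decompress(blob: bytes) -> bytes:
    model, prev, out = BigramModel(), 0, bytearray()
    for r in zlib.decompress(blob):
        b = model.ranking(prev)[r]  # invert the rank using the same model state
        model.update(prev, b)
        out.append(b)
        prev = b
    return bytes(out)


text = b"the cat sat on the mat. " * 40
blob = compress(text)
assert decompress(blob) == text  # lossless round trip
```

The decoder rebuilds exactly the same model state as the encoder, which is what makes the scheme lossless. As a sanity check on the reported numbers: 9.5 days is 228 hours, and 228 / 54 ≈ 4.2 hours, which agrees with "approximately 4 hours".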

[NLP-2] Assessing the Level of Toxicity Against Distinct Groups in Bangla Social Media Comments: A Comprehensive Investigation

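[Quick read]: This paper addresses the identification of toxic comments on Bangla social media that target specific groups: transgender people, indigenous people, and migrants. The key to the solution is a manually annotated dataset covering three toxicity levels (high, medium, low), classified with pre-trained transformer models (Bangla-BERT, bangla-bert-base, distil-BERT, and Bert-base-multilingual-cased). Evaluated with accuracy, recall, precision, and F1-score, Bangla-BERT performs best, reaching an F1-score of 0.8903.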

Link: https://arxiv.org/abs/2409.17130
Authors: Mukaffi Bin Moin,Pronay Debnath,Usafa Akther Rifa,Rijeet Bin Anis
Keywords (EN): Social media platforms, modern world, serving as conduits, conduits for communication, exchange of ideas
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in “18th International Conference on Information Technology and Applications (ICITA 2024)”

Abstract:Social media platforms have a vital role in the modern world, serving as conduits for communication, the exchange of ideas, and the establishment of networks. However, the misuse of these platforms through toxic comments, which can range from offensive remarks to hate speech, is a concerning issue. This study focuses on identifying toxic comments in the Bengali language targeting three specific groups: transgender people, indigenous people, and migrant people, from multiple social media sources. The study delves into the intricate process of identifying and categorizing toxic language while considering the varying degrees of toxicity: high, medium, and low. The methodology involves creating a dataset, manual annotation, and employing pre-trained transformer models like Bangla-BERT, bangla-bert-base, distil-BERT, and Bert-base-multilingual-cased for classification. Diverse assessment metrics such as accuracy, recall, precision, and F1-score are employed to evaluate the model’s effectiveness. The experimental findings reveal that Bangla-BERT surpasses alternative models, achieving an F1-score of 0.8903. This research exposes the complexity of toxicity in Bangla social media dialogues, revealing its differing impacts on diverse demographic groups.
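For readers unfamiliar with the metrics named in the abstract, the sketch below computes accuracy and per-class precision, recall, and F1 for a three-level toxicity label set. The labels and predictions are made-up toy data, and macro-averaging the F1-score is an assumption; the abstract does not say which averaging scheme the reported 0.8903 uses.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, {label: (precision, recall, f1)}, macro_f1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[lab] = (precision, recall, f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro average: every class counts equally, regardless of its size.
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)
    return accuracy, per_class, macro_f1


# Toy gold labels and predictions (illustrative only, not from the paper).
y_true = ["high", "high", "medium", "low", "low", "medium"]
y_pred = ["high", "medium", "medium", "low", "low", "low"]
accuracy, per_class, macro_f1 = evaluate(y_true, y_pred)
print(round(accuracy, 4), round(macro_f1, 4))  # 0.6667 0.6556
```

Macro averaging is a reasonable default when the toxicity levels are imbalanced, because a model cannot score well by handling only the majority class.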

[NLP-3] Deep Learning and Machine Learning Advancing Big Data Analytics and Management: Handy Appetizer

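[Quick read]: This book surveys how artificial intelligence, machine learning, and deep learning are applied to big data analytics and management. Its central approach is to simplify the complex mathematics behind deep learning, using intuitive visualizations and practical case studies to explain how neural networks and related techniques such as convolutional neural networks (CNNs) work. It introduces several classic models and technologies (Transformers, GPT, ResNet, BERT, and YOLO), stresses the role of pre-trained models in improving performance and accuracy, and gives guidance on applying them in real-world scenarios. It also outlines big data management technologies (SQL and NoSQL databases) and distributed computing frameworks (Apache Hadoop and Spark), emphasizing their role in processing and storing massive datasets.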

Link: https://arxiv.org/abs/2409.17120
Authors: Benji Peng,Xuanhe Pan,Yizhu Wen,Ziqian Bi,Keyu Chen,Ming Li,Ming Liu,Qian Niu,Junyu Liu,Jinlang Wang,Sen Zhang,Jiawei Xu,Pohsun Feng
Keywords (EN): Artificial Intelligence, Convolutional Neural Networks, Deep Learning, Machine Learning, role of Artificial
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This book contains 93 pages and 60 figures

Abstract:This book explores the role of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) in driving the progress of big data analytics and management. The book focuses on simplifying the complex mathematical concepts behind deep learning, offering intuitive visualizations and practical case studies to help readers understand how neural networks and technologies like Convolutional Neural Networks (CNNs) work. It introduces several classic models and technologies such as Transformers, GPT, ResNet, BERT, and YOLO, highlighting their applications in fields like natural language processing, image recognition, and autonomous driving. The book also emphasizes the importance of pre-trained models and how they can enhance model performance and accuracy, with instructions on how to apply these models in various real-world scenarios. Additionally, it provides an overview of key big data management technologies like SQL and NoSQL databases, as well as distributed computing frameworks such as Apache Hadoop and Spark, explaining their importance in managing and processing vast amounts of data. Ultimately, the book underscores the value of mastering deep learning and big data management skills as critical tools for the future workforce, making it an essential resource for both beginners and experienced professionals.

[NLP-4] Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

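[Quick read]: This paper addresses a limitation of conventional LLM pre-training, which relies on human experts to write heuristic rules for improving corpus quality; such rules are inflexible and cannot be tailored to each example. The key to the solution is Programming Every Example (ProX), a framework that treats data refinement as a programming task: the model generates and executes fine-grained operations, such as string normalization, to refine each individual example at scale. Experiments show that models pre-trained on ProX-refined data outperform those trained on the original data or on data filtered by other selection methods across multiple downstream benchmarks, and that ProX yields substantial efficiency gains in domain-specific continual pre-training.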

Link: https://arxiv.org/abs/2409.17115
Authors: Fan Zhou,Zengzhi Wang,Qian Liu,Junlong Li,Pengfei Liu
Keywords (EN): numerous rules developed, Large language model, Large language, resulting in numerous, developed to date
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 13 figures, 34 tables

Abstract:Large language model pre-training has traditionally relied on human experts to craft heuristics for improving the corpora quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual example effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform either original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens to be comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with 100B corpus, models, and sharing all training and implementation details for reproducible research and future innovation. Code: this https URL
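The idea of treating data refinement as a programming task can be sketched as a tiny interpreter: a per-document program is a list of operations, and an executor applies them in order. The operation names (`normalize`, `remove_lines`, `drop_doc`), their parameters, and the `execute` helper are hypothetical and chosen for illustration; they are not taken from the ProX code base. In ProX a small language model would generate the program, whereas here it is written by hand.

```python
def normalize(lines, source, target):
    """Replace every occurrence of `source` with `target` in each line."""
    return [ln.replace(source, target) for ln in lines]


def remove_lines(lines, start, end):
    """Drop the lines whose 0-based index lies in [start, end], inclusive."""
    return lines[:start] + lines[end + 1:]


OPS = {"normalize": normalize, "remove_lines": remove_lines}


def execute(program, document):
    """Apply a list of (op_name, kwargs) steps to one document.

    Returns the refined text, or None if the program drops the document.
    """
    lines = document.split("\n")
    for op, kwargs in program:
        if op == "drop_doc":
            return None
        lines = OPS[op](lines, **kwargs)
    return "\n".join(lines)


doc = ("Home | Login | Share\n"
       "ProX treats  refinement as programming.\n"
       "Click here to subscribe!")

# Hand-written program; a model would emit one such program per document.
program = [
    ("remove_lines", {"start": 2, "end": 2}),   # trailing call-to-action
    ("remove_lines", {"start": 0, "end": 0}),   # navigation bar
    ("normalize", {"source": "  ", "target": " "}),
]
refined = execute(program, doc)
print(refined)  # ProX treats refinement as programming.
```

Lines are removed from the bottom up so that earlier indices stay valid, a detail any executor for index-based edits has to respect.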

[NLP-5] Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

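[Quick read]: This paper asks whether large vision-language models (VLMs) can learn new visuospatial tasks in context purely from visual demonstrations. Its key contribution is a new benchmark, Spatial Visual Ambiguity Tasks (SVAT), together with a curriculum learning approach that adds simpler data to training. The study finds that VLMs fail these tasks zero-shot and sometimes still fail after fine-tuning, but that adding simpler data via curriculum learning improves in-context learning (ICL) performance.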

Link: https://arxiv.org/abs/2409.17080
Authors: Bowen Zhao,Leo Parker Dirac,Paulina Varshavskaya
Keywords (EN): Large vision-language models, popular adaptation strategy, computer vision tasks, Large vision-language, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures. Code released at this https URL

Abstract:Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.

[NLP-6] Enhancing Post-Hoc Attributions in Long Document Comprehension via Coarse Grained Answer Decomposition

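[Quick read]: This paper addresses accurate attribution of answer text in long-document question answering, in particular how to map the information units in an answer to specific content in the source document at a fine granularity. The key to the solution is a template-based in-context learning method for factual decomposition of answers; it uses the question and integrates negative sampling during few-shot in-context learning, which improves semantic understanding of both abstractive and extractive answers and thereby the accuracy and granularity of attribution.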

Link: https://arxiv.org/abs/2409.17073
Authors: Pritika Ramu,Koustava Goswami,Apoorv Saxena,Balaji Vasan Srinivavsan
Keywords (EN): Accurately attributing answer, Accurately attributing, reliable question-answering system, crucial for developing, developing a reliable
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Accurately attributing answer text to its source document is crucial for developing a reliable question-answering system. However, attribution for long documents remains largely unexplored. Post-hoc attribution systems are designed to map answer text back to the source document, yet the granularity of this mapping has not been addressed. Furthermore, a critical question arises: What precisely should be attributed, with an emphasis on identifying the information units within an answer that necessitate grounding? In this paper, we propose and investigate a novel approach to the factual decomposition of generated answers for attribution, employing template-based in-context learning. To accomplish this, we utilize the question and integrate negative sampling during few-shot in-context learning for decomposition. This approach enhances the semantic understanding of both abstractive and extractive answers. We examine the impact of answer decomposition by providing a thorough examination of various attribution approaches, ranging from retrieval-based techniques to LLM-based attributors.
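The decompose-then-attribute pipeline described above can be illustrated with a deliberately simple sketch. Two assumptions replace the paper's components: a sentence splitter stands in for the template-based LLM decomposition, and Jaccard token overlap stands in for a retrieval-based attributor. The function names and the example text are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(answer):
    # Assumed stand-in for LLM-based decomposition: split on sentence ends.
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", answer) if u.strip()]


def attribute(answer, source_sentences):
    """Map each answer unit to the index of its best-matching source sentence."""
    results = []
    for unit in decompose(answer):
        unit_tokens = tokens(unit)

        def jaccard(sentence):
            sent_tokens = tokens(sentence)
            union = unit_tokens | sent_tokens
            return len(unit_tokens & sent_tokens) / len(union) if union else 0.0

        best = max(range(len(source_sentences)),
                   key=lambda i: jaccard(source_sentences[i]))
        results.append((unit, best))
    return results


source = [
    "The bridge opened in 1932.",
    "It spans 1,149 metres across the harbour.",
    "Tolls were introduced to pay for construction.",
]
answer = "The bridge opened in 1932. It is 1,149 metres long."
for unit, idx in attribute(answer, source):
    print(f"{unit!r} -> source sentence {idx}")
```

Attributing each unit separately is the point of the decomposition step: the whole answer matches no single source sentence well, while each unit has a clear best match.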

[NLP-7] Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia

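[Quick read]: This paper addresses the time-consuming nature of doctor-patient interactions in Puskesmas. Doctors must diagnose, give treatment advice, and write detailed medical records, and linguistic diversity prolongs consultations further. The key to the solution is a localized large language model (LLM) pipeline: the Whisper model transcribes the conversation in real time, and GPT-3 summarizes it into the ePuskemas medical record format. Implemented as an add-on to an existing browser extension, the system lets doctors fill out patient forms while talking, which shortens turnaround time for patient care and produces more detailed and insightful records. The innovation aims to ease overcrowding and administrative burden in Indonesian healthcare facilities, helping doctors save time, provide better care, and produce more accurate medical records.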

Link: https://arxiv.org/abs/2409.17054
Authors: Azmul Asmar Irfan,Nur Ahmad Khatim,Mansur M. Arief
Keywords (EN): key issues contributing, inefficiency in Puskesmas, key issues, issues contributing, contributing to inefficiency
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:One of the key issues contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors need to conduct thorough consultations, which include diagnosing the patient’s condition, providing treatment advice, and transcribing detailed notes into medical records. In regions with diverse linguistic backgrounds, doctors often have to ask clarifying questions, further prolonging the process. While diagnosing is essential, transcription and summarization can often be automated using AI to improve time efficiency and help doctors enhance care quality and enable early diagnosis and intervention. This paper proposes a solution using a localized large language model (LLM) to transcribe, translate, and summarize doctor-patient conversations. We utilize the Whisper model for transcription and GPT-3 to summarize them into the ePuskemas medical records format. This system is implemented as an add-on to an existing web browser extension, allowing doctors to fill out patient forms while talking. By leveraging this solution for real-time transcription, translation, and summarization, doctors can improve the turnaround time for patient care while enhancing the quality of records, which become more detailed and insightful for future visits. This innovation addresses challenges like overcrowded facilities and the administrative burden on healthcare providers in Indonesia. We believe this solution will help doctors save time, provide better care, and produce more accurate medical records, representing a significant step toward modernizing healthcare and ensuring patients receive timely, high-quality care, even in resource-constrained settings.

[NLP-8] Detecting Temporal Ambiguity in Questions EMNLP2024

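[Quick read]: This paper addresses the detection and answering of temporally ambiguous questions in open-domain question answering. The key to the solution is TEMPAMBIQA, a manually annotated dataset of 8,162 open-domain questions that focuses on capturing temporal ambiguity. The paper proposes diverse search strategies based on disambiguated versions of the questions, and also introduces competitive non-search baselines that detect temporal ambiguity with zero-shot and few-shot approaches.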

Link: https://arxiv.org/abs/2409.17046
Authors: Bhawna Piryani,Abdelrahman Abdallah,Jamshid Mozafari,Adam Jatowt
Keywords (EN): answering ambiguous questions, ambiguous questions, Temporally ambiguous questions, open-domain question answering, Temporally ambiguous
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Findings

Abstract:Detecting and answering ambiguous questions has been a challenging task in open-domain question answering. Ambiguous questions have different answers depending on their interpretation and can take diverse forms. Temporally ambiguous questions are one of the most common types of such questions. In this paper, we introduce TEMPAMBIQA, a manually annotated temporally ambiguous QA dataset consisting of 8,162 open-domain questions derived from existing datasets. Our annotations focus on capturing temporal ambiguity to study the task of detecting temporally ambiguous questions. We propose a novel approach by using diverse search strategies based on disambiguated versions of the questions. We also introduce and test non-search, competitive baselines for detecting temporal ambiguity using zero-shot and few-shot approaches.

[NLP-9] How to Connect Speech Foundation Models and Large Language Models ? What Matters and What Does Not

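[Quick read]: This paper examines, for speech-to-text (S2T) tasks, how much downstream performance depends on each component: the speech foundation model (SFM), the adapter module, and the large language model (LLM). It evaluates combinations of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on automatic speech recognition and speech translation. The results show that the SFM plays a pivotal role in downstream performance, while the adapter choice has a moderate impact that depends on the chosen SFM and LLM.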

Link: https://arxiv.org/abs/2409.17044
Authors: Francesco Verdini,Pierfrancesco Melucci,Stefano Perna,Francesco Cariaggi,Marco Gaido,Sara Papi,Szymon Mazurek,Marek Kasztelnik,Luisa Bentivogli,Sébastien Bratières,Paolo Merialdo,Simone Scardapane
Keywords (EN): Large Language Models, Large Language, Speech Foundational Model, driven research efforts, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundational Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.
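The SFM, adapter, and LLM arrangement can be pictured with a shape-level sketch. The adapter below stacks consecutive encoder frames to shorten the sequence and then applies a linear projection into the LLM embedding dimension. This is a generic design chosen for illustration and is not claimed to be one of the five adapters the paper evaluates; the dimensions, random weights, and function names are all assumptions.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, shortening the sequence k-fold.

    Trailing frames that do not fill a complete group are dropped.
    """
    return [sum(frames[i:i + k], []) for i in range(0, len(frames) - k + 1, k)]


def linear(x, weight, bias):
    """Compute y = W x + b for a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def adapter(frames, weight, bias, k):
    """Map SFM encoder frames to vectors in the LLM embedding space."""
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]


random.seed(0)
T, d_sfm, d_llm, k = 10, 4, 6, 2  # toy sizes, far smaller than real models

# Random stand-ins for SFM encoder outputs and for learned adapter weights.
frames = [[random.gauss(0, 1) for _ in range(d_sfm)] for _ in range(T)]
weight = [[random.gauss(0, 0.1) for _ in range(d_sfm * k)] for _ in range(d_llm)]
bias = [0.0] * d_llm

llm_inputs = adapter(frames, weight, bias, k)
print(len(llm_inputs), len(llm_inputs[0]))  # 5 6
```

Shortening the sequence matters because speech encoders emit many more frames per second than an LLM sees text tokens; the adapter has to bridge both the dimension and the length mismatch.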

[NLP-10] Counterfactual Token Generation in Large Language Models

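[Quick read]: This paper addresses the inability of large language models to reason about counterfactual alternatives to tokens they have already generated. The key to the solution is a causal model of token generation built on the Gumbel-Max structural causal model, which lets a large language model perform counterfactual token generation at almost no extra cost. The method needs no fine-tuning or prompt engineering and is simple to implement; experiments on Llama 3 8B-instruct demonstrate its potential in applications such as bias detection.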

Link: https://arxiv.org/abs/2409.17027
Authors: Ivi Chatzi,Nina Corvelo Benz,Eleni Straitouri,Stratis Tsirtsis,Manuel Gomez-Rodriguez
Keywords (EN): Maelstrom Fury, Captain Lyra stood, trusty ship, endless sea, Captain Lyra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:“Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom’s Fury, gazing out at the endless sea. […] Lyra’s eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself.” Although this story, generated by a large language model, is captivating, one may wonder – how would the story have unfolded if the model had chosen “Captain Maeve” as the protagonist instead? We cannot know. State-of-the-art large language models are stateless – they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-instruct and conduct both qualitative and quantitative analyses of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
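The Gumbel-Max trick behind this idea fits in a few lines: sampling a token is equivalent to adding independent Gumbel noise to the logits and taking the argmax, so storing the noise makes generation replayable under an intervention. The sketch below is a toy under stated assumptions, not the authors' implementation: `toy_logits` is a hypothetical stand-in for an LLM, the vocabulary has five tokens, and the intervention simply swaps the first token.

```python
import math
import random

VOCAB = ["Lyra", "Maeve", "sailed", "rested", "."]
NEG = float("-inf")  # marks tokens that the toy model never produces


def toy_logits(context):
    """Hypothetical stand-in for an LLM: next-token logits given the context."""
    if not context:
        return [1.0, 1.0, NEG, NEG, NEG]
    if context[-1] == "Lyra":
        return [NEG, NEG, 2.0, 0.0, NEG]
    if context[-1] == "Maeve":
        return [NEG, NEG, 0.0, 2.0, NEG]
    return [NEG, NEG, NEG, NEG, 0.0]


def sample_gumbel(rng, n):
    """Draw n standard Gumbel variates, clamped to avoid log(0)."""
    out = []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        out.append(-math.log(-math.log(u)))
    return out


def generate(noise, forced_prefix=()):
    """Gumbel-Max decoding: argmax of logits plus the stored noise per step."""
    tokens = list(forced_prefix)
    for step in range(len(tokens), len(noise)):
        logits = toy_logits(tokens)
        idx = max(range(len(VOCAB)), key=lambda i: logits[i] + noise[step][i])
        tokens.append(VOCAB[idx])
    return tokens


rng = random.Random(0)
noise = [sample_gumbel(rng, len(VOCAB)) for _ in range(3)]

factual = generate(noise)
# Intervention: swap the protagonist, then replay with the SAME noise.
other = "Maeve" if factual[0] == "Lyra" else "Lyra"
counterfactual = generate(noise, forced_prefix=[other])
print(factual, counterfactual)
```

Because the noise is shared, any difference between the two sequences is attributable to the intervention rather than to fresh randomness, which is what distinguishes a counterfactual from simply sampling again.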

[NLP-11] LLM-CARD: Towards a Description and Landscape of Large Language Models

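[Quick read]: This paper addresses information overload about large language models (LLMs) in natural language processing. The key to the solution is an automated system that uses named entity recognition (NER) and relation extraction (RE) to extract key information about LLMs from academic papers, such as the model licence, model name, and model application, and assembles it into model cards so that researchers can access information about LLMs efficiently.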

Link: https://arxiv.org/abs/2409.17011
Authors: Shengwei Tian,Lifeng Han,Erick Mendez Guzman,Goran Nenadic
Keywords (EN): Natural Language Processing, diverse NLP tasks, NLP tasks, diverse NLP, Language Processing
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: ongoing work, 16 pages

Abstract:With the rapid growth of the Natural Language Processing (NLP) field, a vast variety of Large Language Models (LLMs) continue to emerge for diverse NLP tasks. As an increasing number of papers are presented, researchers and developers face the challenge of information overload. Thus, it is particularly important to develop a system that can automatically extract and organise key information about LLMs from academic papers (LLM model card). This work is to develop such a pioneer system by using Named Entity Recognition (NER) and Relation Extraction (RE) methods that automatically extract key information about large language models from the papers, helping researchers to efficiently access information about LLMs. These features include model licence, model name, and model application. With these features, we can form a model card for each paper. Data-contribution wise, 106 academic papers were processed by defining three dictionaries - LLMs name, licence, and application. 11,051 sentences were extracted through dictionary lookup, and the dataset was constructed through manual review of the final selection of 129 sentences that have a link between the name and the licence, and 106 sentences that have a link between the model name and the application.
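The dictionary-lookup step described in the abstract can be sketched as follows. The three dictionaries hold a few made-up entries, and the co-occurrence filter (a model name plus a licence or an application in the same sentence) is an assumption about how candidate sentences might be shortlisted before manual review; the paper's actual dictionaries and selection criteria are not reproduced here.

```python
import re

# Hypothetical dictionary entries, for illustration only.
NAMES = {"bert", "llama 2", "gpt-3"}
LICENCES = {"apache 2.0", "cc-by-4.0"}
APPLICATIONS = {"summarization", "machine translation"}


def find_terms(sentence, dictionary):
    """Return the dictionary terms that occur as whole phrases in the sentence."""
    s = sentence.lower()
    return sorted(
        t for t in dictionary
        if re.search(r"(?<!\w)" + re.escape(t) + r"(?!\w)", s)
    )


def candidate_sentences(text):
    """Keep sentences that mention a model name with a licence or application."""
    candidates = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        names = find_terms(sent, NAMES)
        licences = find_terms(sent, LICENCES)
        applications = find_terms(sent, APPLICATIONS)
        if names and (licences or applications):
            candidates.append({
                "sentence": sent,
                "names": names,
                "licences": licences,
                "applications": applications,
            })
    return candidates


text = ("BERT is released under the Apache 2.0 licence. "
        "We fine-tune Llama 2 for summarization. "
        "The weather was nice.")
cards = candidate_sentences(text)
for card in cards:
    print(card["names"], card["licences"], card["applications"])
```

Co-occurrence only suggests a link between a name and a licence or application; confirming it is the job of the manual review (and, in the paper, of the relation extraction step).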

[NLP-12] Models Can and Should Embrace the Communicative Nature of Human-Generated Math

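[Quick read]: This paper asks how to better understand and handle the communicative intentions in mathematical writing, rather than treating math purely as symbolic entities. The key idea is to view math as situated linguistic communication and to use language models to capture and represent these latent intentions. Through two case studies, on how language models interpret the equals sign and on their preference for naturalistic proof orderings, the paper argues that AI systems should learn from and reflect the communicative intentions in human-generated math.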

Link: https://arxiv.org/abs/2409.17005
Authors: Sasha Boguraev,Ben Lipkin,Leonie Weissweiler,Kyle Mahowald
Keywords (EN): idealized mathematical entities, language corpora reflect, natural language corpora, corpora reflect, idealized mathematical
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Math is constructed by people for people: just as natural language corpora reflect not just propositions but the communicative goals of language users, the math data that models are trained on reflects not just idealized mathematical entities but rich communicative intentions. While there are important advantages to treating math in a purely symbolic manner, we here hypothesize that there are benefits to treating math as situated linguistic communication and that language models are well suited for this goal, in ways that are not fully appreciated. We illustrate these points with two case studies. First, we ran an experiment in which we found that language models interpret the equals sign in a humanlike way – generating systematically different word problems for the same underlying equation arranged in different ways. Second, we found that language models prefer proofs to be ordered in naturalistic ways, even though other orders would be logically equivalent. We advocate for AI systems that learn from and represent the communicative intentions latent in human-generated math.

[NLP-13] AXCEL: Automated eXplainable Consistency Evaluation using LLMs

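Quick read: The paper addresses consistency evaluation of text generated by large language models (LLMs). Traditional metrics such as ROUGE and BLEU correlate weakly with human judgment, while metrics based on Natural Language Inference (NLI) are complex to implement and lack explainability. The proposed solution, AXCEL (Automated eXplainable Consistency Evaluation using LLMs), is a prompt-based consistency metric that explains its scores with detailed reasoning and pinpoints inconsistent text spans, and it can be applied to multiple tasks without changing the prompt. AXCEL outperforms existing non-prompt and prompt-based SOTA metrics on summarization, free text generation, and data-to-text conversion.

Link: https://arxiv.org/abs/2409.16984
Authors: P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan
Keywords: Large Language Models, Large Language, Language Models, Natural Language Inference, generated text responses
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: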

Abstract:Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.

[NLP-14] Decoding Large-Language Models: A Systematic Overview of Socio-Technical Impacts, Constraints, and Emerging Questions

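Quick read: The paper systematically surveys the development, impacts, and limitations of large language models (LLMs) and identifies the main themes and future directions of LLM research. It analyses the aims, methodologies, limitations, and future directions of that research, covering responsible development considerations, algorithmic improvements, ethical challenges, and societal implications. The review also highlights application areas that could have a positive impact on society, along with the associated ethical considerations.

Link: https://arxiv.org/abs/2409.16974
Authors: Zeyneb N. Kaya, Souvick Ghosh
Keywords: large language models, natural language processing, language models, language processing, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures, preprint submitted to journal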

Abstract:There have been rapid advancements in the capabilities of large language models (LLMs) in recent years, greatly revolutionizing the field of natural language processing (NLP) and artificial intelligence (AI) to understand and interact with human language. Therefore, in this work, we conduct a systematic investigation of the literature to identify the prominent themes and directions of LLM developments, impacts, and limitations. Our findings illustrate the aims, methodologies, limitations, and future directions of LLM research. It includes responsible development considerations, algorithmic improvements, ethical challenges, and societal implications of LLM development. Overall, this paper provides a rigorous and comprehensive overview of current research in LLM and identifies potential directions for future development. The article highlights the application areas that could have a positive impact on society along with the ethical considerations.

[NLP-15] Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

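Quick read: The paper tackles the personalization of large language models (LLMs) to individual user preferences, particularly in on-device applications. It proposes Adaptive Self-Supervised Learning Strategies (ASLS), which use self-supervised learning to personalize LLMs dynamically. The framework comprises a user profiling layer that collects interaction data and a neural adaptation layer for real-time model fine-tuning. By learning continuously from user feedback, the model generates responses that fit user-specific contexts, and the authors report lower computational demands and more efficient personalization.

Link: https://arxiv.org/abs/2409.16973
Authors: Rafael Mendoza, Isabella Cruz, Richard Liu, Aarav Deshmukh, David Williams, Jesscia Peng, Rohan Iyer
Keywords: Large language models, Large language, interact with technology, significant challenge, individual user preferences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: First ASLS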

Abstract:Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.

[NLP-16] Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition INTERSPEECH2024

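Quick read: The paper addresses the integration of low-resource languages into multilingual automatic speech recognition (ASR) systems. It introduces a novel application of weighted cross-entropy: the pre-trained Whisper multilingual ASR model is fine-tuned with language-weighted dynamic cross-entropy and data augmentation. For the low-resource language, this reduces the word error rate (WER) by 6.69% compared with fine-tuning without the approach and by 48.86% compared with the original Whisper model, with no degradation for the high-resource languages.

Link: https://arxiv.org/abs/2409.16954
Authors: Andrés Piñeiro-Martín, Carmen García-Mateo, Laura Docío-Fernández, María del Carmen López-Pérez, Georg Rehm
Keywords: automatic speech recognition, multilingual automatic speech, Whisper multilingual ASR, integrating low-resource languages, multilingual ASR models
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure. Presented at Interspeech 2024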

Abstract:This paper addresses the challenge of integrating low-resource languages into multilingual automatic speech recognition (ASR) systems. We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models within the context of continual multilingual learning. We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language, employing language-weighted dynamic cross-entropy and data augmentation. The results show a remarkable 6.69% word error rate (WER) reduction for the low-resource language compared to the fine-tuned model without applying our approach, and a 48.86% WER reduction compared to the original Whisper model. In addition, our approach yields an average WER reduction of 3.29% across the six languages, showing no degradation for the high-resource languages.
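
To make the idea of language-weighted cross-entropy concrete, here is a minimal sketch in plain Python. It assumes fixed per-language weights and a weighted mean over the batch; the paper's dynamic weighting schedule is not specified in the abstract and is not reproduced. The language codes `"high"` and `"low"`, the weight values, and the function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def language_weighted_cross_entropy(batch, language_weights):
    """Weighted mean of per-example cross-entropy.

    batch: list of (logits, target_index, language) tuples.
    language_weights: dict mapping a language code to a positive weight;
    languages missing from the dict get weight 1.0.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, target, language in batch:
        weight = language_weights.get(language, 1.0)
        loss = -math.log(softmax(logits)[target])
        weighted_sum += weight * loss
        weight_total += weight
    return weighted_sum / weight_total


# Made-up logits: the "low" example is predicted poorly, so its loss is larger.
batch = [
    ([2.0, 0.5, 0.1], 0, "high"),
    ([0.2, 0.3, 0.1], 2, "low"),
]
weighted = language_weighted_cross_entropy(batch, {"high": 1.0, "low": 3.0})
unweighted = language_weighted_cross_entropy(batch, {"high": 1.0, "low": 1.0})
print(weighted, unweighted)
```

Raising the weight of the low-resource language makes its errors count for more in the batch loss, which is the mechanism the weighting relies on.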

[NLP-17] Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents

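Quick read: The paper investigates OCR-sensitive neurons in the Transformer architecture and their effect on named entity recognition (NER) in historical documents. By analysing neuron activation patterns on clean and noisy text inputs, the authors identify and then neutralise OCR-sensitive neurons, which improves NER on noisy text such as historical newspapers and classical commentaries. Experiments with two large language models, Llama2 and Mistral, confirm that OCR-sensitive regions exist and show the potential of targeted neuron modulation.

Link: https://arxiv.org/abs/2409.16934
Authors: Emanuela Boros, Maud Ehrmann
Keywords: named entity recognition, Transformer architecture, entity recognition, paper investigates, investigates the presence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: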

Abstract:This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models’ performance on noisy text.
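
The identify-then-neutralise procedure can be illustrated with a toy example. This sketch assumes that sensitivity is measured as the absolute shift in mean activation between clean and noisy inputs and that neutralising means zeroing the activation; the paper may use different criteria. All numbers and function names are made up for the illustration.

```python
def mean_activation(samples):
    """Average activation per neuron over a list of activation vectors."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def ocr_sensitive_neurons(clean, noisy, top_k):
    """Indices of the top_k neurons whose mean activation shifts most under noise."""
    clean_mean = mean_activation(clean)
    noisy_mean = mean_activation(noisy)
    shift = [abs(n - c) for c, n in zip(clean_mean, noisy_mean)]
    ranked = sorted(range(len(shift)), key=lambda i: shift[i], reverse=True)
    return sorted(ranked[:top_k])


def neutralise(activations, neuron_ids):
    """Return a copy of one activation vector with the chosen neurons set to zero."""
    blocked = set(neuron_ids)
    return [0.0 if i in blocked else a for i, a in enumerate(activations)]


# Made-up activations of 4 neurons on clean text and on the same text with OCR noise.
clean = [[0.1, 0.9, 0.3, 0.5], [0.2, 1.0, 0.3, 0.4]]
noisy = [[0.1, 0.2, 0.3, 1.6], [0.3, 0.1, 0.2, 1.5]]

sensitive = ocr_sensitive_neurons(clean, noisy, top_k=2)
print(sensitive)
print(neutralise(noisy[0], sensitive))
```

In a real model the activation vectors would come from hooks on a chosen layer, and the effect of neutralising would be judged by the change in NER scores on noisy text.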

[NLP-18] Zero-Shot Detection of LLM-Generated Text using Token Cohesiveness EMNLP2024

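Quick read: The paper targets automatic detection of text generated by large language models (LLMs), specifically the improvement of zero-shot detectors. The key is a new feature, token cohesiveness: LLM-generated text tends to show higher token cohesiveness than human-written text. Building on this observation, the authors propose TOCSIN, a generic dual-channel detection paradigm that uses token cohesiveness as a plug-and-play module for existing zero-shot detectors. TOCSIN computes token cohesiveness from a few rounds of random token deletion and semantic difference measurement, so it suits practical black-box settings where the source model is not accessible. Experiments across datasets, source models, and evaluation settings show that the approach is effective and general.

Link: https://arxiv.org/abs/2409.16914
Authors: Shixuan Ma, Quan Wang
Keywords: large language models, highlight the desirability, increasing capability, capability and widespread, widespread usage
Categories: Computation and Language (cs.CL)
Comments: To appear at the main conference of EMNLP 2024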

Abstract:The increasing capability and widespread usage of large language models (LLMs) highlight the desirability of automatic detection of LLM-generated text. Zero-shot detectors, due to their training-free nature, have received considerable attention and notable success. In this paper, we identify a new feature, token cohesiveness, that is useful for zero-shot detection, and we demonstrate that LLM-generated text tends to exhibit higher token cohesiveness than human-written text. Based on this observation, we devise TOCSIN, a generic dual-channel detection paradigm that uses token cohesiveness as a plug-and-play module to improve existing zero-shot detectors. To calculate token cohesiveness, TOCSIN only requires a few rounds of random token deletion and semantic difference measurement, making it particularly suitable for a practical black-box setting where the source model used for generation is not accessible. Extensive experiments with four state-of-the-art base detectors on various datasets, source models, and evaluation settings demonstrate the effectiveness and generality of the proposed approach. Code available at: this https URL.
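
A rough sketch of the "random token deletion plus semantic difference" computation follows. Token cohesiveness is read here as the average difference between a text and copies of it with a few tokens removed. The difference measure is a deliberately simple stand-in (one minus the Jaccard overlap of token bigrams), not the semantic metric used by TOCSIN, and the parameter names `rounds`, `delete_ratio`, and `seed` are assumptions.

```python
import random


def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))


def toy_semantic_difference(original, perturbed):
    """Stand-in for a learned semantic metric: 1 - Jaccard overlap of token bigrams."""
    a, b = bigrams(original), bigrams(perturbed)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def token_cohesiveness(text, rounds=10, delete_ratio=0.2, seed=0):
    """Mean difference between a text and copies with random tokens deleted."""
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    rng = random.Random(seed)
    n_delete = max(1, int(len(tokens) * delete_ratio))
    total = 0.0
    for _ in range(rounds):
        dropped = set(rng.sample(range(len(tokens)), n_delete))
        kept = [t for i, t in enumerate(tokens) if i not in dropped]
        total += toy_semantic_difference(tokens, kept)
    return total / rounds


sample_text = "the quick brown fox jumps over the lazy dog near the river bank"
score = token_cohesiveness(sample_text)
print(round(score, 3))
```

In a detector of this kind, the score would be combined with an existing zero-shot detector's output; how the two channels are combined is not shown here.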

[NLP-19] Pruning Multilingual Large Language Models for Multilingual Inference EMNLP2024

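Quick read: The paper addresses the weaker zero-shot performance of multilingual large language models (MLLMs) in non-English languages. The key is to exploit the models' high-quality translation ability between English and non-English languages: the authors analyse model behaviour during translation, retain the weights associated with the large magnitude features that play a critical role in it, and prune the other weights. This forces the model to rely on those features in tasks beyond translation, which improves its performance in non-English languages.

Link: https://arxiv.org/abs/2409.16911
Authors: Hwichan Kim, Jun Suzuki, Tosho Hirasawa, Mamoru Komachi
Keywords: multilingual balanced data, English-dominant data, trained on English-dominant, language models trained, large language models
Categories: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2024 Findings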

Abstract:Multilingual large language models (MLLMs), trained on multilingual balanced data, demonstrate better zero-shot learning performance in non-English languages compared to large language models trained on English-dominant data. However, the disparity in performance between English and non-English languages remains a challenge yet to be fully addressed. A distinctive characteristic of MLLMs is their high-quality translation capabilities, indicating an acquired proficiency in aligning between languages. This study explores how to enhance the zero-shot performance of MLLMs in non-English languages by leveraging their alignment capability between English and non-English languages. To achieve this, we first analyze the behavior of MLLMs when performing translation and reveal that there are large magnitude features that play a critical role in the translation process. Inspired by these findings, we retain the weights associated with operations involving the large magnitude features and prune other weights to force MLLMs to rely on these features for tasks beyond translation. We empirically demonstrate that this pruning strategy can enhance the MLLMs’ performance in non-English language.
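
The retain-and-prune idea can be sketched on a single weight matrix. This is an illustration under stated assumptions, not the authors' procedure: features are ranked by mean absolute activation, weights that read from the top-ranked features are always kept, and every other weight is pruned by a plain magnitude threshold. The threshold rule and all names are assumptions.

```python
def large_magnitude_features(activations, top_k):
    """Indices of the top_k input features with the largest mean absolute activation."""
    n_features = len(activations[0])
    magnitude = [
        sum(abs(row[j]) for row in activations) / len(activations)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: magnitude[j], reverse=True)
    return set(ranked[:top_k])


def prune_weights(weights, keep_features, threshold):
    """Zero small weights, except in columns that read from a kept feature.

    weights[i][j] multiplies input feature j when computing output unit i.
    """
    return [
        [w if (j in keep_features or abs(w) >= threshold) else 0.0
         for j, w in enumerate(row)]
        for row in weights
    ]


# Made-up activations recorded during translation (rows are tokens, columns are features).
activations = [[0.1, 8.0, -0.2], [0.3, -9.5, 0.1]]
weights = [[0.05, 0.02, 0.9], [-0.4, 0.01, 0.03]]

keep = large_magnitude_features(activations, top_k=1)
pruned = prune_weights(weights, keep, threshold=0.1)
print(keep, pruned)
```

Feature 1 dominates the activations, so its column survives even though both of its weights are below the threshold.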

[NLP-20] Enhancing Temporal Sensitivity and Reasoning for Time-Sensitive Question Answering EMNLP2024

Quick read: The paper addresses the limited sensitivity to temporal information and the limited temporal reasoning of current large language models (LLMs) in Time-Sensitive Question Answering (TSQA). It proposes a framework that strengthens both through Temporal Information-Aware Embedding and Granular Contrastive Reinforcement Learning. On four TSQA datasets the framework significantly outperforms existing LLMs, narrowing the gap between machine and human temporal understanding and reasoning.

Link: https://arxiv.org/abs/2409.16909
Authors: Wanqi Yang, Yanda Li, Meng Fang, Ling Chen
Keywords: Time-Sensitive Question Answering, address time-sensitive questions, encompassing multiple time-evolving, specific temporal contexts, multiple time-evolving facts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Findings

Abstract:Time-Sensitive Question Answering (TSQA) demands the effective utilization of specific temporal contexts, encompassing multiple time-evolving facts, to address time-sensitive questions. This necessitates not only the parsing of temporal information within questions but also the identification and understanding of time-evolving facts to generate accurate answers. However, current large language models still have limited sensitivity to temporal information and inadequate temporal reasoning. In this paper, we propose a novel framework that enhances temporal awareness and reasoning through Temporal Information-Aware Embedding and Granular Contrastive Reinforcement Learning. Experimental results on four TSQA datasets demonstrate that our framework significantly outperforms existing LLMs in TSQA tasks, marking a step forward in bridging the performance gap between machine and human temporal understanding and reasoning.

[NLP-21] A Roadmap for Embodied and Social Grounding in LLMs

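Quick read: The paper addresses the grounding of knowledge in large language models (LLMs) used in robotic systems, that is, how LLMs can come to understand the meaning of the language they manipulate. Taking inspiration from human cognition, it names three necessary elements: 1) an active bodily system as the reference point for experiencing the environment; 2) temporally structured experience for coherent, self-related interaction with the external world; 3) social skills for acquiring a common-grounded shared experience. Together these form a roadmap for grounding LLMs in robotic systems.

Link: https://arxiv.org/abs/2409.16900
Authors: Sara Incao, Carlo Mazzola, Giulia Belgiovine, Alessandra Sciutti
Keywords: Large Language Models, offering unparalleled capabilities, multimodal input handling, fusion of Large, Language Models
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted Version of a conference paper presented at Robophilosophy Conference 2024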

Abstract:The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in the robotic field, offering unparalleled capabilities not only in the communication domain but also in skills like multimodal input handling, high-level reasoning, and plan generation. The grounding of LLMs knowledge into the empirical world has been considered a crucial pathway to exploit the efficiency of LLMs in robotics. Nevertheless, connecting LLMs’ representations to the external world with multimodal approaches or with robots’ bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three necessary elements for an agent to grasp and experience the world. The roadmap for LLMs grounding is envisaged in an active bodily system as the reference point for experiencing the environment, a temporally structured experience for a coherent, self-related interaction with the external world, and social skills to acquire a common-grounded shared experience.

[NLP-22] Robotic Backchanneling in Online Conversation Facilitation: A Cross-Generational Study

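Quick read: The paper responds to cognitive decline and the shortage of caregivers in Japan's aging society by studying a socially capable robot that facilitates group conversation. The key modification is backchannelling, a natural human way of speaking, intended to make the robot more acceptable and the group conversation more enjoyable. In a cross-generational study, younger participants perceived the backchannelling robot as kinder, more trustworthy, and more acceptable than the non-backchannelling one, and the robot's backchannelling elicited nonverbal backchannelling from older participants.

Link: https://arxiv.org/abs/2409.16899
Authors: Sota Kobuki, Katie Seaborn, Seiki Tokunaga, Kosuke Fukumori, Shun Hidaka, Kazuhiro Tamura, Koji Inoue, Tatsuya Kawahara, Mihoko Otake-Mastuura
Keywords: including increasing rates, Japan faces, aging society, including increasing, shortage of caregivers
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Published at Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2023)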

Abstract:Japan faces many challenges related to its aging society, including increasing rates of cognitive decline in the population and a shortage of caregivers. Efforts have begun to explore solutions using artificial intelligence (AI), especially socially embodied intelligent agents and robots that can communicate with people. Yet, there has been little research on the compatibility of these agents with older adults in various everyday situations. To this end, we conducted a user study to evaluate a robot that functions as a facilitator for a group conversation protocol designed to prevent cognitive decline. We modified the robot to use backchannelling, a natural human way of speaking, to increase receptiveness of the robot and enjoyment of the group conversation experience. We conducted a cross-generational study with young adults and older adults. Qualitative analyses indicated that younger adults perceived the backchannelling version of the robot as kinder, more trustworthy, and more acceptable than the non-backchannelling robot. Finally, we found that the robot’s backchannelling elicited nonverbal backchanneling in older participants.

[NLP-23] Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification

Quick read: The paper addresses text classification for Hawrami, a Kurdish dialect classified as endangered because of data scarcity and the gradual loss of its speakers. Using a dataset of 6,854 Hawrami articles labeled into 15 categories, the authors evaluate K-nearest Neighbor, Linear Support Vector Machine, Logistic Regression, and Decision Tree classifiers. Linear SVM performs best, reaching 96% accuracy.

Link: https://arxiv.org/abs/2409.16884
Authors: Aram Khaksar, Hossein Hassani
Keywords: Kurdish, Natural Language Processing, gradual loss, Language Processing projects, Central Kurdish
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 7 tables, 14 figures

Abstract:Hawrami, a dialect of Kurdish, is classified as an endangered language as it suffers from the scarcity of data and the gradual loss of its speakers. Natural Language Processing projects can be used to partially compensate for data availability for endangered languages/dialects through a variety of approaches, such as machine translation, language model building, and corpora development. Similarly, NLP projects such as text classification are in language documentation. Several text classification studies have been conducted for Kurdish, but they were mainly dedicated to two particular dialects: Sorani (Central Kurdish) and Kurmanji (Northern Kurdish). In this paper, we introduce various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers. We use K-nearest Neighbor (KNN), Linear Support Vector Machine (Linear SVM), Logistic Regression (LR), and Decision Tree (DT) to evaluate how well those methods perform the classification task. The results indicate that the Linear SVM achieves a 96% of accuracy and outperforms the other approaches.
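
As a small illustration of one of the compared classifiers, the sketch below implements K-nearest Neighbor text classification over bag-of-words vectors with cosine similarity. The feature representation, the value of `k`, and the English placeholder snippets are assumptions; the Hawrami dataset and the authors' exact setup are not reproduced.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words term counts for a whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(train, text, k=3):
    """Majority label among the k training texts most similar to `text`."""
    query = vectorise(text)
    ranked = sorted(train, key=lambda item: cosine(vectorise(item[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Share of test texts whose predicted label matches the gold label."""
    correct = sum(knn_predict(train, text, k) == label for text, label in test)
    return correct / len(test)


# Placeholder English snippets standing in for labelled articles.
train = [
    ("the team won the match", "sport"),
    ("the striker scored a goal", "sport"),
    ("players trained before the match", "sport"),
    ("the poet published a new book", "literature"),
    ("a novel and a poem were read", "literature"),
    ("the book of poems won praise", "literature"),
]
test = [("the goal decided the match", "sport"),
        ("she read a book of poems", "literature")]
result = accuracy(train, test)
print(result)
```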

[NLP-24] The Role of Language Models in Modern Healthcare: A Comprehensive Review

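Quick read: The paper reviews the application of large language models (LLMs) in healthcare and its potential impact, focusing on their ability to process complex medical data, support clinical decision-making, and improve patient interaction. The key is to use the models' natural language understanding and generation capabilities while addressing data privacy, bias, and ethical concerns, so that they can be integrated into medical practice ethically and effectively.

Link: https://arxiv.org/abs/2409.16860
Authors: Amna Khalid, Ayma Khalid, Umar Khalid
Keywords: gained significant attention, significant attention due, process complex medical, large language models, clinical decision-making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: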

Abstract:The application of large language models (LLMs) in healthcare has gained significant attention due to their ability to process complex medical data and provide insights for clinical decision-making. These models have demonstrated substantial capabilities in understanding and generating natural language, which is crucial for medical documentation, diagnostics, and patient interaction. This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs, highlighting their strengths in healthcare applications and discussing challenges such as data privacy, bias, and ethical considerations. The potential of LLMs to enhance healthcare delivery is explored, alongside the necessary steps to ensure their ethical and effective integration into medical practice.

[NLP-25] Exposing Assumptions in AI Benchmarks through Cognitive Modelling

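Quick read: The paper addresses implicit assumptions in cultural AI benchmarks, which lead to vague formulations, poor validity, and unclear interrelations. The key is to make those assumptions explicit with cognitive models formulated as Structural Equation Models. Using cross-lingual alignment transfer as an example, the authors show how the approach can answer key research questions and identify missing datasets, which grounds benchmark construction in theory and guides dataset development toward better construct measurement. By increasing transparency, the framework aims at a more rigorous and cumulative science of AI evaluation and asks researchers to examine the foundations of their assessments critically.

Link: https://arxiv.org/abs/2409.16849
Authors: Jonathan H. Rystrøm, Kenneth C. Enevoldsen
Keywords: Structural Equation Models, leading to vague, unclear interrelations, rely on implicit, vague formulations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures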

Abstract:Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations. We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models. Using cross-lingual alignment transfer as an example, we show how this approach can answer key research questions and identify missing datasets. This framework grounds benchmark construction theoretically and guides dataset development to improve construct measurement. By embracing transparency, we move towards more rigorous, cumulative AI evaluation science, challenging researchers to critically examine their assessment foundations.

[NLP-26] CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow ACL2024

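Quick read: The paper addresses dataset quality and model evaluation for code generation. It introduces a new dataset of 3,409 examples crafted by Python experts, each with a clarified intent, an associated code snippet, and an average of three unit tests. The dataset covers libraries such as Pandas, Numpy, and Regex along with more than 70 Python standard libraries, and it is designed for both model finetuning and standalone evaluation. Examples are categorized for finer-grained analysis of model strengths and weaknesses and refined to reduce data contamination, which the authors check against the performance of Mistral 7B, CodeLLaMa 13B, and Starcoder 15B.

Link: https://arxiv.org/abs/2409.16819
Authors: Nathanaël Beau, Benoît Crabbé
Keywords: aimed at aiding, aiding developers, developers in common, texttt, code generation
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted to ACL 2024 Findings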

Abstract:We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, code snippets associated, and an average of three related unit tests. It encompasses a range of libraries such as Pandas, Numpy, and Regex, along with more than 70 standard libraries in Python code derived from Stack Overflow. Comprising 3,409 crafted examples by Python experts, our dataset is designed for both model finetuning and standalone evaluation. To complete unit tests evaluation, we categorize examples in order to get more fine grained analysis, enhancing the understanding of models’ strengths and weaknesses in specific coding tasks. The examples have been refined to reduce data contamination, a process confirmed by the performance of three leading models: Mistral 7B, CodeLLaMa 13B, and Starcoder 15B. We further investigate data-contamination testing GPT-4 performance on a part of our dataset. The benchmark can be accessed at this https URL.
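
Unit-test-based evaluation of generated code, as the dataset is designed to support, might look like the sketch below. The example record (intent, code, tests) is invented to match the shape described in the abstract, and the helper name `passes_unit_tests` is hypothetical. The sketch uses `exec`, which should only be run on trusted code or inside a sandbox.

```python
def passes_unit_tests(candidate_code, unit_tests):
    """Run candidate code, then each assert-style test; return how many tests pass."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # only run code you trust
    except Exception:
        return 0
    passed = 0
    for test in unit_tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            # A failed assertion or a runtime error counts as a failed test.
            pass
    return passed


# Invented record in the shape described: an intent, a snippet, and three tests.
example = {
    "intent": "Return the unique items of a list while preserving order.",
    "code": "def unique(items):\n    return list(dict.fromkeys(items))\n",
    "tests": [
        "assert unique([1, 2, 1]) == [1, 2]",
        "assert unique([]) == []",
        "assert unique(['b', 'a', 'b']) == ['b', 'a']",
    ],
}
reference_passed = passes_unit_tests(example["code"], example["tests"])

# A plausible but wrong model output: sorting loses the original order.
wrong_code = "def unique(items):\n    return sorted(set(items))\n"
wrong_passed = passes_unit_tests(wrong_code, example["tests"])
print(reference_passed, wrong_passed)
```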

[NLP-27] A Few Hypocrites: Few-Shot Learning and Subtype Definitions for Detecting Hypocrisy Accusations in Online Climate Change Debates

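Quick read: The paper addresses automatic detection of hypocrisy accusations in online climate discussions and defines it as an independent NLP task. The authors build the Climate Hypocrisy Accusation Corpus (CHAC), 420 Reddit climate debate comments annotated by experts into two types: personal and political hypocrisy. They evaluate few-shot in-context learning with 6 shots and three instruction-tuned large language models (LLMs). GPT-4o and Llama-3 in particular show promise (F1 reaching 0.68, against 0.44 in previous work), but the models still struggle to identify political hypocrisy accusations, where context matters for such a complex semantic concept. The study offers new insights into hypocrisy detection and climate change discourse and lays groundwork for large-scale analysis of hypocrisy accusations in online climate debates.

Link: https://arxiv.org/abs/2409.16807
Authors: Paulina Garcia Corral, Avishai Green, Hendrik Meyer, Anke Stoll, Xiaoyue Yan, Myrthe Reuver
Keywords: hypocrisy accusations, central rhetorical element, hypocrisy accusation detection, hypocrisy, Hypocrisy Accusation Corpus
Categories: Computation and Language (cs.CL)
Comments: cite the public version, published at CPSS 2024 @ KONVENS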

Abstract:The climate crisis is a salient issue in online discussions, and hypocrisy accusations are a central rhetorical element in these debates. However, for large-scale text analysis, hypocrisy accusation detection is an understudied tool, most often defined as a smaller subtask of fallacious argument detection. In this paper, we define hypocrisy accusation detection as an independent task in NLP, and identify different relevant subtypes of hypocrisy accusations. Our Climate Hypocrisy Accusation Corpus (CHAC) consists of 420 Reddit climate debate comments, expert-annotated into two different types of hypocrisy accusations: personal versus political hypocrisy. We evaluate few-shot in-context learning with 6 shots and 3 instruction-tuned Large Language Models (LLMs) for detecting hypocrisy accusations in this dataset. Results indicate that the GPT-4o and Llama-3 models in particular show promise in detecting hypocrisy accusations (F1 reaching 0.68, while previous work shows F1 of 0.44). However, context matters for a complex semantic concept such as hypocrisy accusations, and we find models struggle especially at identifying political hypocrisy accusations compared to personal moral hypocrisy. Our study contributes new insights in hypocrisy detection and climate change discourse, and is a stepping stone for large-scale analysis of hypocrisy accusation in online climate debates.
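
The 6-shot setup and the F1 metric mentioned above can be sketched as follows. The prompt wording, the six example comments, and the label names are invented for illustration and are not taken from CHAC; the call to an actual LLM is left out, and the predictions in the demonstration are made up.

```python
def build_few_shot_prompt(instruction, shots, comment):
    """Concatenate an instruction, labelled examples, and the comment to classify."""
    lines = [instruction, ""]
    for text, label in shots:
        lines += [f"Comment: {text}", f"Label: {label}", ""]
    lines += [f"Comment: {comment}", "Label:"]
    return "\n".join(lines)


def f1_score(gold, predicted, positive):
    """F1 for one positive class from parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Six invented shots; real shots would be drawn from the annotated corpus.
shots = [
    ("You fly to every summit and then lecture us on emissions.", "personal"),
    ("He drives an SUV and tells me to cycle.", "personal"),
    ("The government bans straws while approving new coal mines.", "political"),
    ("That party talks green and subsidises oil.", "political"),
    ("Solar panels got cheaper again this year.", "none"),
    ("Does anyone have data on heat pump efficiency?", "none"),
]
prompt = build_few_shot_prompt(
    "Classify the hypocrisy accusation in each comment as personal, political, or none.",
    shots,
    "She owns three cars and protests against traffic.",
)
print(prompt)

# Made-up gold labels and predictions to show the metric.
gold = ["personal", "personal", "none", "none"]
predicted = ["personal", "none", "personal", "none"]
print(f1_score(gold, predicted, positive="personal"))
```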

[NLP-28] Mitigating the Bias of Large Language Model Evaluation

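[Quick read]: This paper addresses the bias of existing large language models (LLMs) used as judges: they tend to favor answers with better superficial quality (such as verbosity and fluency) while neglecting instruction-following ability. For closed-source judge models, the key is calibration that reduces the weight of superficial quality, applied at both the probability level and the prompt level. For open-source judge models, the key is contrastive training on curated negative samples that deviate from the instruction but have better superficial quality. Experiments show that these methods reduce bias by a large margin while maintaining satisfactory evaluation accuracy.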

Link: https://arxiv.org/abs/2409.16788
Authors: Hongli Zhou, Hui Huang, Yunfei Long, Bing Xu, Conghui Zhu, Hailong Cao, Muyun Yang, Tiejun Zhao
Keywords: Large Language Model, current output quality, Language Model, Large Language, leveraging another LLM
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently, there has been a trend of evaluating the Large Language Model (LLM) quality in the flavor of LLM-as-a-Judge, namely leveraging another LLM to evaluate the current output quality. However, existing judges are proven to be biased, namely they would favor answers which present better superficial quality (such as verbosity, fluency) while ignoring the instruction following ability. In this work, we propose systematic research about the bias of LLM-as-a-Judge. Specifically, for closed-source judge models, we apply calibration to mitigate the significance of superficial quality, both on probability level and prompt level. For open-source judge models, we propose to mitigate the bias by contrastive training, with curated negative samples that deviate from instruction but present better superficial quality. We apply our methods on the bias evaluation benchmark, and experiment results show our methods mitigate the bias by a large margin while maintaining a satisfactory evaluation accuracy.

[NLP-29] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction EMNLP2024

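[Quick read]: This paper addresses two main problems with existing automated red teaming methods for large language models (LLMs): incomplete test case coverage and failure to capture the dynamics of multi-turn interaction. The key to the solution is HARM (Holistic Automated Red teaMing), a framework that scales up test case diversity with a top-down approach based on an extensible, fine-grained risk taxonomy, and that uses a novel fine-tuning strategy and reinforcement learning to perform multi-turn adversarial probing in a human-like manner. This allows model vulnerabilities to be identified more systematically and gives more targeted guidance for the alignment process.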

Link: https://arxiv.org/abs/2409.16783
Authors: Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, Songlin Hu
Keywords: identifying misaligned behaviors, Automated red teaming, Holistic Automated Red, large language models, Automated red
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: EMNLP 2024 camera ready version

Abstract:Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine interactions. To overcome these limitations, we propose HARM (Holistic Automated Red teaMing), which scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance for the alignment process.

[NLP-30] E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL

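[Quick read]: This paper addresses challenges in translating natural language queries into SQL (Text-to-SQL), in particular handling complex database schemas, resolving ambiguity in user queries, and generating complex SQL queries that accurately reflect user intent. The key to the solution is the E-SQL pipeline, which enriches the natural language question through direct schema linking and candidate predicate augmentation: relevant database items (tables, columns, and values) and conditions are incorporated directly into the question, bridging the gap between the query and the database structure. E-SQL also uses candidate predicate augmentation to mitigate erroneous or incomplete predicates in generated SQL, and the experiments show that schema filtering yields diminishing returns when applied alongside advanced LLMs.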

Link: https://arxiv.org/abs/2409.16751
Authors: Hasan Alp Caferoğlu, Özgür Ulusoy
Keywords: Translating Natural Language, Structured Query Language, critical task extensively, task extensively studied, natural language processing
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Translating Natural Language Queries into Structured Query Language (Text-to-SQL or NLQ-to-SQL) is a critical task extensively studied by both the natural language processing and database communities, aimed at providing a natural language interface to databases (NLIDB) and lowering the barrier for non-experts. Despite recent advancements made through the use of Large Language Models (LLMs), significant challenges remain. These include handling complex database schemas, resolving ambiguity in user queries, and generating SQL queries with intricate structures that accurately reflect the user’s intent. In this work, we introduce E-SQL, a novel pipeline specifically designed to address these challenges through direct schema linking and candidate predicate augmentation. E-SQL enhances the natural language query by incorporating relevant database items (i.e., tables, columns, and values) and conditions directly into the question, bridging the gap between the query and the database structure. The pipeline leverages candidate predicate augmentation to mitigate erroneous or incomplete predicates in generated SQLs. We further investigate the impact of schema filtering, a technique widely explored in previous work, and demonstrate its diminishing returns when applied alongside advanced large language models. Comprehensive evaluations on the BIRD benchmark illustrate that E-SQL achieves competitive performance, particularly excelling in complex queries with a 66.29% execution accuracy on the test set. All code required to reproduce the reported results is publicly available on our GitHub repository.

[NLP-31] RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems

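[Quick read]: This paper addresses character hallucination in LLM-powered role-playing systems, where the model deviates from its predefined character and produces responses inconsistent with the intended persona. The key to the solution is the RoleBreak framework, which identifies two core mechanisms behind character hallucination: query sparsity and role-query conflict. On this basis, the paper builds the RoleBreakEval dataset to evaluate existing hallucination mitigation techniques and proposes a new defence strategy, Narrator Mode. This mode generates supplemental narrative context to mitigate role-query conflicts and improve query generalization, significantly reducing hallucinations and improving character fidelity and overall narrative coherence.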

Link: https://arxiv.org/abs/2409.16727
Authors: Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Jijun Zhang, Ruifang He, Yuexian Hou
Keywords: Role-playing systems powered, emotional communication applications, large language models, Role-playing systems, communication applications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms, query sparsity and role-query conflict, as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.

[NLP-32] PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning

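[Quick read]: This paper addresses two main challenges of low-rank adaptation (LoRA): the limitation of the low-rank assumption and a possibly suboptimal initialization method. The key to the solution is PMSS (Pre-trained Matrices Skeleton Selection), which selects skeletons from the pre-trained weight matrix and learns only a small matrix, enabling high-rank updates at low cost. PMSS leverages the semantic and linguistic information inherent in pre-trained weights, uses far fewer trainable parameters, and outperforms LoRA and other fine-tuning methods across tasks, especially on complex tasks such as the DROP benchmark and math reasoning.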

Link: https://arxiv.org/abs/2409.16722
Authors: Qibin Wang, Xiaolin Hu, Weikai Xu, Wei Liu, Jian Luan, Bin Wang
Keywords: avoid excessive inference, excessive inference costs, Low-rank adaptation, Matrices Skeleton Selection, variants have recently
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Low-rank adaptation (LoRA) and its variants have recently gained much interest due to their ability to avoid excessive inference costs. However, LoRA still encounters the following challenges: (1) Limitation of low-rank assumption; and (2) Its initialization method may be suboptimal. To this end, we propose PMSS (Pre-trained Matrices Skeleton Selection), which enables high-rank updates with low costs while leveraging semantic and linguistic information inherent in pre-trained weight. It achieves this by selecting skeletons from the pre-trained weight matrix and only learning a small matrix instead. Experiments demonstrate that PMSS outperforms LoRA and other fine-tuning methods across tasks with far fewer trainable parameters. We demonstrate its effectiveness, especially in handling complex tasks such as DROP benchmark (+3.4%/+5.9% on LLaMA2-7B/13B) and math reasoning (+12.89%/+5.61%/+3.11% on LLaMA2-7B, Mistral-7B and Gemma-7B of GSM8K). The code and model will be released soon.
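
The skeleton idea in the PMSS abstract can be pictured with a small sketch. It assumes a CUR-style update, ΔW = C·U·R, where C and R are columns and rows copied from the frozen pre-trained matrix and U is the only trainable part. Picking rows and columns by largest squared norm is an assumption made for illustration, not the paper's selection criterion, and every function name here is hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def top_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    return sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])

def select_skeleton(w, k):
    """Pick k columns and k rows of w (assumed criterion: largest squared norm)."""
    col_norms = [sum(row[j] ** 2 for row in w) for j in range(len(w[0]))]
    row_norms = [sum(v ** 2 for v in row) for row in w]
    cols = top_indices(col_norms, k)
    rows = top_indices(row_norms, k)
    c = [[row[j] for j in cols] for row in w]  # m x k, frozen
    r = [w[i][:] for i in rows]                # k x n, frozen
    return c, r

def delta_w(c, u, r):
    """Weight update built from frozen skeletons and the small trainable matrix u."""
    return matmul(matmul(c, u), r)

# Toy 4 x 3 "pre-trained" matrix (made-up numbers).
w = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
c, r = select_skeleton(w, 2)
u = [[0.0, 0.0], [0.0, 0.0]]  # zero init (an assumption), so the update starts at zero
update = delta_w(c, u, r)
trainable = len(u) * len(u[0])
```

In this toy setup only the 2 x 2 matrix `u` would be trained (4 values), compared with 12 values for the full 4 x 3 matrix.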

[NLP-33] Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP2024

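[Quick read]: This paper addresses the concern that fine-tuning the parameters of vision-language models (VLMs) on few-shot samples corrupts pre-trained knowledge. The key to the solution is ClipFit, which fine-tunes only specific bias terms and normalization layers rather than all parameters, improving zero-shot CLIP by 7.27% in average harmonic mean accuracy. The method introduces no extra parameters, and the paper analyzes experimentally how fine-tuning changes the model's internal parameters and representations.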

Link: https://arxiv.org/abs/2409.16718
Authors: Ming Li, Jike Zhong, Chenxin Li, Liuzhuozheng Li, Nie Lin, Masashi Sugiyama
Keywords: Recent advances, classic model fine-tuning, fine-tuning Vision-Language Models, prompt tuning, adapter tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: EMNLP 2024 Main Conference

Abstract:Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while the classic model fine-tuning on inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint, and propose a new perspective: fine-tuning the specific parameters instead of all will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose ClipFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by only fine-tuning the specific bias terms and normalization layers, ClipFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at this https URL.
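
The selective fine-tuning described in the ClipFit abstract amounts to choosing which parameters stay trainable. A minimal sketch, assuming parameters are addressed by dotted names as in common deep learning frameworks: the name patterns (`bias`, `ln`, `norm`) and the example parameter names are hypothetical, and the paper selects specific bias terms and normalization layers that this simple filter does not reproduce.

```python
def is_trainable(name):
    """Keep bias terms and anything inside a normalization layer trainable.

    The naming convention is an assumption for illustration only.
    """
    parts = name.lower().split(".")
    leaf = parts[-1]
    in_norm_layer = any(p.startswith(("ln", "norm", "layernorm")) for p in parts)
    return leaf == "bias" or in_norm_layer

# Hypothetical parameter names of a two-tower model.
param_names = [
    "text.block0.mlp.fc.weight",
    "text.block0.mlp.fc.bias",
    "visual.block0.ln_1.weight",
    "visual.block0.ln_1.bias",
    "visual.block0.attn.proj.weight",
]

trainable = [n for n in param_names if is_trainable(n)]
frozen = [n for n in param_names if not is_trainable(n)]
```

In a real training loop the same predicate would decide which parameters receive gradients. No new parameters are created, which matches the abstract's point about adding no parameter overhead.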

[NLP-34] Beyond Turing Test: Can GPT-4 Sway Experts' Decisions?

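[Quick read]: This paper examines how text generated by large language models (LLMs) affects readers' decisions, for both amateur and expert audiences. The key is to evaluate generated text through readers' reactions and decisions rather than only by its indistinguishability from human-written content, alongside ratings of grammar, convincingness, logical coherence, and usefulness. The study finds that GPT-4 can generate persuasive analyses that affect the decisions of both amateurs and professionals, and it highlights the value of evaluation based on reader reactions.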

Link: https://arxiv.org/abs/2409.16710
Authors: Takehiro Takayanagi, Hiroya Takamura, Kiyoshi Izumi, Chung-Chi Chen
Keywords: involves assessing generated, large language models, evaluating large language, post-Turing era, involves assessing
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Abstract:In the post-Turing era, evaluating large language models (LLMs) involves assessing generated text based on readers’ reactions rather than merely its indistinguishability from human-produced content. This paper explores how LLM-generated text impacts readers’ decisions, focusing on both amateur and expert audiences. Our findings indicate that GPT-4 can generate persuasive analyses affecting the decisions of both amateurs and professionals. Furthermore, we evaluate the generated text from the aspects of grammar, convincingness, logical coherence, and usefulness. The results highlight a high correlation between real-world evaluation through audience reactions and the current multi-dimensional evaluators commonly used for generative models. Overall, this paper shows the potential and risk of using generated text to sway human decisions and also points out a new direction for evaluating generated text, i.e., leveraging the reactions and decisions of readers. We release our dataset to assist future research.

[NLP-35] Probing Omissions and Distortions in Transformer-based RDF-to-Text Models ACL

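[Quick read]: This paper addresses the omission of important information in the output text of natural language generation (NLG) models. The key is to explore two probing methods: a parameter-free probe based on the cosine similarity between the embedding of an RDF graph and the embedding of the same graph with some entities removed, and a parametric probe that performs binary classification on encoder embeddings to detect omitted entities. The analysis is extended to distorted entities, that is, entities not mentioned fully correctly in the generated text. The results show that both omitted and distorted entities can be probed in the encoder's output embeddings, suggesting that the encoder emits a weaker signal for these entities and is responsible for some loss of information, and that probing can be used to detect mistakes in NLG output.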

Link: https://arxiv.org/abs/2409.16707
Authors: Juliette Faille, Albert Gatt, Claire Gardent
Keywords: Natural Language Generation, Natural Language, Language Generation, RDF graphs, encoder output
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in Transactions of the ACL (TACL)

Abstract:In Natural Language Generation (NLG), important information is sometimes omitted in the output text. To better understand and analyse how this type of mistake arises, we focus on RDF-to-Text generation and explore two methods of probing omissions in the encoder output of BART (Lewis et al, 2020) and of T5 (Raffel et al, 2019): (i) a novel parameter-free probing method based on the computation of cosine similarity between embeddings of RDF graphs and of RDF graphs in which we removed some entities and (ii) a parametric probe which performs binary classification on the encoder embeddings to detect omitted entities. We also extend our analysis to distorted entities, i.e. entities that are not fully correctly mentioned in the generated text (e.g. misspelling of entity, wrong units of measurement). We found that both omitted and distorted entities can be probed in the encoder’s output embeddings. This suggests that the encoder emits a weaker signal for these entities and therefore is responsible for some loss of information. This also shows that probing methods can be used to detect mistakes in the output of NLG models.
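
The parameter-free probe in the abstract compares the embedding of a full RDF graph with the embedding of the same graph after an entity is removed. A minimal sketch of that comparison, assuming toy per-entity vectors and mean pooling as a stand-in for the encoder: in the paper the embeddings come from the BART or T5 encoder, and the entity names and numbers below are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(vectors):
    """Average a list of vectors component-wise (stand-in for a graph embedding)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def probe(entity_vecs, removed):
    """Similarity between the full graph and the graph without `removed`."""
    full = mean_pool(list(entity_vecs.values()))
    reduced = mean_pool([v for k, v in entity_vecs.items() if k != removed])
    return cosine(full, reduced)

# Hypothetical toy vectors: entity_c contributes only a weak signal.
entity_vecs = {
    "entity_a": [1.0, 0.0, 0.0],
    "entity_b": [0.0, 1.0, 0.0],
    "entity_c": [0.1, 0.1, 0.0],
}
scores = {name: probe(entity_vecs, name) for name in entity_vecs}
```

In this toy setting, removing the weakly encoded `entity_c` barely moves the pooled embedding, so its similarity score stays close to 1. One reading of the abstract is that omitted entities behave in a similar way, but that interpretation is an assumption of this sketch.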

[NLP-36] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

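[Quick read]: This paper addresses the heavy memory and compute requirements that hinder practical deployment of large language models (LLMs). The key is low-bit quantization, which reduces the bit-width of model parameters, activations, and gradients and thereby lowers memory usage and computational demands. The paper surveys low-bit quantization for LLMs from the perspectives of basic principles, system implementations, and algorithmic strategies, covering new data formats, frameworks and system support, and techniques for efficient low-bit training and inference.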

Link: https://arxiv.org/abs/2409.16694
Authors: Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu
Keywords: Large language models, natural language processing, showcasing exceptional performance, Large language, achieved remarkable advancements
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Ruihao Gong leads the overall organization of the survey, with Yifu Ding and Jinyang Du contributing to Sections 2 and 3. Xingyu Zheng is responsible for authoring Section 4, while Chengtao Lv and Zining Wang collaborate on Section 5. Haotong Qin, Jinyang Guo, Michele Magno, and Xianglong Liu provide guidance during the whole process and assist in refining the final manuscript

Abstract:Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
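
To make "reducing the bit-width" concrete, here is a minimal sketch of symmetric uniform quantization, one textbook scheme among the many the survey covers. The per-tensor scale, the function names, and the toy weights are assumptions for illustration and are not taken from the survey.

```python
def quantize(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]

weights = [0.4, -1.0, 0.25, 0.0]   # toy values
q4, s4 = quantize(weights, bits=4)  # 4-bit codes lie in [-7, 7]
restored = dequantize(q4, s4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared floating-point scale, and the rounding error per weight is bounded by half the scale. Lower bit-widths shrink storage further at the cost of a larger scale and therefore a larger error.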

[NLP-37] MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making

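[Quick read]: This paper addresses how to use insights in long-term memory effectively, in particular how to avoid irrelevant insights and the lack of general insights. The key is the Multi-Scale Insight Agent (MSI-Agent), which uses a three-part pipeline of experience selector, insight generator, and insight selector to generate task-specific and high-level insights, store them in a database, and retrieve the relevant ones at decision time. Experiments show that MSI outperforms another insight strategy when planning with GPT3.5 and is more robust in domain-shifting scenarios.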

Link: https://arxiv.org/abs/2409.16686
Authors: Dayuan Fu, Biqing Qi, Yihuai Gao, Che Jiang, Guanting Dong, Bowen Zhou
Keywords: Long-term memory, insight, crucial role, memory is significant, play a crucial
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Long-term memory is significant for agents, in which insights play a crucial role. However, the emergence of irrelevant insight and the lack of general insight can greatly undermine the effectiveness of insight. To solve this problem, in this paper, we introduce Multi-Scale Insight Agent (MSI-Agent), an embodied agent designed to improve LLMs’ planning and decision-making ability by summarizing and utilizing insight effectively across different scales. MSI achieves this through the experience selector, insight generator, and insight selector. Leveraging a three-part pipeline, MSI can generate task-specific and high-level insight, store it in a database, and then use relevant insight from it to aid in decision-making. Our experiments show that MSI outperforms another insight strategy when planning by GPT3.5. Moreover, we delve into the strategies for selecting seed experience and insight, aiming to provide LLM with more useful and relevant insight for better decision-making. Our observations also indicate that MSI exhibits better robustness when facing domain-shifting scenarios.

[NLP-38] SynTQA: Synergistic Table-based Question Answering via Mixture of Text-to-SQL and E2E TQA EMNLP2024

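[Quick read]: This paper addresses how to combine the two main approaches to table-based question answering, Text-to-SQL parsing and end-to-end question answering (E2E TQA). The key is a synergistic approach that combines the strengths of different models through answer selection. With either a feature-based or an LLM-based answer selector, the ensemble significantly outperforms the individual models.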

Link: https://arxiv.org/abs/2409.16682
Authors: Siyue Zhang, Anh Tuan Luu, Chen Zhao
Keywords: Table-based Question Answering, Question Answering task, question answering, Synergistic Table-based Question, Table-based Question
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2024

Abstract:Text-to-SQL parsing and end-to-end question answering (E2E TQA) are two main approaches for the Table-based Question Answering task. Despite success on multiple benchmarks, they have yet to be compared and their synergy remains unexplored. In this paper, we identify different strengths and weaknesses through evaluating state-of-the-art models on benchmark datasets: Text-to-SQL demonstrates superiority in handling questions involving arithmetic operations and long tables; E2E TQA excels in addressing ambiguous questions, non-standard table schema, and complex table contents. To combine both strengths, we propose a Synergistic Table-based Question Answering approach that integrates different models via answer selection, which is agnostic to any model types. Further experiments validate that ensembling models by either feature-based or LLM-based answer selector significantly improves the performance over individual models.
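
Answer selection between the two model families can be pictured as a small routing function. The sketch below is a hand-written rule standing in for the feature-based selector mentioned in the abstract. It leans on the strengths the abstract lists (Text-to-SQL for arithmetic and long tables, E2E TQA otherwise). The keyword list, the row threshold, and the function name are hypothetical and are not the paper's features.

```python
ARITHMETIC_HINTS = ("sum", "total", "average", "difference", "how many")

def select_answer(question, num_rows, sql_answer, e2e_answer, long_table_rows=50):
    """Choose between a Text-to-SQL answer and an E2E TQA answer.

    sql_answer is None when the generated SQL failed to execute.
    """
    if sql_answer is None:
        return e2e_answer
    q = question.lower()
    needs_arithmetic = any(hint in q for hint in ARITHMETIC_HINTS)
    if needs_arithmetic or num_rows >= long_table_rows:
        return sql_answer
    return e2e_answer

# Toy usage with made-up answers.
a1 = select_answer("What is the total revenue?", 10, "120", "118")
a2 = select_answer("Which team is mentioned first?", 10, "Lions", "Tigers")
a3 = select_answer("Which team is mentioned first?", 200, "Lions", "Tigers")
a4 = select_answer("What is the total revenue?", 10, None, "118")
```

A learned selector would replace the hard-coded rule with a classifier over such features. The interface stays the same, which is what makes this kind of selection agnostic to the underlying model types.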

[NLP-39] SWE2: SubWord Enriched and Significant Word Emphasized Framework for Hate Speech Detection CIKM2020

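[Quick read]: This paper addresses hate speech detection on online social networks. The key is SWE2, a new detection framework that relies only on message content and identifies hate speech automatically. SWE2 combines word-level semantic information with sub-word knowledge. It performs well without adversarial attack and remains accurate and robust under extreme adversarial attack (50% of messages manipulated).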

Link: https://arxiv.org/abs/2409.16673
Authors: Guanyi Mou, Pengyi Ye, Kyumin Lee
Keywords: emerging hot topics, online social networks, Hate speech detection, Hate speech, hate speech makes
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in CIKM 2020

Abstract:Hate speech detection on online social networks has become one of the emerging hot topics in recent years. With the broad spread and fast propagation speed across online social networks, hate speech makes significant impacts on society by increasing prejudice and hurting people. Therefore, there are aroused attention and concern from both industry and academia. In this paper, we address the hate speech problem and propose a novel hate speech detection framework called SWE2, which only relies on the content of messages and automatically identifies hate speech. In particular, our framework exploits both word-level semantic information and sub-word knowledge. It is intuitively persuasive and also practically performs well under a situation with/without character-level adversarial attack. Experimental results show that our proposed model achieves 0.975 accuracy and 0.953 macro F1, outperforming 7 state-of-the-art baselines under no adversarial attack. Our model robustly and significantly performed well under extreme adversarial attack (manipulation of 50% messages), achieving 0.967 accuracy and 0.934 macro F1.

[NLP-40] Topic-aware Causal Intervention for Counterfactual Detection EMNLP

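[Quick read]: This paper addresses two problems of counterfactual detection (CFD) models: a significant performance drop when clue phrases are absent at test time, and a bias toward predicting non-counterfactuals over counterfactuals. The key is to integrate a neural topic model into the CFD model to capture the global semantics of the input statement, and to causally intervene on the hidden representations to balance the effect of the class labels, improving performance on CFD and other bias-sensitive tasks.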

Link: https://arxiv.org/abs/2409.16668
Authors: Thong Nguyen, Truc-My Nguyen
Keywords: numerous NLP applications, NLP applications, numerous NLP, CFD, CFD model
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 4th EMNLP-NLP4DH 2024 workshop

Abstract:Counterfactual statements, which describe events that did not or cannot take place, are beneficial to numerous NLP applications. Hence, we consider the problem of counterfactual detection (CFD) and seek to enhance the CFD models. Previous models are reliant on clue phrases to predict counterfactuality, so they suffer from significant performance drop when clue phrase hints do not exist during testing. Moreover, these models tend to predict non-counterfactuals over counterfactuals. To address these issues, we propose to integrate neural topic model into the CFD model to capture the global semantics of the input statement. We continue to causally intervene the hidden representations of the CFD model to balance the effect of the class labels. Extensive experiments show that our approach outperforms previous state-of-the-art CFD and bias-resolving methods in both the CFD and other bias-sensitive tasks.

[NLP-41] A Character-Centric Creative Story Generation via Imagination

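[Quick read]: This paper addresses the limited diversity of story elements and the lack of character detail in creative stories generated by existing large language models. The key is CCI (Character-centric Creative story generation via Imagination), a story generation framework with two modules: IG (Image-Guided Imagination) and MW (Multi-Writer model). IG uses DALL-E 3 to create visual representations of key story elements, producing more novel and concrete characters, backgrounds, and main plots than text-only methods. MW uses the story elements from IG to generate multiple candidate descriptions of the protagonist and selects the best one, bringing vivid and rich character descriptions into the story. CCI significantly improves creativity and enables interactive multi-modal story generation with users.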

Link: https://arxiv.org/abs/2409.16667
Authors: Kyeongman Park, Minbeom Kim, Kyomin Jung
Keywords: Creative story generation, story generation, large language models, Character-centric Creative story, Creative story
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Creative story generation with diverse and detailed story elements is a long-standing goal for large language models. While existing methodologies generate long and coherent stories, they fall significantly short of human capabilities in terms of diversity and character detail. To address this, we introduce a novel story generation framework called CCI (Character-centric Creative story generation via Imagination). CCI features two innovative modules for creative story generation: IG (Image-Guided Imagination) and MW (Multi-Writer model). In the IG module, we utilize DALL-E 3 to create visual representations of key story elements. The IG generates more novel and concrete characters, backgrounds, and main plots than text-only methods. The MW module uses these story elements created by IG to generate multiple description candidates for the protagonist and select the best one. This method incorporates vivid and rich character descriptions into the story. We compared the stories generated by CCI and baseline models through human evaluation and statistical analysis. The results showed significant improvements in the creativity. Furthermore, by enabling interactive multi-modal story generation with users, we have opened up possibilities for human-LLM integration in cultural development.

[NLP-42] Pre-trained Language Models Return Distinguishable Probability Distributions to Unfaithfully Hallucinated Texts EMNLP2024

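[Quick read]: This paper addresses hallucination in text generated by pre-trained language models, that is, unfaithfully hallucinated text. The key is the observation that models return distinguishable generation probability and uncertainty distributions for hallucinated texts, regardless of model size and structure. The paper builds a hallucination-reducing training algorithm on this phenomenon. The algorithm achieves higher faithfulness metrics than other baselines while maintaining sound general text quality.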

Link: https://arxiv.org/abs/2409.16658
Authors: Taehun Cha, Donghun Lee
Keywords: distinguishable generation probability, pre-trained language models, generation probability, probability and uncertainty, size and structure
Categories: Computation and Language (cs.CL)
Comments: 10 pages, EMNLP 2024 Findings

Abstract:In this work, we show the pre-trained language models return distinguishable generation probability and uncertainty distribution to unfaithfully hallucinated texts, regardless of their size and structure. By examining 24 models on 6 data sets, we find out that 88-98% of cases return statistically significantly distinguishable generation probability and uncertainty distributions. Using this general phenomenon, we showcase a hallucination-reducing training algorithm. Our algorithm outperforms other baselines by achieving higher faithfulness metrics while maintaining sound general text quality measures.
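
The two quantities named in the abstract, generation probability and uncertainty, can be computed per sequence from the model's token distributions. A minimal sketch, assuming mean token log-probability and mean token entropy as the statistics: the paper's exact definitions may differ, and the distributions below are made-up toy values rather than model outputs.

```python
import math

def mean_log_prob(token_dists, token_ids):
    """Average log-probability the model assigned to the tokens it generated."""
    total = sum(math.log(dist[t]) for dist, t in zip(token_dists, token_ids))
    return total / len(token_ids)

def mean_entropy(token_dists):
    """Average Shannon entropy (in nats) of the per-token distributions."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(entropies) / len(entropies)

# Two toy sequences over a 3-token vocabulary; token 0 is generated at each step.
peaked = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
flat = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]
ids = [0, 0]

stats = {
    "peaked": (mean_log_prob(peaked, ids), mean_entropy(peaked)),
    "flat": (mean_log_prob(flat, ids), mean_entropy(flat)),
}
```

Collecting these two numbers for many faithful and many hallucinated texts gives two empirical distributions per statistic, which is the kind of object a significance test could then compare.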

[NLP-43] Domain-Independent Automatic Generation of Descriptive Texts for Time-Series Data

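[Quick read]: This paper addresses the scarcity of time-series data paired with descriptive text by proposing a systematic way to generate domain-independent descriptions. It identifies two approaches for creating such pairs, a forward approach and a backward approach, and uses the novel backward approach to build the Temporal Automated Captions for Observations (TACO) dataset. Experiments show that a contrastive-learning-based model trained on TACO can generate descriptive text for time-series data in novel domains.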

链接: https://arxiv.org/abs/2409.16647
作者: Kota Dohi,Aoi Ito,Harsh Purohit,Tomoya Nishida,Takashi Endo,Yohei Kawaguchi
关键词-EN: Due to scarcity, time-series data, descriptive texts, time-series data annotated, time-series
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Due to the scarcity of time-series data annotated with descriptive texts, training a model to generate descriptive texts for time-series data is challenging. In this study, we propose a method to systematically generate domain-independent descriptive texts from time-series data. We identify two distinct approaches for creating pairs of time-series data and descriptive texts: the forward approach and the backward approach. By implementing the novel backward approach, we create the Temporal Automated Captions for Observations (TACO) dataset. Experimental results demonstrate that a contrastive-learning-based model trained using the TACO dataset is capable of generating descriptive texts for time-series data in novel domains.
摘要:由于带有描述性文本的时间序列数据稀缺,训练一个能够为时间序列数据生成描述性文本的模型具有挑战性。在本研究中,我们提出了一种方法,系统地从时间序列数据中生成领域无关的描述性文本。我们识别了两种创建时间序列数据与描述性文本对的不同方法:前向方法和后向方法。通过实施新颖的后向方法,我们创建了时间自动描述观察 (Temporal Automated Captions for Observations, TACO) 数据集。实验结果表明,基于对比学习的模型在使用 TACO 数据集进行训练后,能够为新领域的时间序列数据生成描述性文本。

[NLP-44] Cross-Lingual and Cross-Cultural Variation in Image Descriptions

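[Quick read]: This paper asks whether speakers of different languages describe what they see differently. Using a large multimodal image-description dataset covering 31 languages, the authors develop a method that identifies entities mentioned in captions and present in the images, then compare mentions across languages. Geographically or genetically closer languages tend to mention the same entities, and some entity categories are salient in every language while others vary widely. The results corroborate the theory of basic-level categories and the hypothesis that environments afford patterns of perception, revealing both universal and culture-specific patterns in entity mentions.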

链接: https://arxiv.org/abs/2409.16646
作者: Uri Berger,Edoardo M. Ponti
关键词-EN: languages talk differently, talk differently, languages, languages talk, Abstract
类目: Computation and Language (cs.CL)
备注:

Abstract:Do speakers of different languages talk differently about what they see? Behavioural and cognitive studies report cultural effects on perception; however, these are mostly limited in scope and hard to replicate. In this work, we conduct the first large-scale empirical study of cross-lingual variation in image descriptions. Using a multimodal dataset with 31 languages and images from diverse locations, we develop a method to accurately identify entities mentioned in captions and present in the images, then measure how they vary across languages. Our analysis reveals that pairs of languages that are geographically or genetically closer tend to mention the same entities more frequently. We also identify entity categories whose saliency is universally high (such as animate beings), low (clothing accessories) or displaying high variance across languages (landscape). In a case study, we measure the differences in a specific language pair (e.g., Japanese mentions clothing far more frequently than English). Furthermore, our method corroborates previous small-scale studies, including 1) Rosch et al. (1976)'s theory of basic-level categories, demonstrating a preference for entities that are neither too generic nor too specific, and 2) Miyamoto et al. (2006)'s hypothesis that environments afford patterns of perception, such as entity counts. Overall, our work reveals the presence of both universal and culture-specific patterns in entity mentions.
摘要:不同语言的使用者是否以不同的方式描述他们所看到的事物?行为学和认知学的研究表明,文化对感知有影响;然而,这些研究大多范围有限且难以复制。在本研究中,我们首次进行了大规模的跨语言图像描述变异实证研究。利用包含31种语言和来自不同地点的图像的多模态数据集,我们开发了一种方法,能够准确识别在标题中提及并在图像中呈现的实体,并测量这些实体在不同语言间的变异情况。我们的分析显示,地理位置或基因上更接近的语言对倾向于更频繁地提及相同的实体。我们还识别出一些实体类别,其显著性在所有语言中普遍较高(如生物)、较低(如服饰配件)或显示出跨语言的高变异性(如景观)。在一个案例研究中,我们测量了特定语言对之间的差异(例如,日语比英语更频繁地提及服饰)。此外,我们的方法证实了先前的小规模研究,包括1) Rosch等人(1976年)的基本层次类别理论,表明对既不过于通用也不过于具体的实体的偏好,以及2) Miyamoto等人(2006年)的假设,即环境提供了感知模式,如实体计数。总体而言,我们的研究揭示了实体提及中普遍存在和文化特定的模式。

[NLP-45] Training Language Models to Win Debates with Self-Play Improves Judge Accuracy

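[Quick read]: This paper tests debate as a method of scalable oversight. Models are trained to debate using data generated via self-play, and language-model judges answer long-context reading comprehension questions more accurately when judging models optimized to win debates. Quantitative and qualitative comparisons against new consultancy baselines indicate that debate training encourages stronger, more informative arguments, suggesting it can help provide high-quality supervision for tasks that are difficult to evaluate directly.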

链接: https://arxiv.org/abs/2409.16636
作者: Samuel Arnesen,David Rein,Julian Michael
关键词-EN: generated via self-play, test the robustness, method of scalable, scalable oversight, data generated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 48 pages, 12 figures; code at this https URL

Abstract:We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.
摘要:我们通过训练模型使用自对弈生成的数据进行辩论,测试了辩论作为一种可扩展监督方法的鲁棒性。在长上下文阅读理解任务中,我们发现基于语言模型的评估者在判断经过辩论优化以获胜的模型时,回答问题的准确性更高。相比之下,对于训练成在没有对手辩论者的情况下说服裁判的咨询模型,我们没有发现这种关系。在辩论模型与新颖的咨询基线之间的定量和定性比较中,我们发现辩论训练鼓励了更强有力和更具信息量的论点,表明辩论有望为难以直接评估的任务提供高质量的监督。

[NLP-46] Claim-Guided Textual Backdoor Attack for Practical Applications

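[Quick read]: This paper addresses a practical limitation of existing backdoor attacks: the input must be manipulated after the model is distributed in order to activate the backdoor. The proposed Claim-Guided Backdoor Attack (CGBA) uses claims inherent in the text as triggers, so no external manipulation is needed. Through claim extraction, clustering, and targeted training, CGBA makes the model misbehave on targeted claims without affecting performance on clean data, which improves the feasibility and stealthiness of backdoor attacks in practice.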

链接: https://arxiv.org/abs/2409.16618
作者: Minkyoo Song,Hanna Kim,Jaehan Kim,Youngjin Jin,Seungwon Shin
关键词-EN: natural language processing, Recent advances, large language models, security vulnerabilities, backdoor attacks
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under Review

Abstract:Recent advances in natural language processing and the increased use of large language models have exposed new security vulnerabilities, such as backdoor attacks. Previous backdoor attacks require input manipulation after model distribution to activate the backdoor, posing limitations in real-world applicability. Addressing this gap, we introduce a novel Claim-Guided Backdoor Attack (CGBA), which eliminates the need for such manipulations by utilizing inherent textual claims as triggers. CGBA leverages claim extraction, clustering, and targeted training to trick models to misbehave on targeted claims without affecting their performance on clean data. CGBA demonstrates its effectiveness and stealthiness across various datasets and models, significantly enhancing the feasibility of practical backdoor attacks. Our code and data will be available at this https URL.
摘要:近年来,自然语言处理领域的进步以及大语言模型的广泛应用,暴露了新的安全漏洞,例如后门攻击。以往的后门攻击需要在模型分发后对输入进行操作以激活后门,这在实际应用中存在局限性。针对这一问题,我们提出了一种新颖的声明引导后门攻击 (Claim-Guided Backdoor Attack, CGBA),该方法通过利用文本声明作为触发器,消除了对输入操作的需求。CGBA 利用声明提取、聚类和目标训练,诱导模型在特定声明上表现异常,同时不影响其在干净数据上的性能。CGBA 在多种数据集和模型上展示了其有效性和隐蔽性,显著增强了实际后门攻击的可行性。我们的代码和数据将在以下链接提供:https URL。

[NLP-47] Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications

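[Quick read]: This paper evaluates how well large language models (LLMs) can assess the novelty of scholarly papers. It introduces SchNovel, a benchmark of 15000 pairs of papers from the arXiv dataset across six fields, with publication dates 2 to 10 years apart; the more recent paper in each pair is assumed to be more novel. It also proposes RAG-Novelty, which mimics the human review process by retrieving similar papers to assess novelty, and experiments show that it outperforms recent baseline models.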

链接: https://arxiv.org/abs/2409.16605
作者: Ethan Lin,Zhiyuan Peng,Yi Fang
关键词-EN: large language models, evaluated the creativity, semantic perspective, cognitive science, studies have evaluated
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: under review

Abstract:Recent studies have evaluated the creativity/novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing novelty in scholarly publications is a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs’ ability to assess novelty in scholarly papers. SchNovel consists of 15000 pairs of papers across six fields sampled from the arXiv dataset with publication dates spanning 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process taken by human reviewers by leveraging the retrieval of similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.
摘要:近期研究主要从语义角度评估大语言模型 (LLMs) 的创造性/新颖性,使用认知科学的基准进行测试。然而,在学术出版物中评估新颖性是一个尚未充分探索的领域。本文中,我们引入了一个学术新颖性基准 (SchNovel),用于评估大语言模型在学术论文中新颖性评估的能力。SchNovel 包含 15000 对来自 arXiv 数据集的论文,涵盖六个领域,出版日期相隔 2 至 10 年。在每一对中,较近期发表的论文被认为更具新颖性。此外,我们提出了 RAG-Novelty,通过模拟人类评审员检索相似论文的过程来评估新颖性。广泛的实验揭示了不同大语言模型评估新颖性的能力,并证明 RAG-Novelty 优于近期的基线模型。
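The pairwise setup described above (the newer paper in each pair is assumed to be more novel) lends itself to a simple accuracy metric. The sketch below is an assumption-laden illustration rather than the benchmark's actual evaluation code: `PaperPair`, `pairwise_accuracy`, and the toy judge are hypothetical names, and alternating the presentation order is a design choice added here to dampen position bias.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PaperPair:
    older_abstract: str
    newer_abstract: str


def pairwise_accuracy(pairs: List[PaperPair], judge: Callable[[str, str], int]) -> float:
    """Fraction of pairs where the judge picks the newer paper as more novel.

    judge(a, b) returns 0 if it deems the first text more novel, 1 for the second.
    """
    correct = 0
    for i, pair in enumerate(pairs):
        if i % 2 == 0:
            # Newer paper shown second, so the right answer is 1.
            correct += judge(pair.older_abstract, pair.newer_abstract) == 1
        else:
            # Newer paper shown first, so the right answer is 0.
            correct += judge(pair.newer_abstract, pair.older_abstract) == 0
    return correct / len(pairs)


def longer_is_newer(a: str, b: str) -> int:
    """Toy stand-in for an LLM judge: prefers the longer text."""
    return 0 if len(a) >= len(b) else 1


pairs = [PaperPair("a", "bbb"), PaperPair("cc", "d")]
print(pairwise_accuracy(pairs, longer_is_newer))
```

In practice the judge would wrap an LLM call, optionally with retrieved similar papers in the prompt, in the spirit of RAG-Novelty.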

[NLP-48] Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!” ACL

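[Quick read]: This paper applies natural language generation to clinical report writing, aiming to automate sections of radiology reports and discharge summaries in order to reduce physician workload and burnout. The shared task has two subtasks: (1) Radiology Report Generation (RRG24), which generates the "Findings" and "Impression" sections from chest X-rays, and (2) Discharge Summary Generation (“Discharge Me!”), which generates the "Brief Hospital Course" and "Discharge Instructions" sections for patients admitted through the emergency department. Together the tasks explore how well generation systems can support clinical documentation.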

链接: https://arxiv.org/abs/2409.16603
作者: Justin Xu,Zhihong Chen,Andrew Johnston,Louis Blankemeier,Maya Varma,Jason Hom,William J. Collins,Ankit Modi,Robert Lloyd,Benjamin Hopkins,Curtis Langlotz,Jean-Benoit Delbrouck
关键词-EN: Recent developments, natural language generation, Discharge Summary Generation, implications for healthcare, Discharge
类目: Computation and Language (cs.CL)
备注: ACL Proceedings. BioNLP workshop

Abstract:Recent developments in natural language generation have tremendous implications for healthcare. For instance, state-of-the-art systems could automate the generation of sections in clinical reports to alleviate physician workload and streamline hospital documentation. To explore these applications, we present a shared task consisting of two subtasks: (1) Radiology Report Generation (RRG24) and (2) Discharge Summary Generation (“Discharge Me!”). RRG24 involves generating the ‘Findings’ and ‘Impression’ sections of radiology reports given chest X-rays. “Discharge Me!” involves generating the ‘Brief Hospital Course’ and ‘Discharge Instructions’ sections of discharge summaries for patients admitted through the emergency department. “Discharge Me!” submissions were subsequently reviewed by a team of clinicians. Both tasks emphasize the goal of reducing clinician burnout and repetitive workloads by generating documentation. We received 201 submissions from across 8 teams for RRG24, and 211 submissions from across 16 teams for “Discharge Me!”.
摘要:自然语言生成领域的最新进展对医疗保健领域具有重大意义。例如,最先进的系统可以自动化生成临床报告中的部分内容,以减轻医生的工作负担并简化医院文档管理。为了探索这些应用,我们提出了一个包含两个子任务的共享任务:(1) 放射报告生成 (Radiology Report Generation, RRG24) 和 (2) 出院总结生成 (“Discharge Me!”)。RRG24 涉及在给定胸部 X 光片的情况下生成放射报告中的“发现”和“印象”部分。“Discharge Me!” 涉及为通过急诊部门入院的患者生成出院总结中的“简要住院过程”和“出院指导”部分。“Discharge Me!” 的提交内容随后由一组临床医生进行了评审。这两个任务都强调通过生成文档来减少临床医生的职业倦怠和重复性工作负担的目标。我们收到了来自 8 个团队的 201 份 RRG24 提交,以及来自 16 个团队的 211 份 “Discharge Me!” 提交。

[NLP-49] Disentangling Questions from Query Generation for Task-Adaptive Retrieval

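[Quick read]: This paper studies how to adapt information retrieval to unseen tasks. Existing synthetic query generators treat every query as a question and therefore fail to cover general search intents. The authors reframe query generation as “compiling” a high-level intent into a task-adaptive query and propose EGG, a query generator that explicitly instructs the language model (LM) with the search intent. EGG adapts better to the wide range of search intents in the BeIR benchmark, outperforming baselines and existing models on four tasks while using a query generator 47 times smaller than the previous state of the art.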

链接: https://arxiv.org/abs/2409.16570
作者: Yoonsang Lee,Minsoo Kim,Seung-won Hwang
关键词-EN: information retrieval, studies the problem, problem of information, query, query generator
类目: Computation and Language (cs.CL)
备注:

Abstract:This paper studies how to adapt information retrieval to unseen tasks. Existing work generates synthetic queries from domain-specific documents to jointly train the retriever. However, the conventional query generator assumes the query is a question, thus failing to accommodate general search intents. A more lenient approach incorporates task-adaptive elements, such as few-shot learning with a 137B LLM. In this paper, we challenge the trend of equating queries with questions, and instead conceptualize the query generation task as a “compilation” of high-level intent into a task-adaptive query. Specifically, we propose EGG, a query generator that better adapts to the wide search intents expressed in the BeIR benchmark. Our method outperforms baselines and existing models on four tasks with underexplored intents, while utilizing a query generator 47 times smaller than the previous state-of-the-art. Our findings reveal that instructing the LM with explicit search intent is a key aspect of modeling an effective query generator.
摘要:本文研究了信息检索问题,旨在适应未见任务。现有工作通过从特定领域文档生成合成查询来联合训练检索器。然而,传统的查询生成器假设查询为问题形式,因此无法适应一般的搜索意图。一种更为宽松的方法引入了任务自适应元素,例如使用137B大语言模型进行少样本学习。本文挑战了将查询等同于问题的趋势,而是将查询生成任务概念化为将高级意图“编译”成任务自适应查询。具体而言,我们提出了EGG,一种能更好地适应BeIR基准中广泛搜索意图的查询生成器。我们的方法在四个任务上超越了基线和现有模型,这些任务涉及未充分探索的意图,同时使用的查询生成器规模仅为先前最先进模型的1/47。我们的研究发现,明确指示语言模型进行搜索意图是构建有效查询生成器的关键方面。

[NLP-50] Understanding the Cognitive Complexity in Language Elicited by Product Images

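[Quick read]: This paper addresses how to measure and validate the cognitive complexity of language elicited by product images. The approach approximates human-rated complexity with a combination of natural language models that together roughly capture the complexity construct, offering a tool for understanding the cognitive processes of both humans and virtual respondents simulated by large language models (LLMs). It is minimally supervised and scalable, even when human assessments of complexity are limited.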

链接: https://arxiv.org/abs/2409.16521
作者: Yan-Ying Chen,Shabnam Hakimi,Monica Van,Francine Chen,Matthew Hong,Matt Klenk,Charlene Wu
关键词-EN: surface-level perceptual attributes, consumer-reported features expressed, including surface-level perceptual, Product images, perceptual attributes
类目: Computation and Language (cs.CL)
备注:

Abstract:Product images (e.g., a phone) can be used to elicit a diverse set of consumer-reported features expressed through language, including surface-level perceptual attributes (e.g., “white”) and more complex ones, like perceived utility (e.g., “battery”). The cognitive complexity of elicited language reveals the nature of cognitive processes and the context required to understand them; cognitive complexity also predicts consumers’ subsequent choices. This work offers an approach for measuring and validating the cognitive complexity of human language elicited by product images, providing a tool for understanding the cognitive processes of human as well as virtual respondents simulated by Large Language Models (LLMs). We also introduce a large dataset that includes diverse descriptive labels for product images, including human-rated complexity. We demonstrate that human-rated cognitive complexity can be approximated using a set of natural language models that, combined, roughly capture the complexity construct. Moreover, this approach is minimally supervised and scalable, even in use cases with limited human assessment of complexity.
摘要:产品图像(例如,一部手机)可以用来引出一系列通过语言表达的消费者报告特征,包括表面层次的感知属性(例如,“白色”)和更复杂的属性,如感知效用(例如,“电池”)。引出语言的认知复杂性揭示了认知过程的本质以及理解这些过程所需的环境;认知复杂性还预测了消费者随后的选择。这项工作提供了一种测量和验证由产品图像引发的人类语言认知复杂性的方法,为理解人类以及由大语言模型(LLMs)模拟的虚拟受访者的认知过程提供了一个工具。我们还引入了一个大型数据集,其中包括产品图像的多样化描述标签,包括人类评定的复杂性。我们证明,人类评定的认知复杂性可以通过一组自然语言模型来近似,这些模型结合起来大致捕捉了复杂性结构。此外,这种方法在最小监督下是可扩展的,即使在复杂性评估有限的使用案例中也能应用。

[NLP-51] A Unified Hallucination Mitigation Framework for Large Vision-Language Models

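[Quick read]: This paper addresses hallucination in large vision-language models (LVLMs), where parts of a long generation are inconsistent with the image. It proposes Dentist, a unified framework that first classifies the query and then applies a mitigation procedure chosen by the classification result, much as a dentist examines the teeth before making a treatment plan. In a simple deployment, queries are classified as perception or reasoning. On MMbench, accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, improves by 13.44%/10.2%/15.8% over the baselines InstructBLIP/LLaVA/VisualGLM.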

链接: https://arxiv.org/abs/2409.16494
作者: Yue Chang,Liqiang Jing,Xiaopeng Zhang,Yue Zhang
关键词-EN: Large Vision-Language Models, problem for Large, Large Vision-Language, difficult to eradicate, common problem
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted by TMLR

Abstract:Hallucination is a common problem for Large Vision-Language Models (LVLMs) that produce long generations, and it is difficult to eradicate. A generation with hallucinations is partially inconsistent with the image content. To mitigate hallucination, current studies focus either on the process of model inference or on the results of model generation, but the solutions they design sometimes do not deal appropriately with the various types of queries and the hallucinations in the generations for those queries. To deal accurately with various hallucinations, we present a unified framework, Dentist, for hallucination mitigation. The core step is to first classify the queries, then perform different processes of hallucination mitigation based on the classification result, just like a dentist first observes the teeth and then makes a plan. In a simple deployment, Dentist can classify queries as perception or reasoning and easily mitigate potential hallucinations in answers, as demonstrated in our experiments. On MMbench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, over the baseline InstructBLIP/LLaVA/VisualGLM.
摘要:幻觉是大型视觉语言模型 (Large Vision-Language Models, LVLMs) 在长生成过程中常见的问题,难以完全消除。带有幻觉的生成内容部分与图像内容不一致。为了缓解幻觉问题,当前的研究要么集中在模型推理过程,要么关注模型生成结果,但它们设计的解决方案有时无法妥善处理各种类型的查询及其对应的生成幻觉。为了准确应对各种幻觉,我们提出了一个统一的框架——Dentist,用于幻觉缓解。核心步骤是首先对查询进行分类,然后根据分类结果执行不同的幻觉缓解过程,就像牙医首先观察牙齿,然后制定治疗计划一样。在简单的部署中,Dentist 可以将查询分类为感知或推理,并轻松缓解答案中潜在的幻觉,这在我们的实验中得到了验证。在 MMbench 上,我们在图像质量、粗略感知视觉问答 (VQA) 任务上分别比基线 InstructBLIP/LLaVA/VisualGLM 提高了 13.44%/10.2%/15.8% 的准确率。
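The classify-then-treat idea can be sketched as a small dispatcher. Everything below is a hypothetical illustration of the control flow only: the keyword classifier and the two handlers are placeholders invented for this example, whereas the actual framework would use model calls for both classification and mitigation.

```python
from typing import Callable, Dict

Handler = Callable[[str, str], str]


def mitigate(query: str, answer: str,
             classify: Callable[[str], str],
             handlers: Dict[str, Handler]) -> str:
    """Classify the query, then apply the mitigation handler for its class."""
    kind = classify(query)
    handler = handlers.get(kind)
    # Unknown classes pass through unchanged rather than guessing a treatment.
    return handler(query, answer) if handler else answer


def toy_classify(query: str) -> str:
    """Hypothetical keyword classifier standing in for a model-based one."""
    reasoning_cues = ("why", "how many", "what if", "explain")
    return "reasoning" if any(c in query.lower() for c in reasoning_cues) else "perception"


def verify_against_image(query: str, answer: str) -> str:
    """Placeholder: a real handler would re-check the answer against the image."""
    return f"[perception-checked] {answer}"


def recheck_reasoning(query: str, answer: str) -> str:
    """Placeholder: a real handler would re-derive the answer step by step."""
    return f"[reasoning-checked] {answer}"


HANDLERS = {"perception": verify_against_image, "reasoning": recheck_reasoning}
print(mitigate("What color is the car?", "Red.", toy_classify, HANDLERS))
```

Keeping the handlers in a dictionary means a new query class can be added without touching the dispatcher.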

[NLP-52] Exploring Knowledge Tracing in Tutor-Student Dialogues

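[Quick read]: This paper addresses knowledge tracing (KT) in tutor-student dialogues. It proposes large language model (LLM) prompting methods that identify the knowledge components/skills involved in each dialogue turn and diagnose whether the student responds correctly to the tutor, and it verifies the LLM's effectiveness with an expert human evaluation. A range of KT methods is then applied to the labeled data to track student knowledge over a whole dialogue. Experiments show that an LLM-based KT method (LLMKT) significantly outperforms existing KT methods at predicting the correctness of student responses.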

链接: https://arxiv.org/abs/2409.16490
作者: Alexander Scarlatos,Andrew Lan
关键词-EN: high-quality personalized education, providing broad access, powered tutoring chatbots, Recent advances, large language models
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

Abstract:Recent advances in large language models (LLMs) have led to the development of artificial intelligence (AI)-powered tutoring chatbots, showing promise in providing broad access to high-quality personalized education. Existing works have primarily studied how to make LLMs follow tutoring principles but not how to model student behavior in dialogues. However, analyzing student dialogue turns can serve as a formative assessment, since open-ended student discourse may indicate their knowledge levels and reveal specific misconceptions. In this work, we present a first attempt at performing knowledge tracing (KT) in tutor-student dialogues. We propose LLM prompting methods to identify the knowledge components/skills involved in each dialogue turn and diagnose whether the student responds correctly to the tutor, and verify the LLM’s effectiveness via an expert human evaluation. We then apply a range of KT methods on the resulting labeled data to track student knowledge levels over an entire dialogue. We conduct experiments on two tutoring dialogue datasets, and show that a novel yet simple LLM-based method, LLMKT, significantly outperforms existing KT methods in predicting student response correctness in dialogues. We perform extensive qualitative analyses to highlight the challenges in dialogue KT and outline multiple avenues for future work.
摘要:近年来,大语言模型 (LLM) 的进步推动了人工智能 (AI) 驱动的辅导聊天机器人的发展,显示出为广泛人群提供高质量个性化教育的潜力。现有研究主要集中在如何使 LLM 遵循辅导原则,而未深入探讨如何在对话中建模学生行为。然而,分析学生对话轮次可以作为一种形成性评估,因为开放式的学生话语可能表明其知识水平并揭示特定的误解。在此研究中,我们首次尝试在辅导员与学生对话中进行知识追踪 (KT)。我们提出 LLM 提示方法,以识别每个对话轮次中涉及的知识组件/技能,并诊断学生是否正确回应辅导员,并通过专家人工评估验证 LLM 的有效性。随后,我们在生成的标注数据上应用一系列 KT 方法,以追踪学生在整个对话过程中的知识水平。我们在两个辅导对话数据集上进行了实验,结果表明,一种新颖但简单的基于 LLM 的方法,即 LLMKT,在预测学生对话回应正确性方面显著优于现有的 KT 方法。我们进行了广泛的定性分析,以突出对话 KT 中的挑战,并概述了未来工作的多个方向。
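For readers unfamiliar with knowledge tracing, the sketch below shows classic Bayesian Knowledge Tracing (BKT), one long-standing KT method. It is not LLMKT and is not taken from the paper; it only illustrates how a per-skill mastery estimate can be updated from the correct/incorrect labels that the LLM annotation step would produce. The parameter values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    p_init: float = 0.3   # prior probability the skill is already known
    p_learn: float = 0.1  # probability of learning the skill after a turn
    p_slip: float = 0.1   # probability of answering wrong despite knowing
    p_guess: float = 0.2  # probability of answering right without knowing


def predict_correct(p_known: float, params: BKTParams) -> float:
    """Probability that the next response on this skill is correct."""
    return p_known * (1 - params.p_slip) + (1 - p_known) * params.p_guess


def update(p_known: float, correct: bool, params: BKTParams) -> float:
    """Bayesian posterior on mastery given one response, then a learning step."""
    if correct:
        evidence = p_known * (1 - params.p_slip)
        posterior = evidence / (evidence + (1 - p_known) * params.p_guess)
    else:
        evidence = p_known * params.p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - params.p_guess))
    return posterior + (1 - posterior) * params.p_learn


params = BKTParams()
p = params.p_init
for turn_correct in [False, True, True]:  # hypothetical per-turn labels
    p = update(p, turn_correct, params)
print(predict_correct(p, params))
```

Each dialogue turn labeled with a skill and a correctness flag yields one `update` call for that skill.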

[NLP-53] Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices

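[Quick read]: This paper addresses the difficulty that end-to-end automatic speech recognition (ASR) models have with personal or rare phrases, in particular models trained with connectionist temporal classification (CTC), whose non-autoregressive, context-independent beam search produces noisy hypotheses. It presents a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. The algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces, avoiding explicit word representations and exploiting the richness of the CTC lattice. It requires no retraining or modification of the ASR model and achieves up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.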

链接: https://arxiv.org/abs/2409.16469
作者: Leonid Velikovich,Christopher Li,Diamantino Caseiro,Shankar Kumar,Pat Rondon,Kandarp Joshi,Xavier Velez
关键词-EN: Automatic Speech Recognition, Automatic Speech, Speech Recognition, recognizing personal, personal or rare
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages, 7 figures

Abstract:For end-to-end Automatic Speech Recognition (ASR) models, recognizing personal or rare phrases can be hard. A promising way to improve accuracy is through spelling correction (or rewriting) of the ASR lattice, where potentially misrecognized phrases are replaced with acoustically similar and contextually relevant alternatives. However, rewriting is challenging for ASR models trained with connectionist temporal classification (CTC) due to noisy hypotheses produced by a non-autoregressive, context-independent beam search. We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations and exploiting the richness of the CTC lattice. Our approach requires no retraining or modification of the ASR model. We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
摘要:对于端到端自动语音识别 (ASR) 模型,识别个人或罕见短语可能较为困难。提高准确性的一个有前景的方法是通过拼写校正(或重写)ASR 格网,其中潜在的误识别短语被替换为声学上相似且上下文相关的替代短语。然而,由于非自回归、上下文无关的束搜索产生的噪声假设,重写对于使用连接主义时间分类 (CTC) 训练的 ASR 模型来说具有挑战性。我们提出了一种有限状态转换器 (FST) 技术,用于重写基于 Transformer 的 CTC 模型生成的词片格网。我们的算法直接从词片进行字素到音素 (G2P) 转换,避免了显式的词表示,并利用了 CTC 格网的丰富性。我们的方法不需要重新训练或修改 ASR 模型。我们在包含上下文相关实体的测试集上实现了高达 15.2% 的相对句子错误率 (SER) 降低。
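The headline figure is a relative reduction, which is easy to confuse with an absolute one. The short sketch below shows the arithmetic. The baseline and rewritten SER values are hypothetical numbers chosen only so that the result equals 15.2%; the abstract does not report the underlying rates.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Relative reduction in sentence error rate: (baseline - new) / baseline."""
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return (baseline_ser - new_ser) / baseline_ser


# Hypothetical rates: 25.0% SER before rewriting, 21.2% after.
print(f"{relative_reduction(0.25, 0.212):.1%}")
```

With these assumed rates the absolute drop is 3.8 percentage points, while the relative reduction is 15.2%.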


[NLP-54] Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification

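[Quick read]: This paper addresses errors that large language models (LLMs) make when translating natural language into first-order logic (FOL) for logical reasoning. To improve FOL translation in smaller models such as LLaMA-2 13B and Mistral 7B, it creates ProofFOL, a high-quality FOL-annotated dataset, together with an incremental framework of data augmentation and verification. GPT-4o generates the FOL annotations, augmentation expands the training data, and a trained verifier corrects syntactic and semantic errors in FOL translations, which yields substantial gains even from limited annotated data.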

链接: https://arxiv.org/abs/2409.16461
作者: Ramya Keerthy Thatikonda,Jiuzhou Han,Wray Buntine,Ehsan Shareghi
关键词-EN: Logical reasoning, presents significant challenges, natural language processing, symbolic logical reasoning, FOL
类目: Computation and Language (cs.CL)
备注:

Abstract:Logical reasoning is a fundamental task in natural language processing that presents significant challenges to Large Language Models (LLMs). The inherent characteristics of logical reasoning make it well-suited for symbolic representations such as first-order logic (FOL). Research in symbolic logical reasoning has explored FOL generation using state-of-the-art LLMs (e.g., GPT-4) to produce FOL translations of natural language (NL) statements, but errors in translation are usually not the focus. We address this by categorizing the translation errors in FOL statements generated by LLMs. To make progress towards improving the quality of FOL translations for smaller language models such as LLaMA-2 13B and Mistral 7B, we create ProofFOL, a high-quality FOL-annotated subset of the ProofWriter dataset, using GPT-4o. The models fine-tuned on this silver standard data achieve a significant gain in performance when compared to larger language models such as LLaMA-2 70B. In addition to improving the model using large data, we also tackle the issue of data scarcity and introduce an incremental framework encompassing data augmentation and verification steps. In the augmentation process, a single pair of (premises, conclusion) is split into multiple new instances based on the predicates and FOLs. This data is used for fine-tuning, and inference with the resulting model generates FOLs with fewer errors than the model trained on the original data. Our investigation of the translation errors leads to the generation of a perturbation dataset, which is used to train a verifier that corrects potential syntactic and semantic FOL translation errors. We demonstrate an efficient method for making the most of a limited existing human-annotated dataset. Our results show state-of-the-art performance for ProofWriter and ProntoQA datasets using ProofFOL on LLaMA-2 and Mistral models.
摘要:逻辑推理是自然语言处理中的一个基本任务,对大语言模型 (LLM) 提出了重大挑战。逻辑推理的内在特性使其非常适合符号表示,如一阶逻辑 (FOL)。在符号逻辑推理的研究中,利用最先进的 LLM (如 GPT-4) 生成 FOL 来翻译自然语言 (NL) 语句,但翻译错误通常不是研究的重点。我们通过分类 LLM 生成的 FOL 语句中的翻译错误来解决这一问题。为了提高 LLaMA-2 13B 和 Mistral 7B 等较小语言模型的 FOL 翻译质量,我们创建了 ProofFOL,这是一个使用 GPT-4o 标注的高质量 FOL 子集,基于 ProofWriter 数据集。在银标准数据上微调的模型与 LLaMA-2 70B 等较大语言模型相比,性能显著提升。除了通过大数据改进模型外,我们还解决了数据稀缺问题,并引入了一个包含数据增强和验证步骤的增量框架。在增强过程中,一对 (前提,结论) 根据谓词和 FOL 被拆分为多个新实例。这些数据用于微调,该模型在推理过程中生成的 FOL 错误少于在原始数据上训练的模型。我们对翻译错误的调查导致生成了一个扰动数据集,用于训练一个校正潜在语法和语义 FOL 翻译错误的验证器。我们展示了一种有效利用有限现有人类标注数据集的方法。我们的结果显示,在使用 ProofFOL 的 LLaMA-2 和 Mistral 模型上,ProofWriter 和 ProntoQA 数据集的性能达到了最先进水平。

[NLP-55] FMDLlama: Financial Misinformation Detection based on Large Language Models

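[Quick read]: This paper addresses financial misinformation detection (FMD), where progress has been held back by the lack of instruction-tuning datasets and evaluation benchmarks. It proposes FMDLlama, the first open-source instruction-following large language model for FMD, obtained by fine-tuning Llama3.1. It also introduces the first multi-task FMD instruction dataset (FMDID) and a comprehensive FMD evaluation benchmark (FMD-B) with classification and explanation-generation tasks. On FMD-B, the model outperforms all other open-source large language models as well as ChatGPT.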

链接: https://arxiv.org/abs/2409.16452
作者: Zhiwei Liu,Xin Zhang,Kailai Yang,Qianqian Xie,Jimin Huang,Sophia Ananiadou
关键词-EN: made financial misinformation, FMD, emergence of social, social media, misinformation easier
类目: Computation and Language (cs.CL)
备注: work in progress

Abstract:The emergence of social media has made the spread of misinformation easier. In the financial domain, the accuracy of information is crucial for various aspects of the financial market, which has made financial misinformation detection (FMD) an urgent problem that needs to be addressed. Large language models (LLMs) have demonstrated outstanding performance in various fields. However, current studies mostly rely on traditional methods and have not explored the application of LLMs in the field of FMD. The main reason is the lack of FMD instruction tuning datasets and evaluation benchmarks. In this paper, we propose FMDLlama, the first open-sourced instruction-following LLMs for the FMD task, based on fine-tuning Llama3.1 with instruction data; the first multi-task FMD instruction dataset (FMDID) to support LLM instruction tuning; and a comprehensive FMD evaluation benchmark (FMD-B) with classification and explanation generation tasks to test the FMD ability of LLMs. We compare our models with a variety of LLMs on FMD-B, where our model outperforms all other open-sourced LLMs as well as ChatGPT.
摘要:社交媒体的出现使得错误信息的传播变得更加容易。在金融领域,信息的准确性对金融市场各个方面至关重要,这使得金融错误信息检测 (FMD) 成为一个亟待解决的问题。大语言模型 (LLMs) 在多个领域展示了卓越的性能。然而,当前的研究大多依赖传统方法,并未深入探讨 LLMs 在 FMD 领域的应用。主要原因是缺乏 FMD 指令调优数据集和评估基准。本文提出 FMDLlama,这是首个基于 Llama3.1 微调并结合指令数据的 FMD 任务开源指令跟随 LLM,首个支持 LLM 指令调优的多任务 FMD 指令数据集 (FMDID),以及一个包含分类和解释生成任务的综合 FMD 评估基准 (FMD-B),用于测试 LLMs 的 FMD 能力。我们在 FMD-B 上比较了我们的模型与其他多种 LLMs,结果显示我们的模型优于所有其他开源 LLMs 以及 ChatGPT。

[NLP-56] A Comprehensive Survey of Bias in LLMs: Current Landscape and Future Directions

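[Quick read]: This paper surveys bias in large language models (LLMs). It systematically categorizes biases and reviews their types, sources, impacts, and mitigation strategies, giving researchers, practitioners, and policymakers a foundational resource. It synthesizes current research findings, discusses the implications of bias in real-world applications, critically assesses existing mitigation techniques, and proposes future research directions to improve fairness and equity in LLMs.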

链接: https://arxiv.org/abs/2409.16430
作者: Rajesh Ranjan,Shailja Gupta,Surya Narayan Singh
关键词-EN: Large Language Models, natural language processing, Large Language, unprecedented text generation, providing unprecedented text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 2 Tables, 1 Figure

Abstract:Large Language Models (LLMs) have revolutionized various applications in natural language processing (NLP) by providing unprecedented text generation, translation, and comprehension capabilities. However, their widespread deployment has brought to light significant concerns regarding biases embedded within these models. This paper presents a comprehensive survey of biases in LLMs, aiming to provide an extensive review of the types, sources, impacts, and mitigation strategies related to these biases. We systematically categorize biases into several dimensions. Our survey synthesizes current research findings and discusses the implications of biases in real-world applications. Additionally, we critically assess existing bias mitigation techniques and propose future research directions to enhance fairness and equity in LLMs. This survey serves as a foundational resource for researchers, practitioners, and policymakers concerned with addressing and understanding biases in LLMs.
摘要:大语言模型 (Large Language Models, LLMs) 通过提供前所未有的文本生成、翻译和理解能力,彻底改变了自然语言处理 (Natural Language Processing, NLP) 中的各种应用。然而,这些模型的广泛部署也揭示了其中嵌入的显著偏见问题。本文对大语言模型中的偏见进行了全面综述,旨在提供一个关于这些偏见的类型、来源、影响及缓解策略的广泛回顾。我们系统地将偏见分类为几个维度。本综述综合了当前的研究成果,并讨论了偏见在实际应用中的影响。此外,我们批判性地评估了现有的偏见缓解技术,并提出了未来研究方向,以增强大语言模型中的公平性和公正性。本综述为关注并理解大语言模型中偏见问题的研究人员、实践者和政策制定者提供了一个基础资源。

[NLP-57] Revisiting Acoustic Features for Robust ASR ICASSP2025

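[Quick read]: This paper addresses the robustness of automatic speech recognition (ASR) systems to real-world noise, including environmental noise, room impulse response, special effects, and adversarial attacks. It revisits acoustic features inspired by biological auditory perception and proposes two new ones, the frequency masked spectrogram (FreqMask) and the difference of gammatones spectrogram (DoGSpec), which simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments show that DoGSpec is significantly more robust than the log mel spectrogram (LogMelSpec) with minimal loss of accuracy, while GammSpec achieves better accuracy and robustness to non-adversarial noise but is outperformed by DoGSpec under adversarial attacks.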

链接: https://arxiv.org/abs/2409.16399
作者: Muhammad A. Shah,Bhiksha Raj
关键词-EN: Automatic Speech Recognition, room impulse response, real-world environments including, environments including environmental, Deep Neural Networks
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: submitted to ICASSP 2025

Abstract:Automatic Speech Recognition (ASR) systems must be robust to the myriad types of noises present in real-world environments including environmental noise, room impulse response, special effects as well as attacks by malicious actors (adversarial attacks). Recent works seek to improve accuracy and robustness by developing novel Deep Neural Networks (DNNs) and curating diverse training datasets for them, while using relatively simple acoustic features. While this approach improves robustness to the types of noise present in the training data, it confers limited robustness against unseen noises and negligible robustness to adversarial attacks. In this paper, we revisit the approach of earlier works that developed acoustic features inspired by biological auditory perception that could be used to perform accurate and robust ASR. Specifically, we evaluate the ASR accuracy and robustness of several biologically inspired acoustic features. In addition to several features from prior works, such as gammatone filterbank features (GammSpec), we also propose two new acoustic features called frequency masked spectrogram (FreqMask) and difference of gammatones spectrogram (DoGSpec) to simulate the neuro-psychological phenomena of frequency masking and lateral suppression. Experiments on diverse models and datasets show that (1) DoGSpec achieves significantly better robustness than the highly popular log mel spectrogram (LogMelSpec) with minimal accuracy degradation, and (2) GammSpec achieves better accuracy and robustness to non-adversarial noises from the Speech Robust Bench benchmark, but it is outperformed by DoGSpec against adversarial attacks.
摘要:自动语音识别 (Automatic Speech Recognition, ASR) 系统必须能够应对现实环境中存在的多种噪声,包括环境噪声、房间脉冲响应、特殊效果以及恶意攻击者发起的对抗攻击 (adversarial attacks)。近期研究通过开发新型深度神经网络 (Deep Neural Networks, DNNs) 并为其构建多样化的训练数据集,同时使用相对简单的声学特征,来提升系统的准确性和鲁棒性。尽管这种方法提高了对训练数据中存在的噪声类型的鲁棒性,但对于未见过的噪声和对抗攻击的鲁棒性提升有限。本文重新审视了早期研究中基于生物听觉感知启发开发的声学特征,这些特征能够实现准确且鲁棒的 ASR。具体而言,我们评估了几种生物启发声学特征的 ASR 准确性和鲁棒性。除了来自先前工作的几种特征,如伽马通滤波器组特征 (GammSpec),我们还提出了两种新的声学特征:频率掩蔽频谱图 (FreqMask) 和伽马通差分频谱图 (DoGSpec),以模拟频率掩蔽和侧抑制的神经心理学现象。在多样化的模型和数据集上的实验表明:(1) DoGSpec 在保持最小准确率下降的情况下,显著优于广受欢迎的对数梅尔频谱图 (LogMelSpec) 的鲁棒性;(2) GammSpec 在 Speech Robust Bench 基准测试中对非对抗性噪声的准确性和鲁棒性表现更好,但在对抗攻击方面被 DoGSpec 超越。

[NLP-58] RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation

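[Quick read]: This paper addresses the weak performance of large language models (LLMs) on riddle solving, in particular their limitations when advanced reasoning and abstract thinking are required. The key to the solution is RISCORE (RIddle Solving with COntext REconstruction), a new fully automated prompting method that generates contextually reconstructed sentence-based riddles and combines them with the original examples to build few-shot exemplars. This markedly improves model performance on both vertical and lateral thinking tasks and outperforms traditional exemplar selection strategies.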

Link: https://arxiv.org/abs/2409.16383
Authors: Ioannis Panagiotopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
Keywords: Riddle-solving requires advanced, advanced reasoning skills, requires advanced reasoning, diverse reasoning skills, reasoning skills
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Riddle-solving requires advanced reasoning skills, pushing LLMs to engage in abstract thinking and creative problem-solving, often revealing limitations in their cognitive abilities. In this paper, we examine the riddle-solving capabilities of LLMs using a multiple-choice format, exploring how different prompting techniques impact performance on riddles that demand diverse reasoning skills. To enhance results, we introduce RISCORE (RIddle Solving with COntext REconstruction), a novel fully automated prompting method that generates and utilizes contextually reconstructed sentence-based puzzles in conjunction with the original examples to create few-shot exemplars. Our experiments demonstrate that RISCORE significantly improves the performance of language models in both vertical and lateral thinking tasks, surpassing traditional exemplar selection strategies across a variety of few-shot settings.

[NLP-59] Do the Right Thing Just Debias! Multi-Category Bias Mitigation Using LLMs

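[Quick read]: This paper addresses the problem of building robust, generalizable bias mitigation models for language. The key to the solution is ANUBIS, a new dataset of 1507 carefully curated sentence pairs covering nine social bias categories. Using state-of-the-art models such as T5 together with supervised fine-tuning (SFT), reinforcement learning (PPO, DPO), and in-context learning (ICL), the paper evaluates multi-class social bias reduction, cross-dataset generalization, and the environmental impact of the trained models. ANUBIS and the findings are offered as resources for building more equitable AI systems and for developing responsible, unbiased technology.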

Link: https://arxiv.org/abs/2409.16371
Authors: Amartya Roy, Danush Khanna, Devanshu Mahapatra, Vasanthakumar, Avirup Das, Kripabandhu Ghosh
Keywords: generalizable bias mitigation, paper tackles, tackles the challenge, robust and generalizable, Reinforcement Learning
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 5 Figures

Abstract:This paper tackles the challenge of building robust and generalizable bias mitigation models for language. Recognizing the limitations of existing datasets, we introduce ANUBIS, a novel dataset with 1507 carefully curated sentence pairs encompassing nine social bias categories. We evaluate state-of-the-art models like T5, utilizing Supervised Fine-Tuning (SFT), Reinforcement Learning (PPO, DPO), and In-Context Learning (ICL) for effective bias mitigation. Our analysis focuses on multi-class social bias reduction, cross-dataset generalizability, and environmental impact of the trained models. ANUBIS and our findings offer valuable resources for building more equitable AI systems and contribute to the development of responsible and unbiased technologies with broad societal impact.

[NLP-60] Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

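[Quick read]: This paper addresses how to assess the quality of training data for large language models (LLMs) that use external tools. The key to the solution is two approaches for assessing data reliability: one based on intuitive, human-defined correctness criteria, and one based on model-driven in-context evaluation. The paper evaluates data quality with both approaches on two popular benchmarks and then shows its effect on model performance: models trained on high-quality data outperform those trained on unvalidated data, even when the high-quality set is smaller.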

Link: https://arxiv.org/abs/2409.16341
Authors: Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach
Keywords: rapidly expanding field, recent research focusing, Training large language, generating synthetic data, large language models
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Training large language models (LLMs) for external tool usage is a rapidly expanding field, with recent research focusing on generating synthetic data to address the shortage of available data. However, the absence of systematic data quality checks poses complications for properly training and testing models. To that end, we propose two approaches for assessing the reliability of data for training LLMs to use external tools. The first approach uses intuitive, human-defined correctness criteria. The second approach uses a model-driven assessment with in-context evaluation. We conduct a thorough evaluation of data quality on two popular benchmarks, followed by an extrinsic evaluation that showcases the impact of data quality on model performance. Our results demonstrate that models trained on high-quality data outperform those trained on unvalidated data, even when trained with a smaller quantity of data. These findings empirically support the significance of assessing and ensuring the reliability of training data for tool-using LLMs.
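
The first approach in the abstract, human-defined correctness criteria, can be illustrated with a small rule-based check. This is only a sketch under assumptions: the abstract does not list the actual criteria, so the three rules below (known tool, required parameters present, no unexpected parameters), the `check_tool_call` function, and the `tool_specs` layout are all hypothetical.

```python
def check_tool_call(call, tool_specs):
    """Return a list of rule violations for one synthetic tool call.

    An empty list means the call passes every rule. The rules are
    illustrative assumptions, not the criteria used in the paper.
    """
    spec = tool_specs.get(call["name"])
    if spec is None:
        return [f"unknown tool: {call['name']}"]

    problems = []
    args = call.get("arguments", {})
    # Rule 1: every required parameter must be supplied.
    for param in spec["required"]:
        if param not in args:
            problems.append(f"missing required parameter: {param}")
    # Rule 2: no parameter outside the declared schema.
    allowed = set(spec["required"]) | set(spec.get("optional", []))
    for param in args:
        if param not in allowed:
            problems.append(f"unexpected parameter: {param}")
    return problems


# Hypothetical tool schema and two synthetic training instances.
tool_specs = {"get_weather": {"required": ["city"], "optional": ["unit"]}}
good_call = {"name": "get_weather", "arguments": {"city": "Paris"}}
bad_call = {"name": "get_weather", "arguments": {"town": "Paris"}}

print(check_tool_call(good_call, tool_specs))
print(check_tool_call(bad_call, tool_specs))
```

Instances that return a non-empty list could be filtered out before training, which is one way to obtain the smaller, validated set the abstract compares against unvalidated data.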

[NLP-61] Exploring the traditional NMT model and Large Language Model for chat translation

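[Quick read]: This paper describes the submissions of the Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task for English-German (en-de) in both directions. The key to the solution is fine-tuning models on chat data and exploring several strategies, including Minimum Bayesian Risk (MBR) decoding and self-training. The experiments show significant gains in certain directions, with MBR self-training giving the best results. The paper also discusses challenges in chat translation and directions for future research.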

Link: https://arxiv.org/abs/2409.16331
Authors: Jinlong Yang, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Yuhao Xie, Yuanchang Luo, Jiawei Zheng, Bin Wei, Hao Yang
Keywords: Translation Services Center, Huawei Translation Services, Services Center, translation shared task, Minimum Bayesian Risk
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 Tables, WMT24

Abstract:This paper describes the submissions of Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task on English↔German (en-de) in both directions. The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayesian Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The paper also discusses the challenges and potential avenues for further research in the field of chat translation.
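
For readers unfamiliar with MBR decoding, the idea is to pick, from a set of candidate translations, the one that agrees most with the other candidates on average. The sketch below illustrates that selection step only. The unigram-F1 utility, the `mbr_select` name, and the sample candidates are assumptions made for illustration; the abstract does not say which utility metric or candidate generation procedure the authors used.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (a stand-in utility function)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(unigram_f1(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical sampled translations of one chat turn.
candidates = [
    "see you tomorrow at ten",
    "see you tomorrow at ten o'clock",
    "I will see him yesterday",
]
print(mbr_select(candidates))
```

In a self-training setup, outputs selected this way could serve as pseudo-references for further fine-tuning; how the paper combines the two is not detailed in the abstract.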

[NLP-62] DeepScore: A Comprehensive Approach to Measuring Quality in AI-Generated Clinical Documentation

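[Quick read]: This paper addresses how to evaluate the quality of AI-generated documentation as medical practitioners adopt generative AI for clinical documentation. The key to the solution is DeepScribe's methodology for assessing and managing note quality, which uses a range of metrics together with the composite "DeepScore", an overall index of quality and accuracy, with the aim of improving patient care documentation through accountability and continuous improvement.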

Link: https://arxiv.org/abs/2409.16307
Authors: Jon Oleson
Keywords: rapidly adopting generative, significant time savings, Medical practitioners, leading to significant, reduced stress
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 9 pages, 5 figures, 6 tables

Abstract:Medical practitioners are rapidly adopting generative AI solutions for clinical documentation, leading to significant time savings and reduced stress. However, evaluating the quality of AI-generated documentation is a complex and ongoing challenge. This paper presents an overview of DeepScribe’s methodologies for assessing and managing note quality, focusing on various metrics and the composite “DeepScore”, an overall index of quality and accuracy. These methodologies aim to enhance the quality of patient care documentation through accountability and continuous improvement.

[NLP-63] Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement EMNLP2024

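[Quick read]: This paper addresses the errors and suboptimal actions that LLM agents produce in complex interactive tasks when process supervision signals are missing. The key to the solution is the Iterative step-level Process Refinement (IPR) framework: step-level rewards are estimated with the Monte Carlo method, and in each iteration the agent's actions are compared with the corresponding steps of the expert trajectory, yielding contrastive action pairs that serve as training data. The authors report improved action efficiency and applicability to diverse models.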

Link: https://arxiv.org/abs/2406.11176
Authors: Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li
Keywords: Large language model, Large language, exhibited exceptional performance, exhibited exceptional, step-level Process Refinement
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Main Conference)

Abstract:Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
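
The two mechanisms named in the abstract, Monte Carlo estimation of step-level rewards and the construction of contrastive action pairs, can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the `toy_rollout` environment, the `margin` threshold, and the function names are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def mc_step_reward(prefix, rollout, n_samples=50, seed=0):
    """Estimate the value of the last step in `prefix` as the mean final
    outcome of `n_samples` rollouts continued from that prefix."""
    rng = random.Random(seed)
    return mean(rollout(prefix, rng) for _ in range(n_samples))


def contrastive_pairs(expert_traj, agent_actions, rollout, margin=0.1):
    """At each step t, compare the agent's action with the expert's action
    after the shared expert prefix. Keep (prefix, chosen, rejected) when the
    expert action's estimated reward is higher by more than `margin`."""
    pairs = []
    for t, (expert_a, agent_a) in enumerate(zip(expert_traj, agent_actions)):
        prefix = expert_traj[:t]
        r_expert = mc_step_reward(prefix + [expert_a], rollout)
        r_agent = mc_step_reward(prefix + [agent_a], rollout)
        if r_expert - r_agent > margin:
            pairs.append((prefix, expert_a, agent_a))
    return pairs


def toy_rollout(prefix, rng):
    """Toy outcome model: any 'bad' action makes the task fail; otherwise
    the task succeeds with probability 0.9."""
    if "bad" in prefix:
        return 0.0
    return 1.0 if rng.random() < 0.9 else 0.0


expert = ["open", "pick", "place"]
agent = ["open", "bad", "place"]
pairs = contrastive_pairs(expert, agent, toy_rollout)
print(pairs)
```

Only the step where the agent deviates produces a pair; such pairs are the kind of data a preference-style training objective could consume, though the abstract does not specify the objective.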

[NLP-64] Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

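[Quick read]: This paper addresses the shortage of labeled data in speech classification, especially in cognitive state classification tasks that require extensive subjective assessment. The key to the solution is a semi-supervised learning (SSL) framework with a multi-view pseudo-labeling method that uses acoustic and linguistic characteristics to select the most confident data for training. Acoustically, unlabeled data are compared with labeled data using the Frechet audio distance, computed from embeddings produced by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels. Data are treated as high-confidence when the pseudo-labels from both views agree and as low-confidence otherwise. A bimodal classifier is then trained to label the low-confidence data iteratively until a predefined criterion is met. Using only 30% of the labeled data, the method performs competitively with fully supervised learning and significantly outperforms two selected baselines.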

Link: https://arxiv.org/abs/2409.16937
Authors: Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
Keywords: extensive subjective assessment, requiring extensive subjective, cognitive state classification, subjective assessment, common challenge
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
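
The agreement rule described in the abstract (pseudo-labels from the acoustic and linguistic views must match for a sample to count as high-confidence) is simple enough to sketch directly. The function name, the dictionary layout, and the example labels are hypothetical; how each view produces its label (Frechet audio distance, LLM prompting) is outside this sketch.

```python
def split_by_agreement(acoustic_labels, linguistic_labels):
    """Split utterances into high- and low-confidence sets.

    An utterance is high-confidence when both views assign the same
    pseudo-label; otherwise (including a missing linguistic label) it is
    low-confidence and left for the bimodal classifier to label later.
    """
    high, low = {}, []
    for utt_id, a_label in acoustic_labels.items():
        if linguistic_labels.get(utt_id) == a_label:
            high[utt_id] = a_label
        else:
            low.append(utt_id)
    return high, low


# Hypothetical pseudo-labels from the two views.
acoustic = {"u1": "happy", "u2": "sad", "u3": "angry"}
linguistic = {"u1": "happy", "u2": "neutral", "u3": "angry"}

high_conf, low_conf = split_by_agreement(acoustic, linguistic)
print(high_conf)
print(low_conf)
```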

[NLP-65] Cross-lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models

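[Quick read]: This paper addresses speech emotion recognition (SER) in cross-lingual settings, in particular how self-supervised learning (SSL) models perform there. The key to the solution is a layer-wise analysis and parameter-efficient fine-tuning strategies, used to study model behavior in monolingual, cross-lingual, and transfer learning scenarios and to compare it with human performance. The study also examines the effect of dialect on cross-lingual SER and finds that, with appropriate knowledge transfer, models can adapt to the target language and reach performance comparable to native speakers. It further shows that models and humans behave differently across emotions.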

Link: https://arxiv.org/abs/2409.16920
Authors: Zhichen Han, Tianqi Geng, Hui Feng, Jiahong Yuan, Korin Richmond, Yuanchao Li
Keywords: Speech Emotion Recognition, Utilizing Self-Supervised Learning, Utilizing Self-Supervised, explored cross-lingual scenarios, Emotion Recognition
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Comments:

Abstract:Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background. Moreover, both humans and models exhibit distinct behaviors across different emotions. These results offer new insights into the cross-lingual SER capabilities of SSL models, underscoring both their similarities to and differences from human emotion perception.

[NLP-66] Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions ICASSP2025

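[Quick read]: This paper addresses the difficulty current emotional text-to-speech (TTS) systems have in mimicking a broad spectrum of human emotions, which stems from the inherent complexity of emotions and from limitations in emotional speech datasets and models. The key to the solution is a TTS framework that controls three emotional dimensions (pleasure, arousal, and dominance) and can synthesize diverse emotional styles without any emotional speech data during TTS training. An emotional attribute predictor is trained using only categorical labels from speech data, in line with psychological research, with anchored dimensionality reduction applied to self-supervised learning (SSL) features. The framework converts text into phonetic tokens with an autoregressive language model and uses pseudo-emotional dimensions to guide the parallel prediction of fine-grained acoustic details. Experiments on LibriTTS show that it synthesizes speech with enhanced naturalness and varied emotional styles.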

Link: https://arxiv.org/abs/2409.16681
Authors: Kun Zhou, You Zhang, Shengkui Zhao, Hao Wang, Zexu Pan, Dianwen Ng, Chong Zhang, Chongjia Ni, Yukun Ma, Trung Hieu Nguyen, Jia Qi Yip, Bin Ma
Keywords: systems face challenges, human emotions due, Current emotional, systems face, human emotions
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: submitted to ICASSP 2025

Abstract:Current emotional text-to-speech (TTS) systems face challenges in mimicking a broad spectrum of human emotions due to the inherent complexity of emotions and limitations in emotional speech datasets and models. This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance, and can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training. We train an emotional attribute predictor using only categorical labels from speech data, aligning with psychological research and incorporating anchored dimensionality reduction on self-supervised learning (SSL) features. The TTS framework converts text inputs into phonetic tokens via an autoregressive language model and uses pseudo-emotional dimensions to guide the parallel prediction of fine-grained acoustic details. Experiments conducted on the LibriTTS dataset demonstrate that our framework can synthesize speech with enhanced naturalness and a variety of emotional styles by effectively controlling emotional dimensions, even without the inclusion of any emotional speech during TTS training.

[NLP-67] Speech Recognition Rescoring with Large Speech-Text Foundation Models

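[Quick read]: This paper addresses how to improve automatic speech recognition (ASR) systems when transcribed speech data is limited. The key to the solution is second-pass rescoring with a multi-modal large language model (LLM), with cross-modal knowledge transfer and discriminative training used to improve the foundation model's rescoring performance. Speech-text foundation models, which draw on large amounts of unlabeled and labeled speech and text data, are used to rescore ASR hypotheses, giving relative improvements of up to 20% over Whisper large ASR and up to 15% over a text-only LLM.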

Link: https://arxiv.org/abs/2409.16654
Authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Aditya Gourav, Yi Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
Keywords: LLM, ability to understand, leveraging large amount, Large, language
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Large language models (LLMs) have demonstrated the ability to understand human language by leveraging large amounts of text data. Automatic speech recognition (ASR) systems are often limited by available transcribed speech data and benefit from second-pass rescoring using an LLM. Recently, multi-modal large language models, particularly speech and text foundational models, have demonstrated strong spoken language understanding. Speech-text foundational models leverage large amounts of unlabelled and labelled data in both speech and text modalities to model human language. In this work, we propose novel techniques to use multi-modal LLMs for ASR rescoring. We also explore discriminative training to further improve the foundational model rescoring performance. We demonstrate that cross-modal knowledge transfer in speech-text LLMs can benefit rescoring. Our experiments demonstrate up to 20% relative improvements over Whisper large ASR and up to 15% relative improvements over text-only LLMs.
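
Second-pass rescoring in general means re-ranking the first-pass n-best list with an additional score. The sketch below shows a common log-linear form of that step; it is an assumption that the paper combines scores this way, and the `rescore` function, the `weight` value, and the toy score table standing in for an LLM are all hypothetical.

```python
def rescore(nbest, lm_score, weight=0.5):
    """Return the hypothesis text that maximizes
    first_pass_log_score + weight * lm_log_score.

    `nbest` is a list of (text, first_pass_log_score) tuples and
    `lm_score` maps a text to a log-probability-like score.
    """
    best_text, _ = max(nbest, key=lambda h: h[1] + weight * lm_score(h[0]))
    return best_text


# Hypothetical first-pass n-best list: the acoustically preferred
# hypothesis is the implausible one.
nbest = [
    ("recognize speech", -4.0),
    ("wreck a nice beach", -3.5),
]

# Toy stand-in for a language model score (not a real LLM call).
toy_lm_scores = {"recognize speech": -2.0, "wreck a nice beach": -6.0}


def toy_lm(text):
    return toy_lm_scores[text]


print(rescore(nbest, toy_lm))
```

With `weight=0.5` the combined scores are -5.0 and -6.5, so the language model overturns the first-pass ranking. Setting `weight=0` recovers the first-pass choice.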

[NLP-68] Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation ICASSP2025

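[Quick read]: This paper addresses multi-aspect speech quality assessment, in particular how a single model can predict mean opinion score (MOS), speaker similarity (SIM), and A/B testing results. The key to the solution is to fine-tune recently introduced auditory large language models (LLMs) with task-specific prompts so that they predict these metrics and also generate natural language descriptions assessing noisiness, distortion, discontinuity, and overall quality, which makes the output more interpretable. Experiments show that auditory LLMs are competitive with state-of-the-art task-specific small models on MOS and SIM prediction and give promising results on A/B testing and natural language description.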

Link: https://arxiv.org/abs/2409.16644
Authors: Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Chao Zhang
Keywords: assessment typically requires, Speech quality assessment, typically requires evaluating, requires evaluating audio, quality assessment typically
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: submitted to ICASSP 2025

Abstract:Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, a commercial model, Google Gemini 1.5 Pro, is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints will be released upon acceptance.

[NLP-69] Towards Within-Class Variation in Alzheimer's Disease Detection from Spontaneous Speech

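[Quick read]: This paper addresses two issues in Alzheimer's disease (AD) detection: within-class variation and instance-level imbalance. Within-class variation refers to the differing degrees of cognitive impairment among individuals with AD; instance-level imbalance refers to the uneven distribution of severity levels in the data. The paper proposes two methods, Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), one for each issue. Experiments show that both significantly improve detection accuracy; further analysis indicates that SoTD draws on the strengths of multiple component models, while InRe substantially alleviates over-fitting.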

Link: https://arxiv.org/abs/2409.16322
Authors: Jiawen Kang, Dongrui Han, Lingwei Meng, Jingyan Zhou, Jinchao Li, Xixin Wu, Helen Meng
Keywords: Alzheimer Disease, promising research area, employs machine learning, machine learning classification, promising research
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Alzheimer’s Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detection tasks lack fine-grained labels, simplistic binary classification may overlook two crucial aspects: within-class differences and instance-level imbalance. The former compels the model to map AD samples with varying degrees of impairment to a single diagnostic label, disregarding certain changes in cognitive function. The latter biases the model towards overrepresented severity levels. This work presents early efforts to address these challenges. We propose two novel methods: Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting the two problems respectively. Experiments on the ADReSS and ADReSSo datasets demonstrate that the proposed methods significantly improve detection accuracy. Further analysis reveals that SoTD effectively harnesses the strengths of multiple component models, while InRe substantially alleviates model over-fitting. These findings provide insights for developing more robust and reliable AD detection models.

[NLP-70] A Literature Review of Keyword Spotting Technologies for Urdu

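[Quick read]: This literature review addresses the particular challenges that Urdu, a low-resource language with complex phonetics, poses for keyword spotting (KWS). It traces the evolution from foundational Gaussian Mixture Models (GMMs) to advanced neural architectures such as deep neural networks and transformers, together with multi-task learning and self-supervised approaches that exploit unlabeled data to improve KWS in multilingual and resource-constrained settings. The review stresses the need for context-specific research that addresses the inherent complexities of languages like Urdu, in support of more inclusive speech technology.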

Link: https://arxiv.org/abs/2409.16317
Authors: Syed Muhammad Aqdas Rizvi
Keywords: Pakistan low-resource language, Pakistan low-resource, literature review surveys, keyword spotting, specifically focusing
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:This literature review surveys the advancements of keyword spotting (KWS) technologies, specifically focusing on Urdu, Pakistan’s low-resource language (LRL), which has complex phonetics. Despite the global strides in speech technology, Urdu presents unique challenges requiring more tailored solutions. The review traces the evolution from foundational Gaussian Mixture Models to sophisticated neural architectures like deep neural networks and transformers, highlighting significant milestones such as integrating multi-task learning and self-supervised approaches that leverage unlabeled data. It examines emerging technologies’ role in enhancing KWS systems’ performance within multilingual and resource-constrained settings, emphasizing the need for innovations that cater to languages like Urdu. Thus, this review underscores the need for context-specific research addressing the inherent complexities of Urdu and similar LRLs and the means of regions communicating through such languages for a more inclusive approach to speech technology.

[NLP-71] How Redundant Is the Transformer Stack in Speech Representation Models?

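[Quick read]: This paper addresses layer redundancy in the transformer stack of self-supervised speech representation models. The key to the solution is layer similarity analysis combined with pruning and knowledge distillation. Using three similarity metrics, the study finds a block-like structure of high similarity, which indicates substantial redundancy between layers. Pruning without post-training removes up to 40% of the transformer layers while retaining over 95% of the model's predictive capacity, and a knowledge distillation method that replaces the entire transformer stack with mimicking layers reduces network size by 95-98% and inference time by up to 94% without considerable performance loss.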

Link: https://arxiv.org/abs/2409.16302
Authors: Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen
Keywords: Self-supervised speech representation, speech representation models, speech representation, representation models, Self-supervised speech
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model’s predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
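
Of the three similarity metrics listed in the abstract, cosine similarity is the simplest to illustrate. The sketch below compares toy per-layer vectors and flags layers that are nearly identical to their predecessor. The vectors, the `threshold` value, and the "flag as redundant" rule are illustrative assumptions; the paper's actual pruning criterion is not described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(layers):
    """Pairwise cosine similarity between all layer representations."""
    return [[cosine(a, b) for b in layers] for a in layers]


def redundant_layers(layers, threshold=0.99):
    """Indices of layers whose output is nearly identical to the previous
    layer's output (a hypothetical pruning heuristic)."""
    return [
        i for i in range(1, len(layers))
        if cosine(layers[i - 1], layers[i]) >= threshold
    ]


# Toy layer outputs forming two blocks: layers 0-1 and layers 2-3.
layers = [
    [1.0, 0.0, 0.0],
    [1.0, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.02],
]

for row in similarity_matrix(layers):
    print([round(x, 3) for x in row])
print(redundant_layers(layers))
```

The printed matrix shows two high-similarity blocks on the diagonal, a small-scale analogue of the block-like structure the abstract reports.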

[NLP-72] Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

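[Quick read]: This paper addresses how to train speech foundation models efficiently under a limited compute budget. The key to the solution is examining the factors in self-supervised learning (SSL) that affect the budget: model architecture, model size, and data size. The results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget, and that the size of the pre-training data remains crucial even with data augmentation, since performance suffers when iterating over limited data. The study also identifies a trade-off between model size and data size, which implies an optimal model size for a given compute budget.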

Link: https://arxiv.org/abs/2409.16295
Authors: Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-yi Lee
Keywords: remains computationally costly, computationally costly, foundation models, speech foundation models, SSL
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: To appear in SLT 2024

Abstract:Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward understanding the training dynamics of speech foundation models. We benchmark SSL objectives in an entirely comparable setting and find that other factors contribute more significantly to the success of SSL. Our results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget. We demonstrate that the size of the pre-training data remains crucial, even with data augmentation during SSL training, as performance suffers when iterating over limited data. Finally, we identify a trade-off between model size and data size, highlighting an optimal model size for a given compute budget.
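Summary: Despite their impressive success, foundation models remain computationally costly to train. This paper studies how to train speech foundation models efficiently with self-supervised learning (SSL) under a limited compute budget. It examines the critical factors that affect the budget: model architecture, model size, and data size. The goal is to take analytical steps toward understanding the training dynamics of speech foundation models. SSL objectives are benchmarked in an entirely comparable setting, and other factors turn out to contribute more to the success of SSL. Under the same compute and parameter budget, slimmer model architectures outperform common small architectures. The size of the pre-training data remains crucial even with data augmentation during SSL training, because performance suffers when iterating over limited data. Finally, there is a trade-off between model size and data size, which points to an optimal model size for a given compute budget.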

[NLP-73] Unsupervised Word Discovery: Boundary Detection with Clustering vs. Dynamic Programming

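[Quick read]: This paper addresses word-level segmentation and clustering of unlabeled speech to build a lexicon. The key to the solution is to predict word boundaries from the dissimilarity between self-supervised features and then cluster the predicted segments into a lexicon. Compared with earlier approaches based on a scoring model and dynamic programming, this simple strategy achieves similar state-of-the-art results on the five-language ZeroSpeech benchmarks while being almost five times faster.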

Link: https://arxiv.org/abs/2409.14486
Authors: Simon Malan, Benjamin van Niekerk, Herman Kamper
Keywords: segmenting unlabeled speech, long-standing problem, problem of segmenting, segmenting unlabeled, unlabeled speech
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 3 figures, 3 tables

Abstract:We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
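Summary: We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering them into a lexicon. Previous methods typically couple a scoring model with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: predict word boundaries from the dissimilarity between adjacent self-supervised features, then cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, the simple approach gives similar state-of-the-art results to the new ES-KMeans+ method while being almost five times faster.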
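
The boundary-detection idea in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the toy `frames` stand in for self-supervised features, cosine dissimilarity is one possible dissimilarity measure, and the `threshold` value and the simple local-peak rule are hypothetical choices. The clustering step that builds the lexicon is omitted.

```python
import math

def cosine_dissimilarity(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def predict_boundaries(frames, threshold=0.5):
    """Return indices i such that a boundary lies between frame i-1 and frame i."""
    scores = [cosine_dissimilarity(frames[i - 1], frames[i])
              for i in range(1, len(frames))]
    boundaries = []
    for j, score in enumerate(scores):
        left = scores[j - 1] if j > 0 else float("-inf")
        right = scores[j + 1] if j + 1 < len(scores) else float("-inf")
        # Keep local peaks of the dissimilarity curve that exceed the threshold.
        if score >= threshold and score >= left and score > right:
            boundaries.append(j + 1)
    return boundaries

def segment(frames, boundaries):
    # Split the frame sequence at the predicted boundaries.
    edges = [0] + boundaries + [len(frames)]
    return [frames[start:end] for start, end in zip(edges, edges[1:])]

# Toy 2-D "features": two frames of one unit, two of another, then one more.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]
boundaries = predict_boundaries(frames)
segments = segment(frames, boundaries)
print(boundaries, [len(s) for s in segments])
```

Each resulting segment would then be pooled into a fixed-size vector and clustered; which pooling and clustering method to use is left open here.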

Artificial Intelligence

[AI-0] Differential Privacy Regularization: Protecting Training Data Through Loss Function Regularization

Link: https://arxiv.org/abs/2409.17144
Authors: Francisco Aguilera-Martínez, Fernando Berzal
Keywords: Training machine learning, networks requires large, neural networks requires, machine learning models, learning models based
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)

Abstract:Training machine learning models based on neural networks requires large datasets, which may contain sensitive information. The models, however, should not expose private information from these datasets. Differentially private SGD [DP-SGD] requires the modification of the standard stochastic gradient descent [SGD] algorithm for training new models. In this short paper, a novel regularization strategy is proposed to achieve the same goal in a more efficient manner.

[AI-1] Attention Prompting on Image for Large Vision-Language Models

Link: https://arxiv.org/abs/2409.17143
Authors: Runpeng Yu, Weihao Yu, Xinchao Wang
Keywords: Large Language Models, Large Vision-Language Models, Large Language, demonstrating impressive performance, Compared with Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Website, see this https URL

Abstract:Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, thus showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs’ capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models’ ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLM on various tasks. Specifically, we generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Then the heatmap simply multiplies the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on MM-Vet and LLaVA-Wild benchmarks, respectively.
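
The overlay step described in the abstract (a heatmap multiplying the pixel values) can be illustrated as follows. This is a rough sketch and not the authors' code: the heatmap is a hard-coded placeholder rather than the output of an auxiliary model such as CLIP, the min-max normalization is an assumption, and the function names are hypothetical.

```python
def normalize_heatmap(heatmap):
    # Rescale the heatmap to [0, 1]; a flat heatmap leaves the image unchanged.
    flat = [value for row in heatmap for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[1.0 for _ in row] for row in heatmap]
    return [[(value - low) / (high - low) for value in row] for row in heatmap]

def apply_attention_prompt(image, heatmap):
    """Multiply every RGB pixel by the heatmap weight at the same position."""
    mask = normalize_heatmap(heatmap)
    return [
        [[round(channel * mask[y][x]) for channel in pixel]
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]

# 2x2 RGB image with identical pixels, and a placeholder query-dependent heatmap.
image = [[[200, 100, 40], [200, 100, 40]],
         [[200, 100, 40], [200, 100, 40]]]
heatmap = [[0.0, 0.25], [0.5, 1.0]]
prompted = apply_attention_prompt(image, heatmap)
print(prompted)
```

Regions the heatmap marks as relevant to the query keep their brightness, while the rest of the image is dimmed before it is passed to the LVLM.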

[AI-2] FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression

Link: https://arxiv.org/abs/2409.17141
Authors: Fazal Mittu, Yihuan Bu, Akshat Gupta, Ashok Devireddy, Alp Eren Ozdarendeli, Anant Singh, Gopala Anumanchipalli
Keywords: language modeling objective, text compression, compression, text compression systems, practical text compression
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:While the language modeling objective has been shown to be deeply connected with compression, it is surprising that modern LLMs are not employed in practical text compression systems. In this paper, we provide an in-depth analysis of neural network and transformer-based compression techniques to answer this question. We compare traditional text compression systems with neural network and LLM-based text compression methods. Although LLM-based systems significantly outperform conventional compression methods, they are highly impractical. Specifically, LLMZip, a recent text compression system using Llama3-8B requires 9.5 days to compress just 10 MB of text, although with huge improvements in compression ratios. To overcome this, we present FineZip - a novel LLM-based text compression system that combines ideas of online memorization and dynamic context to reduce the compression time immensely. FineZip can compress the above corpus in approximately 4 hours compared to 9.5 days, a 54 times improvement over LLMZip and comparable performance. FineZip outperforms traditional algorithmic compression methods with a large margin, improving compression ratios by approximately 50%. With this work, we take the first step towards making lossless text compression with LLMs a reality. While FineZip presents a significant step in that direction, LLMs are still not a viable solution for large-scale text compression. We hope our work paves the way for future research and innovation to solve this problem.
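
The connection between language modeling and compression mentioned at the start of the abstract rests on a standard fact: an entropy coder driven by a predictive model needs about -log2 p bits for a symbol the model assigned probability p. The toy below only illustrates that fact with character-level models; it is not FineZip or LLMZip, and the two models and the sample string are made up for the example.

```python
import math
from collections import Counter

def ideal_code_length_bits(text, probs):
    # Sum of -log2 p(symbol): the cost an ideal entropy coder would approach.
    return sum(-math.log2(probs[ch]) for ch in text)

def unigram_model(text):
    # Character frequencies estimated from the text itself (toy model).
    counts = Counter(text)
    return {ch: count / len(text) for ch, count in counts.items()}

text = "aaaaaaab"
uniform = {ch: 1 / len(set(text)) for ch in set(text)}  # knows nothing
uniform_bits = ideal_code_length_bits(text, uniform)
unigram_bits = ideal_code_length_bits(text, unigram_model(text))
print(f"uniform: {uniform_bits:.2f} bits, unigram: {unigram_bits:.2f} bits")
```

The better the model predicts the next symbol, the fewer bits are needed, which is why a strong LLM can reach high compression ratios at the price of running the model over the whole text.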

[AI-3] Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

Link: https://arxiv.org/abs/2409.17140
Authors: Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Keywords: Multimodal large language, enhancing agents’ performance, large language models, Multimodal large, language models
Subjects: Artificial Intelligence (cs.AI)

Abstract:Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents’ performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs. It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).

[AI-4] Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset

Link: https://arxiv.org/abs/2409.17126
Authors: Andrew Goldberg, Kavish Kondap, Tianshuang Qiu, Zehan Ma, Letian Fu, Justin Kerr, Huang Huang, Kaiyuan Chen, Kuan Fang, Ken Goldberg
Keywords: shown impressive capabilities, creating text, shown impressive, impressive capabilities, capabilities in creating
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 Figures

Abstract:Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the rich history of research in industrial “Design for Assembly”, we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., “giraffe”) and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, and instructions for a robot to build this assembly. The output must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems with minimal human supervision. Blox-Net achieved a Top-1 accuracy of 63.5% in the “recognizability” of its designed assemblies (e.g., resembling a giraffe as judged by a VLM). These designs, after automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations with human intervention only during reset prior to assembly. Surprisingly, this entire design process from textual word (“giraffe”) to reliable physical assembly is performed with zero human intervention.

[AI-5] On-orbit Servicing for Spacecraft Collision Avoidance With Autonomous Decision Making

Link: https://arxiv.org/abs/2409.17125
Authors: Susmitha Patnala, Adam Abdin
Keywords: autonomous On-Orbit Servicing, On-Orbit Servicing, mission to assist, develops an AI-based, Reinforcement Learning
Subjects: Artificial Intelligence (cs.AI)
Comments: The first joint European Space Agency SPAICE Conference / IAA Conference on AI in and for Space

Abstract:This study develops an AI-based implementation of an autonomous On-Orbit Servicing (OOS) mission to assist with spacecraft collision avoidance maneuvers (CAMs). We propose an autonomous 'servicer' trained with Reinforcement Learning (RL) to detect potential collisions between a target satellite and space debris, rendezvous and dock with endangered satellites, and execute an optimal CAM. The RL model integrates collision risk estimates, satellite specifications, and debris data to generate an optimal maneuver matrix for OOS rendezvous and collision prevention. We employ the Cross-Entropy algorithm to find optimal decision policies efficiently. Initial results demonstrate the feasibility of autonomous robotic OOS for collision avoidance services, focusing on one servicer spacecraft to one endangered satellite scenario. However, merging spacecraft rendezvous and optimal CAM presents significant complexities. We discuss design challenges and critical parameters for the successful implementation of the framework presented through a case study.

[AI-6] Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Link: https://arxiv.org/abs/2409.17115
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
Keywords: numerous rules developed, Large language model, Large language, resulting in numerous, developed to date
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 13 figures, 34 tables

Abstract:Large language model pre-training has traditionally relied on human experts to craft heuristics for improving the corpora quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform either original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens to be comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with 100B corpus, models, and sharing all training and implementation details for reproducible research and future innovation. Code: this https URL

[AI-7] Unveiling Ontological Commitment in Multi-Modal Foundation Models ECAI2024

Link: https://arxiv.org/abs/2409.17109
Authors: Mert Keser, Gesina Schwalbe, Niki Amini-Naieni, Matthias Rottmann, Alois Knoll
Keywords: corner stone, models, Ontological commitment, foundation models, concepts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Qualitative Reasoning Workshop 2024 (QR2024) colocated with ECAI2024, camera-ready submission; first two authors contributed equally; 10 pages, 4 figures, 3 tables

Abstract:Ontological commitment, i.e., the concepts, relations, and assumptions used, is a cornerstone of qualitative reasoning (QR) models. The state of the art for processing raw inputs, though, is deep neural networks (DNNs), nowadays often based on multimodal foundation models. These automatically learn rich representations of concepts and respective reasoning. Unfortunately, the learned qualitative knowledge is opaque, preventing easy inspection, validation, or adaptation against available QR models. So far, it is possible to associate pre-defined concepts with latent representations of DNNs, but extractable relations are mostly limited to semantic similarity. As a next step towards QR for validation and verification of DNNs, we propose a method that extracts the learned superclass hierarchy from a multimodal DNN for a given set of leaf concepts. Under the hood we (1) obtain leaf concept embeddings using the DNN’s textual input modality; (2) apply hierarchical clustering to them, exploiting the fact that DNNs encode semantic similarities via vector distances; and (3) label the resulting parent concepts using search in available ontologies from QR. An initial evaluation study shows that meaningful ontological class hierarchies can be extracted from state-of-the-art foundation models. Furthermore, we demonstrate how to validate and verify a DNN’s learned representations against given ontologies. Lastly, we discuss potential future applications in the context of QR.
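
Step (2) of the method in the abstract, hierarchical clustering of leaf concept embeddings, can be sketched as below. This is a minimal illustration under assumptions, not the paper's pipeline: the three-dimensional `embeddings` are invented stand-ins for text embeddings from a multimodal model, average linkage with cosine distance is one possible choice, and the ontology-based labeling of parent nodes in step (3) is left out.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def agglomerate(embeddings):
    """Average-linkage agglomerative clustering; returns a nested tuple of names."""
    # Each cluster is (member names, tree built so far).
    clusters = [((name,), name) for name in embeddings]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1[0] for b in c2[0]]
        total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

# Hypothetical leaf-concept embeddings (real ones would come from the model).
embeddings = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "car": [0.0, 0.1, 1.0],
    "truck": [0.1, 0.0, 0.9],
}
tree = agglomerate(embeddings)
print(tree)
```

Each inner tuple corresponds to an unnamed parent concept; in the paper's setting these parents would then be labeled by searching an ontology (for example, an "animal" and a "vehicle" node for the toy data here).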

[AI-8] Accumulator-Aware Post-Training Quantization

Link: https://arxiv.org/abs/2409.17092
Authors: Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab
Keywords: investigated low-precision accumulation, studies have investigated, investigated low-precision, PTQ algorithms, recent studies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)

Abstract:Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.
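
To make "overflow avoidance" concrete, the sketch below checks a conservative worst-case condition for a single dot product: if the inputs are unsigned integers of a given bit width, the accumulated sum can never exceed the L1 norm of the integer weights times the largest input. This is a generic sufficient condition shown for illustration, not the AXE algorithm; the function names and the example weights are hypothetical.

```python
def l1_budget(acc_bits, input_bits):
    """Largest weight L1 norm that cannot overflow a signed accumulator,
    assuming unsigned inputs in [0, 2**input_bits - 1]."""
    max_accumulator = 2 ** (acc_bits - 1) - 1
    max_input = 2 ** input_bits - 1
    return max_accumulator // max_input

def is_overflow_safe(weights, acc_bits, input_bits):
    # Worst case: every input takes its maximum value and all signs align.
    return sum(abs(w) for w in weights) <= l1_budget(acc_bits, input_bits)

weights = [3, -2, 7, -1]  # integer (already quantized) weights, L1 norm 13
print(l1_budget(16, 8), is_overflow_safe(weights, 16, 8))
print(l1_budget(12, 8), is_overflow_safe(weights, 12, 8))
```

With 8-bit inputs, a 16-bit accumulator tolerates an L1 norm of up to 128, while a 12-bit accumulator tolerates only 8, so the same weights are safe in the first case and not in the second. An accumulator-aware quantizer has to pick weights that respect such a budget while losing as little accuracy as possible.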

[AI-9] Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification

Link: https://arxiv.org/abs/2409.17091
Authors: Xinrui Zhou, Yuhao Huang, Haoran Dou, Shijing Chen, Ao Chang, Jia Liu, Weiran Long, Jian Zheng, Erjiao Xu, Jie Ren, Ruobing Huang, Jun Cheng, Wufeng Xue, Dong Ni
Keywords: labor-intensive annotation processes, annotation processes hinder, deep models, limited availability, availability of large-scale
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 7 tables

Abstract:In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue, having been proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation, and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases and severely limiting the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantic- and sequential-customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained on 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-domain conditions.

[AI-10] The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification

Link: https://arxiv.org/abs/2409.17069
Authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Keywords: objective perceptual metrics, perceptual metrics, subjective quality, approximated with objective, natural signals
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: text overlap with arXiv:2312.03455

Abstract:The subjective quality of natural signals can be approximated with objective perceptual metrics. Designed to approximate the perceptual behaviour of human observers, perceptual metrics often reflect structures found in natural signals and neurological pathways. Models trained with perceptual metrics as loss functions can capture perceptually meaningful features from the structures held within these metrics. We demonstrate that using features extracted from autoencoders trained with perceptual losses can improve performance on music understanding tasks, i.e. genre classification, over using these metrics directly as distances when learning a classifier. This result suggests improved generalisation to novel signals when using perceptual metrics as loss functions for representation learning.

[AI-11] VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Link: https://arxiv.org/abs/2409.17066
Authors: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
Keywords: Large Language Models, Large Language, Scaling model size, size significantly challenges, model size significantly
Subjects: Artificial Intelligence (cs.AI)

Abstract:Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01 - 0.34 on LLaMA-2, 0.38 - 0.68 on Mistral-7B, 4.41 - 7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79 - 1.5% on LLaMA-2, 1% on Mistral-7B, 11 - 22% on LLaMA-3 on QA tasks. We only utilize 10.4 - 18.6% of the quantization algorithm execution time, resulting in a 1.6 - 1.8× increase in inference throughput compared to SOTA.
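
The phrase "compressing vectors into indices using lookup tables" can be illustrated with plain nearest-neighbor vector quantization. This sketch is not VPTQ: it leaves out the second-order optimization, codebook initialization, and residual or outlier handling from the abstract, and the codebook and weights are made-up values. It assumes the number of weights is divisible by the vector dimension.

```python
def nearest_index(vector, codebook):
    # Index of the codebook entry with the smallest squared Euclidean distance.
    def squared_distance(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k]))

def quantize(weights, codebook, dim):
    """Split the weights into dim-sized vectors and store one index per vector."""
    vectors = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest_index(v, codebook) for v in vectors]

def dequantize(indices, codebook):
    # Look every index up in the codebook and flatten the result.
    return [value for k in indices for value in codebook[k]]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.1, -0.8, 1.2, 0.1, -0.1, 1.1, -0.9]
indices = quantize(weights, codebook, dim=2)
restored = dequantize(indices, codebook)
print(indices, restored)
```

With 4 codebook entries and 2 weights per vector, each index costs 2 bits, i.e. 1 bit per weight before counting the codebook itself. This is how vector quantization reaches bit widths that scalar quantization struggles with.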

[AI-12] Benchmarking Domain Generalization Algorithms in Computational Pathology

Link: https://arxiv.org/abs/2409.17063
Authors: Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot
Keywords: shown immense promise, unseen data due, computational pathology, Deep learning, Deep learning models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.

[AI-13] DRIM: Learning Disentangled Representations from Incomplete Multimodal Healthcare Data

Link: https://arxiv.org/abs/2409.17055
Authors: Lucas Robinet, Ahmad Berjaoui, Ziad Kheil, Elizabeth Cohen-Jonathan Moyal
Keywords: Real-life medical data, advanced deep learning, deep learning models, learning models capable, Real-life medical
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Real-life medical data is often multimodal and incomplete, fueling the growing need for advanced deep learning models capable of integrating them efficiently. The use of diverse modalities, including histopathology slides, MRI, and genetic data, offers unprecedented opportunities to improve prognosis prediction and to unveil new treatment pathways. Contrastive learning, widely used for deriving representations from paired data in multimodal tasks, assumes that different views contain the same task-relevant information and leverages only shared information. This assumption becomes restrictive when handling medical data since each modality also harbors specific knowledge relevant to downstream tasks. We introduce DRIM, a new multimodal method for capturing these shared and unique representations, despite data sparsity. More specifically, given a set of modalities, we aim to encode a representation for each one that can be divided into two components: one encapsulating patient-related information common across modalities and the other, encapsulating modality-specific details. This is achieved by increasing the shared information among different patient modalities while minimizing the overlap between shared and unique components within each modality. Our method outperforms state-of-the-art algorithms on glioma patients survival prediction tasks, while being robust to missing modalities. To promote reproducibility, the code is made publicly available at this https URL

[AI-14] Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia

Link: https://arxiv.org/abs/2409.17054
Authors: Azmul Asmar Irfan, Nur Ahmad Khatim, Mansur M. Arief
Keywords: key issues contributing, inefficiency in Puskesmas, key issues, issues contributing, contributing to inefficiency
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:One of the key issues contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors need to conduct thorough consultations, which include diagnosing the patient’s condition, providing treatment advice, and transcribing detailed notes into medical records. In regions with diverse linguistic backgrounds, doctors often have to ask clarifying questions, further prolonging the process. While diagnosing is essential, transcription and summarization can often be automated using AI to improve time efficiency and help doctors enhance care quality and enable early diagnosis and intervention. This paper proposes a solution using a localized large language model (LLM) to transcribe, translate, and summarize doctor-patient conversations. We utilize the Whisper model for transcription and GPT-3 to summarize them into the ePuskesmas medical records format. This system is implemented as an add-on to an existing web browser extension, allowing doctors to fill out patient forms while talking. By leveraging this solution for real-time transcription, translation, and summarization, doctors can improve the turnaround time for patient care while enhancing the quality of records, which become more detailed and insightful for future visits. This innovation addresses challenges like overcrowded facilities and the administrative burden on healthcare providers in Indonesia. We believe this solution will help doctors save time, provide better care, and produce more accurate medical records, representing a significant step toward modernizing healthcare and ensuring patients receive timely, high-quality care, even in resource-constrained settings.

[AI-15] ControlCity: A Multimodal Diffusion Model Based Approach for Accurate Geospatial Data Generation and Urban Morphology Analysis

Link: https://arxiv.org/abs/2409.17049
Authors: Fangshuo Zhou, Huaxia Li, Rui Hu, Sensen Wu, Hailin Feng, Zhenhong Du, Liuchang Xu
Keywords: Volunteer Geographic Information, building footprint data, Volunteer Geographic, Geographic Information, data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Volunteer Geographic Information (VGI), with its rich variety, large volume, rapid updates, and diverse sources, has become a critical source of geospatial data. However, VGI data from platforms like OSM exhibit significant quality heterogeneity across different data types, particularly with urban building data. To address this, we propose a multi-source geographic data transformation solution, utilizing accessible and complete VGI data to assist in generating urban building footprint data. We also employ a multimodal data generation framework to improve accuracy. First, we introduce a pipeline for constructing an ‘image-text-metadata-building footprint’ dataset, primarily based on road network data and supplemented by other multimodal data. We then present ControlCity, a geographic data transformation method based on a multimodal diffusion model. This method first uses a pre-trained text-to-image model to align text, metadata, and building footprint data. An improved ControlNet further integrates road network and land-use imagery, producing refined building footprint data. Experiments across 22 global cities demonstrate that ControlCity successfully simulates real urban building patterns, achieving state-of-the-art performance. Specifically, our method achieves an average FID score of 50.94, reducing error by 71.01% compared to leading methods, and a MIoU score of 0.36, an improvement of 38.46%. Additionally, our model excels in tasks like urban morphology transfer, zero-shot city generation, and spatial data completeness assessment. In the zero-shot city task, our method accurately predicts and generates similar urban structures, demonstrating strong generalization. This study confirms the effectiveness of our approach in generating urban building footprint data and capturing complex city characteristics.

[AI-16] GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design

Link: https://arxiv.org/abs/2409.17045
Authors: Phillip Mueller, Sebastian Mueller, Lars Mikelsons
Keywords: enabling Deep Generative, Deep Generative Models, Deep Generative, enabling Deep, automate data labeling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We provide a dataset for enabling Deep Generative Models (DGMs) in engineering design and propose methods to automate data labeling by utilizing large-scale foundation models. GeoBiked is curated to contain 4,355 bicycle images, annotated with structural and technical features, and is used to investigate two automated labeling techniques: the utilization of consolidated latent features (Hyperfeatures) from image-generation models to detect geometric correspondences (e.g. the position of the wheel center) in structural images, and the generation of diverse text descriptions for structural images. GPT-4o, a vision-language model (VLM), is instructed to analyze images and produce diverse descriptions aligned with the system prompt. By representing technical images as Diffusion-Hyperfeatures, drawing geometric correspondences between them is possible. The detection accuracy of geometric points in unseen samples is improved by presenting multiple annotated source images. GPT-4o has sufficient capabilities to generate accurate descriptions of technical images. Grounding the generation only on images leads to diverse descriptions but causes hallucinations, while grounding it on categorical labels restricts the diversity. Using both as input balances creativity and accuracy. Successfully using Hyperfeatures for geometric correspondence suggests that this approach can be used for general point-detection and annotation tasks in technical images. Labeling such images with text descriptions using VLMs is possible, but dependent on the model's detection capabilities, careful prompt engineering, and the selection of input information. Applying foundation models in engineering design is largely unexplored. We aim to bridge this gap with a dataset to explore training, finetuning and conditioning DGMs in this field and by suggesting approaches to bootstrap foundation models to process technical images.

[AI-17] How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

Link: https://arxiv.org/abs/2409.17044
Authors: Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane
Keywords: Large Language Models, Large Language, Speech Foundational Model, driven research efforts, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundational Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.

[AI-18] Counterfactual Token Generation in Large Language Models

Link: https://arxiv.org/abs/2409.17027
Authors: Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Keywords: Maelstrom Fury, Captain Lyra stood, trusty ship, endless sea, Captain Lyra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:“Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom’s Fury, gazing out at the endless sea. […] Lyra’s eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself.” Although this story, generated by a large language model, is captivating, one may wonder – how would the story have unfolded if the model had chosen “Captain Maeve” as the protagonist instead? We cannot know. State-of-the-art large language models are stateless – they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-instruct and conduct both qualitative and quantitative analyses of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
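The Gumbel-Max idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the logits, and the helper names `gumbel_noise` and `gumbel_max_sample` are hypothetical. The sketch only shows the core mechanism, namely that a token is chosen as the argmax of logits plus Gumbel noise, and that reusing the same noise under different logits gives a counterfactual choice.

```python
import math
import random

def gumbel_noise(rng, n):
    """Draw n Gumbel(0, 1) samples via inverse transform: -log(-log(U))."""
    # Clamp U away from 0 so that log(U) is always defined.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(n)]

def gumbel_max_sample(logits, noise):
    """Return argmax_i (logit_i + g_i); with fresh Gumbel noise this samples softmax(logits)."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
vocab = ["Lyra", "Maeve", "Finn"]          # hypothetical candidate tokens
noise = gumbel_noise(rng, len(vocab))      # exogenous noise, stored for reuse

factual_logits = [2.0, 1.0, 0.5]           # hypothetical next-token logits
counterfactual_logits = [1.0, 2.0, 0.5]    # hypothetical logits after an intervention

# Same noise, different logits: the second draw is the counterfactual token.
factual = gumbel_max_sample(factual_logits, noise)
counterfactual = gumbel_max_sample(counterfactual_logits, noise)
print(vocab[factual], vocab[counterfactual])
```

Storing the noise alongside each sampled token is what makes the counterfactual well defined: the randomness is held fixed while only the model's distribution changes.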

[AI-19] AI-Driven Risk-Aware Scheduling for Active Debris Removal Missions

Link: https://arxiv.org/abs/2409.17012
Authors: Antoine Poupon, Hugo de Rohan Willner, Pierre Nikitits, Adam Abdin
Keywords: Low Earth Orbit, Earth Orbit, Low Earth, Orbital Transfer Vehicles, represents a significant
Categories: Artificial Intelligence (cs.AI)

Abstract:The proliferation of debris in Low Earth Orbit (LEO) represents a significant threat to space sustainability and spacecraft safety. Active Debris Removal (ADR) has emerged as a promising approach to address this issue, utilising Orbital Transfer Vehicles (OTVs) to facilitate debris deorbiting, thereby reducing future collision risks. However, ADR missions are substantially complex, necessitating accurate planning to make the missions economically viable and technically effective. Moreover, these servicing missions require a high level of autonomous capability to plan under evolving orbital conditions and changing mission requirements. In this paper, an autonomous decision-planning model based on Deep Reinforcement Learning (DRL) is developed to train an OTV to plan optimal debris removal sequencing. It is shown that using the proposed framework, the agent can find optimal mission plans and learn to update the planning autonomously to include risk handling of debris with high collision risk.

[AI-20] Models Can and Should Embrace the Communicative Nature of Human-Generated Math

Link: https://arxiv.org/abs/2409.17005
Authors: Sasha Boguraev, Ben Lipkin, Leonie Weissweiler, Kyle Mahowald
Keywords: idealized mathematical entities, language corpora reflect, natural language corpora, corpora reflect, idealized mathematical
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Math is constructed by people for people: just as natural language corpora reflect not just propositions but the communicative goals of language users, the math data that models are trained on reflects not just idealized mathematical entities but rich communicative intentions. While there are important advantages to treating math in a purely symbolic manner, we here hypothesize that there are benefits to treating math as situated linguistic communication and that language models are well suited for this goal, in ways that are not fully appreciated. We illustrate these points with two case studies. First, we ran an experiment in which we found that language models interpret the equals sign in a humanlike way – generating systematically different word problems for the same underlying equation arranged in different ways. Second, we found that language models prefer proofs to be ordered in naturalistic ways, even though other orders would be logically equivalent. We advocate for AI systems that learn from and represent the communicative intentions latent in human-generated math.

[AI-21] INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Link: https://arxiv.org/abs/2409.16997
Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Lei Su, Tong Yang
Keywords: self-attention module faces, large language models, language models, self-attention module, sequence length
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats like INT4, etc. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data formats.
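To make "token-level INT8 quantization" concrete, here is a minimal pure-Python sketch of symmetric per-token quantization applied to the attention scores Q·Kᵀ. It is an illustration under assumptions, not the paper's GPU kernel: the function names are hypothetical, the scale rule (max absolute value divided by 127) is one common choice, and the softmax and value-matrix steps are not covered.

```python
def quantize_rows(mat):
    """Symmetric per-row (per-token) INT8 quantization: x ~= scale * q, q in [-127, 127]."""
    q_rows, scales = [], []
    for row in mat:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales

def int8_scores(Q, K):
    """Approximate Q @ K^T with integer dot products plus per-token rescaling."""
    q_int, q_scale = quantize_rows(Q)
    k_int, k_scale = quantize_rows(K)
    return [
        [sum(a * b for a, b in zip(qi, kj)) * sq * sk
         for kj, sk in zip(k_int, k_scale)]
        for qi, sq in zip(q_int, q_scale)
    ]

Q = [[1.0, -2.0], [0.5, 0.25]]   # two query tokens, head dimension 2
K = [[1.0, 0.0], [0.0, 1.0]]     # two key tokens
print(int8_scores(Q, K))         # close to [[1.0, -2.0], [0.5, 0.25]]
```

Because each token carries its own scale, a single outlier token does not degrade the precision of every other row, which is the usual motivation for per-token rather than per-tensor scaling.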

[AI-22] Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Link: https://arxiv.org/abs/2409.16986
Authors: Chi Zhang, Huaping Zhong, Kuan Zhang, Chengliang Chai, Rui Wang, Xinlin Zhuang, Tianyi Bai, Jiantao Qiu, Lei Cao, Ye Yuan, Guoren Wang, Conghui He
Keywords: large language models, Data, pre-training large language, great significance, large language
Categories: Artificial Intelligence (cs.AI)

Abstract:Data selection is of great significance in pre-training large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e., a high influence score indicates that incorporating this instance into the training set is likely to enhance the model performance. Consequently, they select the top-k instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model’s ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pre-training results. In particular, noting that attention layers capture extensive semantic details, we have adapted the accelerated iHVP computation methods for attention layers, enhancing our ability to evaluate the influence of data, i.e., its quality. For diversity, Quad clusters the dataset so that instances within each cluster are similar and instances across different clusters are diverse. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. To determine which clusters to select, we utilize the classic Multi-Armed Bandit method, treating each cluster as an arm. This approach favors clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby well balancing between quality and diversity.
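The cluster-as-arm selection described in the abstract can be sketched with a standard UCB1 bandit. This is an assumption for illustration: the abstract only says a classic Multi-Armed Bandit method is used, so the UCB1 score, the exploration constant `c`, and the simulated influence values below are hypothetical stand-ins rather than the paper's exact procedure.

```python
import math
import random

def ucb_select(mean_reward, pulls, total_pulls, c=1.0):
    """Pick the arm with the largest upper confidence bound; unpulled arms go first."""
    best, best_score = None, -math.inf
    for arm, (mu, n) in enumerate(zip(mean_reward, pulls)):
        if n == 0:
            return arm
        # High mean (quality) or few pulls (diversity) both raise the score.
        score = mu + c * math.sqrt(2.0 * math.log(total_pulls) / n)
        if score > best_score:
            best, best_score = arm, score
    return best

def run_bandit(cluster_influence, rounds=200, seed=0):
    """cluster_influence[k] is a hypothetical mean influence score for cluster k."""
    rng = random.Random(seed)
    k = len(cluster_influence)
    mean_reward, pulls = [0.0] * k, [0] * k
    for t in range(1, rounds + 1):
        arm = ucb_select(mean_reward, pulls, t)
        # Stand-in for "sample a few instances from the cluster and estimate influence".
        reward = cluster_influence[arm] + rng.gauss(0.0, 0.1)
        pulls[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm]  # running mean
    return pulls

print(run_bandit([0.1, 0.9, 0.5]))  # number of times each cluster was selected
```

The bonus term shrinks as a cluster is selected more often, so rarely visited clusters are periodically revisited even when their current influence estimate is low.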

[AI-23] AXCEL: Automated eXplainable Consistency Evaluation using LLMs

Link: https://arxiv.org/abs/2409.16984
Authors: P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan
Keywords: Large Language Models, Large Language, Language Models, Natural Language Inference, generated text responses
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.

[AI-24] Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI

Link: https://arxiv.org/abs/2409.16978
Authors: Elisa Nguyen, Johannes Bertram, Evgenii Kortukov, Jean Y. Song, Seong Joon Oh
Keywords: Training Data Attribution, aims to make, make AI understandable, criticised for relying, mathematical soundness
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:While Explainable AI (XAI) aims to make AI understandable and useful to humans, it has been criticised for relying too much on formalism and solutionism, focusing more on mathematical soundness than user needs. We propose an alternative to this bottom-up approach inspired by design thinking: the XAI research community should adopt a top-down, user-focused perspective to ensure user relevance. We illustrate this with a relatively young subfield of XAI, Training Data Attribution (TDA). With the surge in TDA research and growing competition, the field risks repeating the same patterns of solutionism. We conducted a needfinding study with a diverse group of AI practitioners to identify potential user needs related to TDA. Through interviews (N=10) and a systematic survey (N=31), we uncovered new TDA tasks that are currently largely overlooked. We invite the TDA and XAI communities to consider these novel tasks and improve the user relevance of their research outcomes.

[AI-25] Decoding Large-Language Models: A Systematic Overview of Socio-Technical Impacts, Constraints, and Emerging Questions

Link: https://arxiv.org/abs/2409.16974
Authors: Zeyneb N. Kaya, Souvick Ghosh
Keywords: large language models, natural language processing, language models, language processing, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures, preprint submitted to journal

Abstract:There have been rapid advancements in the capabilities of large language models (LLMs) in recent years, greatly revolutionizing the field of natural language processing (NLP) and artificial intelligence (AI) to understand and interact with human language. Therefore, in this work, we conduct a systematic investigation of the literature to identify the prominent themes and directions of LLM developments, impacts, and limitations. Our findings illustrate the aims, methodologies, limitations, and future directions of LLM research. It includes responsible development considerations, algorithmic improvements, ethical challenges, and societal implications of LLM development. Overall, this paper provides a rigorous and comprehensive overview of current research in LLM and identifies potential directions for future development. The article highlights the application areas that could have a positive impact on society along with the ethical considerations.

[AI-26] Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

Link: https://arxiv.org/abs/2409.16973
Authors: Rafael Mendoza, Isabella Cruz, Richard Liu, Aarav Deshmukh, David Williams, Jesscia Peng, Rohan Iyer
Keywords: Large language models, Large language, interact with technology, significant challenge, individual user preferences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: First ASLS

Abstract:Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.

[AI-27] Informed deep hierarchical classification: a non-standard analysis inspired approach

Link: https://arxiv.org/abs/2409.16956
Authors: Lorenzo Fiaschi, Marco Cococcioni
Keywords: rigid parent-child structure, multiple labels organized, deep neural network, parent-child structure, work proposes
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic (math.LO)

Abstract:This work proposes a novel approach to the deep hierarchical classification task, i.e., the problem of classifying data according to multiple labels organized in a rigid parent-child structure. It consists in a multi-output deep neural network equipped with specific projection operators placed before each output layer. The design of such an architecture, called lexicographic hybrid deep neural network (LH-DNN), has been possible by combining tools from different and quite distant research fields: lexicographic multi-objective optimization, non-standard analysis, and deep learning. To assess the efficacy of the approach, the resulting network is compared against the B-CNN, a convolutional neural network tailored for hierarchical classification tasks, on the CIFAR10, CIFAR100 (where it has been originally and recently proposed before being adopted and tuned for multiple real-world applications) and Fashion-MNIST benchmarks. Evidence states that an LH-DNN can achieve comparable if not superior performance, especially in the learning of the hierarchical relations, in the face of a drastic reduction of the learning parameters, training epochs, and computational time, without the need for ad-hoc loss functions weighting values.

[AI-28] Dynamic Obstacle Avoidance through Uncertainty-Based Adaptive Planning with Diffusion

Link: https://arxiv.org/abs/2409.16950
Authors: Vineet Punyamoorty, Pascal Jutras-Dubé, Ruqi Zhang, Vaneet Aggarwal, Damon Conover, Aniket Bera
Keywords: framing reinforcement learning, sequence modeling problem, modeling problem, recent work, framing reinforcement
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While replanning at every timestep could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences – a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts replanning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant replanning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.
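The idea of replanning only when action predictions are uncertain can be sketched as a simple trigger rule. The abstract does not say how uncertainty is measured, so the sketch assumes it is the variance across several sampled actions for the current step; the scalar actions, the threshold value, and the function names are all hypothetical.

```python
from statistics import pvariance

def should_replan(action_samples, threshold):
    """Replan when sampled actions for the current step disagree too much."""
    return pvariance(action_samples) > threshold

def replanning_steps(samples_per_step, threshold=0.05):
    """Return the timesteps at which a new plan would be generated."""
    return [t for t, samples in enumerate(samples_per_step)
            if should_replan(samples, threshold)]

# Hypothetical sampled actions (scalars for simplicity) at three timesteps.
trace = [
    [1.0, 1.0, 1.0],   # samples agree: keep following the cached plan
    [0.0, 1.0, 2.0],   # samples disagree: e.g. an obstacle has moved
    [0.5, 0.5, 0.6],   # small disagreement: still below the threshold
]
print(replanning_steps(trace))
```

Compared with replanning at every timestep, the expensive diffusion sampling for a full trajectory would run only at the flagged steps.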

[AI-29] Setting the AI Agenda – Evidence from Sweden in the ChatGPT Era ECAI2024

Link: https://arxiv.org/abs/2409.16946
Authors: Bastiaan Bruinsma, Annika Fredén, Kajsa Hansson, Moa Johansson, Pasko Kisić-Merino, Denitsa Saynova
Keywords: Artificial Intelligence, meta-debate in Sweden, release of ChatGPT, paper examines, Intelligence
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This paper is part of the Second AEQUITAS Workshop on Fairness and Bias in AI | co-located with ECAI 2024, October 19–24, 2024, Santiago de Compostela, Spain

Abstract:This paper examines the development of the Artificial Intelligence (AI) meta-debate in Sweden before and after the release of ChatGPT. From the perspective of agenda-setting theory, we propose that it is an elite outside of party politics that is leading the debate – i.e. that the politicians are relatively silent when it comes to this rapid development. We also suggest that the debate has become more substantive and risk-oriented in recent years. To investigate this claim, we draw on an original dataset of elite-level documents from the early 2010s to the present, using op-eds published in a number of leading Swedish newspapers. By conducting a qualitative content analysis of these materials, our preliminary findings lend support to the expectation that an academic, rather than a political elite is steering the debate.

[AI-30] Go-SLAM: Grounded Object Segmentation and Localization with Gaussian Splatting SLAM

Link: https://arxiv.org/abs/2409.16944
Authors: Phu Pham, Dipam Patel, Damon Conover, Aniket Bera
Keywords: Gaussian Splatting SLAM, Splatting SLAM, embedding object-level information, Gaussian Splatting, SLAM to reconstruct
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We introduce Go-SLAM, a novel framework that utilizes 3D Gaussian Splatting SLAM to reconstruct dynamic environments while embedding object-level information within the scene representations. This framework employs advanced object segmentation techniques, assigning a unique identifier to each Gaussian splat that corresponds to the object it represents. Consequently, our system facilitates open-vocabulary querying, allowing users to locate objects using natural language descriptions. Furthermore, the framework features an optimal path generation module that calculates efficient navigation paths for robots toward queried objects, considering obstacles and environmental uncertainties. Comprehensive evaluations in various scene settings demonstrate the effectiveness of our approach in delivering high-fidelity scene reconstructions, precise object segmentation, flexible object querying, and efficient robot path planning. This work represents an additional step forward in bridging the gap between 3D scene reconstruction, semantic object understanding, and real-time environment interactions.

[AI-31] Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model

Link: https://arxiv.org/abs/2409.16938
Authors: Hongliang Zhong, Can Wang, Jingbo Zhang, Jing Liao
Keywords: versatile scene recreation, achieving versatile scene, scene recreation, achieving versatile, versatile scene
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Generating and inserting new objects into 3D content is a compelling approach for achieving versatile scene recreation. Existing methods, which rely on SDS optimization or single-view inpainting, often struggle to produce high-quality results. To address this, we propose a novel method for object insertion in 3D content represented by Gaussian Splatting. Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained stable video diffusion model to facilitate view-consistent object inpainting. Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation. After generating the multi-view inpainted results, we further propose a mask-aware 3D reconstruction technique to refine Gaussian Splatting reconstruction from these sparse inpainted views. By leveraging these fabricate techniques, our approach yields diverse results, ensures view-consistent and harmonious insertions, and produces better object quality. Extensive experiments demonstrate that our approach outperforms existing methods.

[AI-32] Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents

Link: https://arxiv.org/abs/2409.16934
Authors: Emanuela Boros, Maud Ehrmann
Keywords: named entity recognition, Transformer architecture, entity recognition, paper investigates, investigates the presence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This paper investigates the presence of OCR-sensitive neurons within the Transformer architecture and their influence on named entity recognition (NER) performance on historical documents. By analysing neuron activation patterns in response to clean and noisy text inputs, we identify and then neutralise OCR-sensitive neurons to improve model performance. Based on two open access large language models (Llama2 and Mistral), experiments demonstrate the existence of OCR-sensitive regions and show improvements in NER performance on historical newspapers and classical commentaries, highlighting the potential of targeted neuron modulation to improve models’ performance on noisy text.
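One simple way to picture "identify and then neutralise OCR-sensitive neurons" is to score each neuron by how much its activation changes between paired clean and noisy inputs, then zero out the top-scoring ones. This is a hedged sketch only: the scoring rule (mean absolute difference), zeroing as the neutralisation step, and all function names are assumptions, not the procedure used in the paper.

```python
def ocr_sensitivity(clean_acts, noisy_acts):
    """Mean absolute activation change per neuron between paired clean/noisy inputs."""
    n_examples = len(clean_acts)
    n_neurons = len(clean_acts[0])
    return [
        sum(abs(c[j] - n[j]) for c, n in zip(clean_acts, noisy_acts)) / n_examples
        for j in range(n_neurons)
    ]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the highest sensitivity score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def neutralise(activations, neuron_ids):
    """Zero out the selected neurons (one simple form of neutralisation)."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activations)]

# Hypothetical activations: 2 examples x 3 neurons, for clean and OCR-noisy text.
clean = [[1.0, 0.0, 2.0], [1.0, 0.5, 2.0]]
noisy = [[1.0, 3.0, 2.1], [1.0, 2.5, 1.9]]

scores = ocr_sensitivity(clean, noisy)
sensitive = top_k_neurons(scores, k=1)
print(sensitive, neutralise(noisy[0], sensitive))
```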

[AI-33] Quantum-Classical Sentiment Analysis

Link: https://arxiv.org/abs/2409.16928
Authors: Mario Bifulco, Luca Roversi
Keywords: classical CPLEX classifier, hybrid classical-quantum classifier, CPLEX classifier, classical CPLEX, classical-quantum classifier
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to BigHPC 2024 - this https URL

Abstract:In this study, we initially investigate the application of a hybrid classical-quantum classifier (HCQC) for sentiment analysis, comparing its performance against the classical CPLEX classifier and the Transformer architecture. Our findings indicate that while the HCQC underperforms relative to the Transformer in terms of classification accuracy, it requires significantly less time to converge to a reasonably good approximate solution. This experiment also reveals a critical bottleneck in the HCQC, whose architecture is partially undisclosed by the D-Wave property. To address this limitation, we propose a novel algorithm based on the algebraic decomposition of QUBO models, which enhances the time the quantum processing unit can allocate to problem-solving tasks.

[AI-34] AI-assisted Gaze Detection for Proctoring Online Exams

Link: https://arxiv.org/abs/2409.16923
Authors: Yong-Siang Shih, Zach Zhao, Chenhao Niu, Bruce Iberg, James Sharpnack, Mirza Basim Baig
Keywords: detect potential rule, potential rule violations, high-stakes online exams, high-stakes online, important to detect
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to HCOMP-24 Works-in-Progress and Demonstration track

Abstract:For high-stakes online exams, it is important to detect potential rule violations to ensure the security of the test. In this study, we investigate the task of detecting whether test takers are looking away from the screen, as such behavior could be an indication that the test taker is consulting external resources. For asynchronous proctoring, the exam videos are recorded and reviewed by the proctors. However, when the length of the exam is long, it could be tedious for proctors to watch entire exam videos to determine the exact moments when test takers look away. We present an AI-assisted gaze detection system, which allows proctors to navigate between different video frames and discover video frames where the test taker is looking in similar directions. The system enables proctors to work more effectively to identify suspicious moments in videos. An evaluation framework is proposed to evaluate the system against human-only and ML-only proctoring, and a user study is conducted to gather feedback from proctors, aiming to demonstrate the effectiveness of the system.

[AI-35] Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing

Link: https://arxiv.org/abs/2409.16913
Authors: Wenhao Liu, Siyu An, Junru Lu, Muling Wu, Tianlong Li, Xiaohua Wang, Xiaoqing Zheng, Di Yin, Xing Sun, Xuanjing Huang
Keywords: shown remarkable performance, knowledge conflicting requests, Role-Playing Agents, conflicting requests, shown remarkable
Categories: Artificial Intelligence (cs.AI)

Abstract:Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs’ performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes contextual knowledge conflicting requests, parametric knowledge conflicting requests, and non-conflicting requests to assess RPAs’ ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs exhibit significant performance gaps across different types of conflicting requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions within the model’s forwarding representation, which in turn influence the RPA’s final response behavior. Therefore, we introduce a lightweight representation editing approach that conveniently shifts conflicting requests to the rejection region, thereby enhancing the model’s refusal accuracy. The experimental results validate the effectiveness of our editing method, improving RPAs’ ability to refuse conflicting requests while maintaining their general role-playing capabilities.
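Shifting a representation toward a "rejection region" can be illustrated with a difference-of-means steering vector. This is an assumed, generic form of representation editing rather than the paper's method: the direction definition, the step size `alpha`, and the toy 2-D vectors are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_direction(rejection_reps, response_reps):
    """Direction pointing from the direct-response region toward the rejection region."""
    mean_rej = mean_vector(rejection_reps)
    mean_resp = mean_vector(response_reps)
    return [a - b for a, b in zip(mean_rej, mean_resp)]

def edit_representation(hidden, direction, alpha=1.0):
    """Shift a hidden representation along the steering direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 2-D hidden states collected from refused and directly answered queries.
rejection = [[1.0, 0.0], [3.0, 0.0]]
response = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_direction(rejection, response)
print(edit_representation([0.0, 2.0], direction, alpha=0.5))
```

In practice the edit would be applied only to requests flagged as conflicting, so that ordinary role-play queries are left untouched.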

[AI-36] Enhancing Temporal Sensitivity and Reasoning for Time-Sensitive Question Answering EMNLP2024

Link: https://arxiv.org/abs/2409.16909
Authors: Wanqi Yang,Yanda Li,Meng Fang,Ling Chen
Keywords: Time-Sensitive Question Answering, address time-sensitive questions, encompassing multiple time-evolving, specific temporal contexts, multiple time-evolving facts
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2024 Findings

Abstract:Time-Sensitive Question Answering (TSQA) demands the effective utilization of specific temporal contexts, encompassing multiple time-evolving facts, to address time-sensitive questions. This necessitates not only the parsing of temporal information within questions but also the identification and understanding of time-evolving facts to generate accurate answers. However, current large language models still have limited sensitivity to temporal information and inadequate temporal reasoning. In this paper, we propose a novel framework that enhances temporal awareness and reasoning through Temporal Information-Aware Embedding and Granular Contrastive Reinforcement Learning. Experimental results on four TSQA datasets demonstrate that our framework significantly outperforms existing LLMs in TSQA tasks, marking a step forward in bridging the performance gap between machine and human temporal understanding and reasoning.

[AI-37] Discriminative Anchor Learning for Efficient Multi-view Clustering

Link: https://arxiv.org/abs/2409.16904
Authors: Yalan Qin,Nan Pu,Hanzhou Wu,Nicu Sebe
Keywords: Multi-view clustering aims, anchor graph, shared anchor graph, underlying structure, aims to study
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been accepted by TMM

Abstract:Multi-view clustering aims to study the complementary information across views and discover the underlying structure. To reduce the relatively high computational cost of existing approaches, anchor-based methods have recently been proposed. Even with acceptable clustering performance, these methods tend to map the original representation from multiple views into a fixed shared graph based on the original dataset. However, most studies ignore the discriminative property of the learned anchors, which harms the representation capability of the built model. Moreover, simply learning the shared anchor graph without considering the quality of view-specific anchors fails to ensure the complementary information among anchors across views. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) to handle the above issues. We learn discriminative view-specific feature representations according to the original dataset and build anchors from different views based on these representations, which increases the quality of the shared anchor graph. The discriminative feature learning and consensus anchor graph construction are integrated into a unified framework so that they refine each other. The optimal anchors from multiple views and the consensus anchor graph are learned with orthogonal constraints. We give an iterative algorithm to solve the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.

[AI-38] Towards Underwater Camouflaged Object Tracking: An Experimental Evaluation of SAM and SAM 2

Link: https://arxiv.org/abs/2409.16902
Authors: Chunhui Zhang,Li Liu,Guanjie Huang,Hao Wen,Xi Zhou,Yanfeng Wang
Keywords: visual object tracking, object tracking, object tracking methods, object tracking dataset, visual object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. Work in Progress

Abstract:Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale training datasets. However, existing tracking datasets are primarily focused on open-air scenarios, which greatly limits the development of object tracking in underwater environments. To address this issue, we take a step forward by proposing the first large-scale underwater camouflaged object tracking dataset, namely UW-COT. Based on the proposed dataset, this paper presents an experimental evaluation of several advanced visual object tracking methods and the latest advancements in image and video segmentation. Specifically, we compare the performance of the Segment Anything Model (SAM) and its updated version, SAM 2, in challenging underwater environments. Our findings highlight the improvements in SAM 2 over SAM, demonstrating its enhanced capability to handle the complexities of underwater camouflaged objects. Compared to current advanced visual object tracking methods, the latest video segmentation foundation model SAM 2 also exhibits significant advantages, providing valuable insights into the development of more effective tracking technologies for underwater scenarios. The dataset will be accessible at this https URL.

[AI-39] A Roadmap for Embodied and Social Grounding in LLMs

Link: https://arxiv.org/abs/2409.16900
Authors: Sara Incao,Carlo Mazzola,Giulia Belgiovine,Alessandra Sciutti
Keywords: Large Language Models, offering unparalleled capabilities, multimodal input handling, fusion of Large, Language Models
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted Version of a conference paper presented at Robophilosophy Conference 2024

Abstract:The fusion of Large Language Models (LLMs) and robotic systems has led to a transformative paradigm in the robotic field, offering unparalleled capabilities not only in the communication domain but also in skills like multimodal input handling, high-level reasoning, and plan generation. The grounding of LLMs’ knowledge into the empirical world has been considered a crucial pathway to exploit the efficiency of LLMs in robotics. Nevertheless, connecting LLMs’ representations to the external world with multimodal approaches or with robots’ bodies is not enough to let them understand the meaning of the language they are manipulating. Taking inspiration from humans, this work draws attention to three necessary elements for an agent to grasp and experience the world. The roadmap for LLM grounding is envisaged as an active bodily system serving as the reference point for experiencing the environment, a temporally structured experience for a coherent, self-related interaction with the external world, and social skills to acquire a common-grounded shared experience.

[AI-40] AI-driven View Guidance System in Intra-cardiac Echocardiography Imaging

Link: https://arxiv.org/abs/2409.16898
Authors: Jaeyoung Huh,Paul Klein,Gareth Funka-Lea,Puneet Sharma,Ankur Kapoor,Young-Ho Kim
Keywords: structural heart disease, Intra-cardiac Echocardiography, crucial imaging modality, heart disease, providing real-time
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Intra-cardiac Echocardiography (ICE) is a crucial imaging modality used in electrophysiology (EP) and structural heart disease (SHD) interventions, providing real-time, high-resolution views from within the heart. Despite its advantages, effective manipulation of the ICE catheter requires significant expertise, which can lead to inconsistent outcomes, particularly among less experienced operators. To address this challenge, we propose an AI-driven closed-loop view guidance system with human-in-the-loop feedback, designed to assist users in navigating ICE imaging without requiring specialized knowledge. Our method models the relative position and orientation vectors between arbitrary views and clinically defined ICE views in a spatial coordinate system, guiding users on how to manipulate the ICE catheter to transition from the current view to the desired view over time. Operating in a closed-loop configuration, the system continuously predicts and updates the necessary catheter manipulations, ensuring seamless integration into existing clinical workflows. The effectiveness of the proposed system is demonstrated through a simulation-based evaluation, achieving an 89% success rate with the 6532 test dataset, highlighting its potential to improve the accuracy and efficiency of ICE imaging procedures.

[AI-41] Revisiting Space Mission Planning: A Reinforcement Learning-Guided Approach for Multi-Debris Rendezvous

Link: https://arxiv.org/abs/2409.16882
Authors: Agni Bandyopadhyay,Guenther Waxenegger-Wilfing
Keywords: Proximal Policy Optimization, masked Proximal Policy, masked Proximal, Policy Optimization, Proximal Policy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted for publication at the 2024 International Conference on Space Robotics (iSpaRo)

Abstract:This research introduces a novel application of a masked Proximal Policy Optimization (PPO) algorithm from the field of deep reinforcement learning (RL), for determining the most efficient sequence of space debris visitation, utilizing the Lambert solver as per Izzo’s adaptation for individual rendezvous. The aim is to optimize the sequence in which all the given debris should be visited to get the least total time for rendezvous for the entire mission. A neural network (NN) policy is developed, trained on simulated space missions with varying debris fields. After training, the neural network calculates approximately optimal paths using Izzo’s adaptation of Lambert maneuvers. Performance is evaluated against standard heuristics in mission planning. The reinforcement learning approach demonstrates a significant improvement in planning efficiency by optimizing the sequence for debris rendezvous, reducing the total mission time by an average of approximately 10.96% and 13.66% compared to the Genetic and Greedy algorithms, respectively. The model on average identifies the most time-efficient sequence for debris visitation across various simulated scenarios with the fastest computational speed. This approach signifies a step forward in enhancing mission planning strategies for space debris clearance.
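
The abstract does not describe the masking mechanism. A common reading of "masked" PPO is invalid-action masking, where actions that are not allowed (here, presumably debris already visited) receive zero probability before sampling. The sketch below illustrates that idea only. `masked_softmax` and the toy logits are hypothetical, and nothing here reproduces the authors' network or the Lambert solver.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over `logits`, assigning probability 0 where `valid[i]` is False."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    top = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: four debris objects, objects 0 and 2 already visited.
logits = [2.0, 0.5, 1.0, 0.5]          # made-up policy outputs
visited = [True, False, True, False]
probs = masked_softmax(logits, [not v for v in visited])
```

With the mask applied, the policy can only sample unvisited debris, so every rollout yields a valid visitation sequence without needing a penalty term for repeats.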

[AI-42] Automating Traffic Model Enhancement with AI Research Agent

Link: https://arxiv.org/abs/2409.16876
Authors: Xusen Guo,Xinxi Yang,Mingxing Peng,Hongliang Lu,Meixin Zhu,Hai Yang
Keywords: Developing efficient traffic, current approaches remain, approaches remain time-intensive, human errors due, Developing efficient
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 10 figures

Abstract:Developing efficient traffic models is essential for optimizing transportation systems, yet current approaches remain time-intensive and susceptible to human errors due to their reliance on manual processes. Traditional workflows involve exhaustive literature reviews, formula optimization, and iterative testing, leading to inefficiencies in research. In response, we introduce the Traffic Research Agent (TR-Agent), an AI-driven system designed to autonomously develop and refine traffic models through an iterative, closed-loop process. Specifically, we divide the research pipeline into four key stages: idea generation, theory formulation, theory evaluation, and iterative optimization; and construct TR-Agent with four corresponding modules: Idea Generator, Code Generator, Evaluator, and Analyzer. Working in synergy, these modules retrieve knowledge from external resources, generate novel ideas, implement and debug models, and finally assess them on the evaluation datasets. Furthermore, the system continuously refines these models based on iterative feedback, enhancing research efficiency and model performance. Experimental results demonstrate that TR-Agent achieves significant performance improvements across multiple traffic models, including the Intelligent Driver Model (IDM) for car following, the MOBIL lane-changing model, and the Lighthill-Whitham-Richards (LWR) traffic flow model. Additionally, TR-Agent provides detailed explanations for its optimizations, allowing researchers to verify and build upon its improvements easily. This flexibility makes the framework a powerful tool for researchers in transportation and beyond. To further support research and collaboration, we have open-sourced both the code and data used in our experiments, facilitating broader access and enabling continued advancements in the field.
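
For readers unfamiliar with the Intelligent Driver Model mentioned above, here is a minimal sketch of the standard IDM acceleration formula (Treiber et al.), which is the kind of baseline such an agent would start from. The parameter values are illustrative defaults chosen for this example, and the improved variants produced by TR-Agent are not shown.

```python
import math


def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, s0=2.0,
                     a_max=1.0, b=1.5, delta=4.0):
    """Standard IDM acceleration in m/s^2.

    v    : speed of the following vehicle (m/s)
    gap  : bumper-to-bumper distance to the leader (m), must be > 0
    dv   : approach rate, v - v_leader (m/s)
    v0, T, s0, a_max, b, delta : desired speed, time headway, minimum gap,
        maximum acceleration, comfortable deceleration, exponent
        (all illustrative values, not taken from the paper).
    """
    # Desired dynamic gap; the max(0, ...) guard is a common convention.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


free_road = idm_acceleration(v=0.0, gap=1e9, dv=0.0)    # close to a_max
at_speed = idm_acceleration(v=30.0, gap=1e9, dv=0.0)    # close to zero
closing_in = idm_acceleration(v=20.0, gap=10.0, dv=5.0)  # strong braking
```

The free-road term pulls the vehicle toward `v0`, while the interaction term grows quadratically as the actual gap falls below the desired gap.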

[AI-43] Ethical and Scalable Automation: A Governance and Compliance Framework for Business Applications

Link: https://arxiv.org/abs/2409.16872
Authors: Haocheng Lin
Keywords: poses significant challenges, significant challenges relating, businesses poses significant, legal compliance, popularisation of applying
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The popularisation of applying AI in businesses poses significant challenges relating to ethical principles, governance, and legal compliance. Although businesses have embedded AI into their day-to-day processes, they lack a unified approach for mitigating its potential risks. This paper introduces a framework ensuring that AI must be ethical, controllable, viable, and desirable. Balancing these factors ensures the design of a framework that addresses its trade-offs, such as balancing performance against explainability. A successful framework provides practical advice for businesses to meet regulatory requirements in sectors such as finance and healthcare, where it is critical to comply with standards like GDPR and the EU AI Act. Different case studies validate this framework by integrating AI in both academic and practical environments. For instance, large language models are cost-effective alternatives for generating synthetic opinions that emulate attitudes to environmental issues. These case studies demonstrate how having a structured framework could enhance transparency and maintain performance levels, as shown by the alignment between synthetic and expected distributions. This alignment is quantified using metrics like Chi-test scores, normalized mutual information, and Jaccard indexes. Future research should further explore the framework’s empirical validation in diverse industrial settings, ensuring the model’s scalability and adaptability.

[AI-44] Multi-objective Evolution of Heuristic Using Large Language Model

Link: https://arxiv.org/abs/2409.16867
Authors: Shunyu Yao,Fei Liu,Xi Lin,Zhichao Lu,Zhenkun Wang,Qingfu Zhang
Keywords: heuristic search, search, multi-objective heuristic search, heuristic, Heuristics
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Heuristics are commonly used to tackle diverse search and optimization problems. Designing heuristics usually requires tedious manual crafting with domain knowledge. Recent works have incorporated large language models (LLMs) into automatic heuristic search, leveraging their powerful language and coding capacity. However, existing research focuses on the optimal performance on the target problem as the sole objective, neglecting other criteria such as efficiency and scalability, which are vital in practice. To tackle this challenge, we propose to model heuristic search as a multi-objective optimization problem and consider introducing other practical criteria beyond optimal performance. Due to the complexity of the search space, conventional multi-objective optimization methods struggle to effectively handle multi-objective heuristic search. We propose the first LLM-based multi-objective heuristic search framework, Multi-objective Evolution of Heuristic (MEoH), which integrates LLMs in a zero-shot manner to generate a non-dominated set of heuristics to meet multiple design criteria. We design a new dominance-dissimilarity mechanism for effective population management and selection, which incorporates both code dissimilarity in the search space and dominance in the objective space. MEoH is demonstrated in two well-known combinatorial optimization problems: the online Bin Packing Problem (BPP) and the Traveling Salesman Problem (TSP). Results indicate that a variety of elite heuristics are automatically generated in a single run, offering more trade-off options than existing methods. It successfully achieves competitive or superior performance while improving efficiency up to 10 times. Moreover, we also observe that the multi-objective search introduces novel insights into heuristic design and leads to the discovery of diverse heuristics.
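
To make the notion of a "non-dominated set" concrete, the sketch below filters candidate heuristics by Pareto dominance over two objectives that are both minimized. It covers only the textbook dominance test. The dominance-dissimilarity mechanism described in the abstract also uses code dissimilarity, which is not modeled here, and the objective values are made up.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Return the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical heuristics scored as (optimality gap, runtime in seconds).
candidates = [(0.05, 10.0), (0.03, 25.0), (0.06, 12.0), (0.03, 30.0)]
front = non_dominated(candidates)
```

Here the third candidate is dominated by the first and the fourth by the second, so two trade-off options survive: one faster, one closer to optimal.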

[AI-45] The Role of Language Models in Modern Healthcare: A Comprehensive Review

Link: https://arxiv.org/abs/2409.16860
Authors: Amna Khalid,Ayma Khalid,Umar Khalid
Keywords: gained significant attention, significant attention due, process complex medical, large language models, clinical decision-making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The application of large language models (LLMs) in healthcare has gained significant attention due to their ability to process complex medical data and provide insights for clinical decision-making. These models have demonstrated substantial capabilities in understanding and generating natural language, which is crucial for medical documentation, diagnostics, and patient interaction. This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs, highlighting their strengths in healthcare applications and discussing challenges such as data privacy, bias, and ethical considerations. The potential of LLMs to enhance healthcare delivery is explored, alongside the necessary steps to ensure their ethical and effective integration into medical practice.

[AI-46] Dispute resolution in legal mediation with quantitative argumentation

Link: https://arxiv.org/abs/2409.16854
Authors: Xiao Chi
Keywords: extension of negotiation, taking into account, account the unique, unique role, Quantitative Argumentation Mediate
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Mediation is often treated as an extension of negotiation, without taking into account the unique role that norms and facts play in legal mediation. Additionally, current approaches for updating argument acceptability in response to changing variables frequently require the introduction of new arguments or the removal of existing ones, which can be inefficient and cumbersome in decision-making processes within legal disputes. In this paper, our contribution is two-fold. First, we introduce a QuAM (Quantitative Argumentation Mediate) framework, which integrates the parties’ knowledge and the mediator’s knowledge, including facts and legal norms, when determining the acceptability of a mediation goal. Second, we develop a new formalism to model the relationship between the acceptability of a goal argument and the values assigned to a variable associated with the argument. We use a real-world legal mediation as a running example to illustrate our approach.

[AI-47] Exposing Assumptions in AI Benchmarks through Cognitive Modelling

Link: https://arxiv.org/abs/2409.16849
Authors: Jonathan H. Rystrøm,Kenneth C. Enevoldsen
Keywords: Structural Equation Models, leading to vague, unclear interrelations, rely on implicit, vague formulations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures

Abstract:Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations. We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models. Using cross-lingual alignment transfer as an example, we show how this approach can answer key research questions and identify missing datasets. This framework grounds benchmark construction theoretically and guides dataset development to improve construct measurement. By embracing transparency, we move towards more rigorous, cumulative AI evaluation science, challenging researchers to critically examine their assessment foundations.

[AI-48] OffRIPP: Offline RL-based Informative Path Planning ICRA2025

Link: https://arxiv.org/abs/2409.16830
Authors: Srikar Babu Gadipudi,Srujan Deolasee,Siva Kailas,Wenhao Luo,Katia Sycara,Woojun Kim
Keywords: Informative path planning, Informative path, gather valuable information, path planning, task in robotics
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures, submitted to ICRA 2025

Abstract:Informative path planning (IPP) is a crucial task in robotics, where agents must design paths to gather valuable information about a target environment while adhering to resource constraints. Reinforcement learning (RL) has been shown to be effective for IPP; however, it requires environment interactions, which are risky and expensive in practice. To address this problem, we propose an offline RL-based IPP framework that optimizes information gain without requiring real-time interaction during training, offering safety and cost-efficiency by avoiding interaction, as well as superior performance and fast computation during execution – key advantages of RL. Our framework leverages batch-constrained reinforcement learning to mitigate extrapolation errors, enabling the agent to learn from pre-collected datasets generated by arbitrary algorithms. We validate the framework through extensive simulations and real-world experiments. The numerical results show that our framework outperforms the baselines, demonstrating the effectiveness of the proposed approach.

[AI-49] On the role of Artificial Intelligence methods in modern force-controlled manufacturing robotic tasks

Link: https://arxiv.org/abs/2409.16828
Authors: Vincenzo Petrone,Enrico Ferrentino,Pasquale Chiacchio
Keywords: Artificial Intelligence, position paper explores, cornerstone of Industry, Fourth Industrial Revolution, integration of Artificial
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: To be published in Proceedings of the 20th International Conference on Informatics in Control, Automation and Robotics (ICINCO)

Abstract:This position paper explores the integration of Artificial Intelligence (AI) into force-controlled robotic tasks within the scope of advanced manufacturing, a cornerstone of Industry 4.0. AI’s role in enhancing robotic manipulators - key drivers in the Fourth Industrial Revolution - is rapidly leading to significant innovations in smart manufacturing. The objective of this article is to frame these innovations in practical force-controlled applications - e.g. deburring, polishing, and assembly tasks like peg-in-hole (PiH) - highlighting their necessity for maintaining high-quality production standards. By reporting on recent AI-based methodologies, this article contrasts them and identifies current challenges to be addressed in future research. The analysis concludes with a perspective on future research directions, emphasizing the need for common performance metrics to validate AI techniques, integration of various enhancements for performance optimization, and the importance of validating them in relevant scenarios. These future directions aim to provide consistency with already adopted approaches, so as to be compatible with manufacturing standards, increasing the relevance of AI-driven methods in both academic and industrial contexts.

[AI-50] Learning phase-space flows using time-discrete implicit Runge-Kutta PINNs

Link: https://arxiv.org/abs/2409.16826
Authors: Álvaro Fernández Corral,Nicolás Mendoza,Armin Iske,Andrey Yachmenev,Jochen Küpper
Keywords: Informed Neural Networks, implicit Runge-Kutta Physics, Informed Neural, Neural Networks, obtaining multidimensional phase-space
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments: 10 pages, 4 figures, published in the International Conference on Scientific Computing and Machine Learning, see this http URL

Abstract:We present a computational framework for obtaining multidimensional phase-space solutions of systems of non-linear coupled differential equations, using high-order implicit Runge-Kutta Physics-Informed Neural Networks (IRK-PINNs) schemes. Building upon foundational work originally solving differential equations for fields depending on coordinates [J. Comput. Phys. 378, 686 (2019)], we adapt the scheme to a context where the coordinates are treated as functions. This modification enables us to efficiently solve equations of motion for a particle in an external field. Our scheme is particularly useful for explicitly time-independent and periodic fields. We apply this approach to successfully solve the equations of motion for a mass particle placed in a central force field and a charged particle in a periodic electric field.

[AI-51] Uncertainty Representations in State-Space Layers for Deep Reinforcement Learning under Partial Observability

Link: https://arxiv.org/abs/2409.16824
Authors: Carlos E. Luis,Alessandro G. Bottero,Julia Vinogradska,Felix Berkenkamp,Jan Peters
Keywords: Kalman filter, environment hidden state, partial observability requires, Kalman filter layer, Optimal decision-making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Optimal decision-making under partial observability requires reasoning about the uncertainty of the environment’s hidden state. However, most reinforcement learning architectures handle partial observability with sequence models that have no internal mechanism to incorporate uncertainty in their hidden state representation, such as recurrent neural networks, deterministic state-space models and transformers. Inspired by advances in probabilistic world models for reinforcement learning, we propose a standalone Kalman filter layer that performs closed-form Gaussian inference in linear state-space models and train it end-to-end within a model-free architecture to maximize returns. Similar to efficient linear recurrent layers, the Kalman filter layer processes sequential data using a parallel scan, which scales logarithmically with the sequence length. By design, Kalman filter layers are a drop-in replacement for other recurrent layers in standard model-free architectures, but importantly they include an explicit mechanism for probabilistic filtering of the latent state representation. Experiments in a wide variety of tasks with partial observability show that Kalman filter layers excel in problems where uncertainty reasoning is key for decision-making, outperforming other stateful models.
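
As a reminder of what "closed-form Gaussian inference in linear state-space models" means, the sketch below runs one predict/update step of a scalar Kalman filter. It is sequential and one-dimensional for readability. The layer described in the abstract uses a parallel scan and is trained end-to-end, neither of which is shown, and the parameter names and defaults are assumptions made for this example.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.1, h=1.0, r=0.5):
    """One predict/update step for x_t = a*x_{t-1} + noise(q), y_t = h*x_t + noise(r).

    Returns the posterior mean and variance of the latent state.
    """
    # Predict: propagate the Gaussian belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: condition on the observation in closed form.
    innovation_var = h * h * var_pred + r
    gain = var_pred * h / innovation_var
    mean_post = mean_pred + gain * (obs - h * mean_pred)
    var_post = (1.0 - gain * h) * var_pred
    return mean_post, var_post


# Toy check: prior N(0, 1), no process noise, unit observation noise, y = 1.
mean_post, var_post = kalman_step(0.0, 1.0, obs=1.0, q=0.0, r=1.0)
```

The posterior variance is the explicit uncertainty signal that deterministic recurrent layers lack: it shrinks with informative observations and grows when the process noise dominates.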

[AI-52] XAI-guided Insulator Anomaly Detection for Imbalanced Datasets ECCV2024

Link: https://arxiv.org/abs/2409.16821
Authors: Maximilian Andreas Hoefler,Karsten Mueller,Wojciech Samek
Keywords: Power grids serve, seamlessly delivering electrical, reliable operation indispensable, delivering electrical energy, Power grids
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as a workshop paper at ECCV 2024

Abstract:Power grids serve as a vital component in numerous industries, seamlessly delivering electrical energy to industrial processes and technologies, making their safe and reliable operation indispensable. However, powerlines can be hard to inspect due to difficult terrain or harsh climatic conditions. Therefore, unmanned aerial vehicles are increasingly deployed to inspect powerlines, resulting in a substantial stream of visual data which requires swift and accurate processing. Deep learning methods have become widely popular for this task, proving to be a valuable asset in fault detection. In particular, the detection of insulator defects is crucial for predicting powerline failures, since their malfunction can lead to transmission disruptions. It is therefore of great interest to continuously maintain and rigorously inspect insulator components. In this work we propose a novel pipeline to tackle this task. We utilize state-of-the-art object detection to detect and subsequently classify individual insulator anomalies. Our approach addresses dataset challenges such as imbalance and motion-blurred images through a fine-tuning methodology which allows us to alter the classification focus of the model by increasing the classification accuracy of anomalous insulators. In addition, we employ explainable-AI tools for precise localization and explanation of anomalies. This proposed method contributes to the field of anomaly detection, particularly vision-based industrial inspection and predictive maintenance. We significantly improve defect detection accuracy by up to 13%, while also offering a detailed analysis of model mis-classifications and localization quality, showcasing the potential of our method on real-world data.

[AI-53] PeerArg: Argumentative Peer Review with LLMs

Link: https://arxiv.org/abs/2409.16813
Authors: Purin Sukpanichnant,Anna Rapberger,Francesca Toni
Keywords: conferences or journals, essential process, process to determine, determine the quality, submitted to scientific
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Peer review is an essential process to determine the quality of papers submitted to scientific conferences or journals. However, it is subjective and prone to biases. Several studies have been conducted to apply techniques from NLP to support peer review, but they are based on black-box techniques and their outputs are difficult to interpret and trust. In this paper, we propose a novel pipeline to support and understand the reviewing and decision-making processes of peer review: the PeerArg system combining LLMs with methods from knowledge representation. PeerArg takes as input a set of reviews for a paper and outputs the paper acceptance prediction. We evaluate the performance of the PeerArg pipeline on three different datasets, in comparison with a novel end-2-end LLM that uses few-shot learning to predict paper acceptance given reviews. The results indicate that the end-2-end LLM is capable of predicting paper acceptance from reviews, but a variant of the PeerArg pipeline outperforms this LLM.

[AI-54] Large Language Model Predicts Above Normal All India Summer Monsoon Rainfall in 2024

Link: https://arxiv.org/abs/2409.16799
Authors: Ujjawal Sharma,Madhav Biyani,Akhil Dev Suresh,Debi Prasad Bhuyan,Saroj Kanta Mishra,Tanmoy Chakraborty
Keywords: India Summer Monsoon, India Summer, Summer Monsoon Rainfall, Reliable prediction, Summer Monsoon
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 3 figures

Abstract:Reliable prediction of the All India Summer Monsoon Rainfall (AISMR) is pivotal for informed policymaking for the country, impacting the lives of billions of people. However, accurate simulation of AISMR has been a persistent challenge due to the complex interplay of various multi-scale factors and the inherent variability of the monsoon system. This research focuses on adapting and fine-tuning the latest LLM model, PatchTST, to accurately predict AISMR with a lead time of three months. The fine-tuned PatchTST model, trained with historical AISMR data, the Niño3.4 index, and categorical Indian Ocean Dipole values, outperforms several popular neural network models and statistical models. This fine-tuned LLM model exhibits an exceptionally low RMSE percentage of 0.07% and a Spearman correlation of 0.976. This is particularly impressive, since it is nearly 80% more accurate than the best-performing NN models. The model predicts an above-normal monsoon for the year 2024, with an accumulated rainfall of 921.6 mm over the months of June to September for the entire country.
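
The two evaluation metrics quoted above can be computed as in the sketch below, which uses only the standard library. The abstract does not say how RMSE is normalized into a percentage, so plain RMSE is shown. The rainfall figures are made-up placeholders for illustration and are not data from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


observed = [850.0, 900.0, 870.0, 920.0]   # placeholder seasonal totals (mm)
predicted = [860.0, 890.0, 880.0, 910.0]  # placeholder model outputs (mm)
error = rmse(observed, predicted)
rho = spearman(observed, predicted)
```

RMSE measures the size of the errors in millimetres, while the Spearman correlation only checks whether wet and dry years are ranked in the right order, so the two numbers answer different questions.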

[AI-55] Scalable Ensemble Diversification for OOD Generalization and Detection

Link: https://arxiv.org/abs/2409.16797
Authors: Alexander Rubinstein,Luca Scimeca,Damien Teney,Seong Joon Oh
Keywords: Bayesian principles, OOD, OOD samples, practical applications, providing candidates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Training a diverse ensemble of models has several practical applications such as providing candidates for model selection with better out-of-distribution (OOD) generalization, and enabling the detection of OOD samples via Bayesian principles. An existing approach to diverse ensemble training encourages the models to disagree on provided OOD samples. However, the approach is computationally expensive and it requires well-separated ID and OOD examples, such that it has only been demonstrated in small-scale settings. Method. This work presents a method for Scalable Ensemble Diversification (SED) applicable to large-scale settings (e.g. ImageNet) that does not require OOD samples. Instead, SED identifies hard training samples on the fly and encourages the ensemble members to disagree on these. To improve scaling, we show how to avoid the expensive computations in existing methods of exhaustive pairwise disagreements across models. Results. We evaluate the benefits of diversification with experiments on ImageNet. First, for OOD generalization, we observe large benefits from the diversification in multiple settings including output-space (classical) ensembles and weight-space ensembles (model soups). Second, for OOD detection, we turn the diversity of ensemble hypotheses into a novel uncertainty score estimator that surpasses a large number of OOD detection baselines. Code is available here: this https URL.

[AI-56] Symbolic State Partition for Reinforcement Learning

Link: https://arxiv.org/abs/2409.16791
Authors: Mohsen Ghaffari,Mahsa Varshosaz,Einar Broch Johnsen,Andrzej Wąsowski
Keywords: Tabular reinforcement learning, Tabular reinforcement, state space, methods cannot operate, operate directly
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Tabular reinforcement learning methods cannot operate directly on continuous state spaces. One solution for this problem is to partition the state space. A good partitioning enables generalization during learning and more efficient exploitation of prior experiences. Consequently, the learning process becomes faster and produces more reliable policies. However, partitioning introduces approximation, which is particularly harmful in the presence of nonlinear relations between state components. An ideal partition should be as coarse as possible, while capturing the key structure of the state space for the given problem. This work extracts partitions from the environment dynamics by symbolic execution. We show that symbolic partitioning improves state space coverage with respect to environmental behavior and allows reinforcement learning to perform better for sparse rewards. We evaluate symbolic state space partitioning with respect to precision, scalability, learning agent performance and state space coverage for the learnt policies.
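
To make the partitioning idea concrete, here is a minimal sketch of the simplest baseline: a uniform grid that maps a continuous state to a row of a Q-table. This is not the symbolic-execution partitioning the paper proposes; the function name, bounds, and bin counts are hypothetical.

```python
def make_grid_partition(lows, highs, bins):
    """Return a function mapping a continuous state to a single table index."""
    def index_of(state):
        idx = 0
        for x, lo, hi, b in zip(state, lows, highs, bins):
            # Clamp to the bounds, then find which of the b equal cells x falls in.
            frac = (min(max(x, lo), hi) - lo) / (hi - lo)
            cell = min(int(frac * b), b - 1)
            idx = idx * b + cell
        return idx
    return index_of

# Two state components, split into 4 and 5 cells: 20 table rows in total.
index_of = make_grid_partition(lows=[-1.0, -2.0], highs=[1.0, 2.0], bins=[4, 5])
q_table = [[0.0, 0.0] for _ in range(4 * 5)]  # two actions per partition cell

row = q_table[index_of([0.0, 0.0])]
print(index_of([0.0, 0.0]), row)
```

A uniform grid ignores the environment dynamics, which is exactly the weakness the abstract points at: cells may be too coarse where the dynamics are nonlinear and needlessly fine elsewhere.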

[AI-57] Enhancing Feature Selection and Interpretability in AI Regression Tasks Through Feature Attribution

Link: https://arxiv.org/abs/2409.16787
Authors: Alexander Hinterleitner,Thomas Bartz-Beielstein,Richard Schulz,Sebastian Spengler,Thomas Winter,Christoph Leitenmeier
Keywords: Explainable Artificial Intelligence, Research in Explainable, Artificial Intelligence, Explainable Artificial, make deep learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Research in Explainable Artificial Intelligence (XAI) is increasing, aiming to make deep learning models more transparent. Most XAI methods focus on justifying the decisions made by Artificial Intelligence (AI) systems in security-relevant applications. However, relatively little attention has been given to using these methods to improve the performance and robustness of deep learning algorithms. Additionally, much of the existing XAI work primarily addresses classification problems. In this study, we investigate the potential of feature attribution methods to filter out uninformative features in input data for regression problems, thereby improving the accuracy and stability of predictions. We introduce a feature selection pipeline that combines Integrated Gradients with k-means clustering to select an optimal set of variables from the initial data space. To validate the effectiveness of this approach, we apply it to a real-world industrial problem - blade vibration analysis in the development process of turbo machinery.
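
The abstract does not say how Integrated Gradients and k-means are combined, so the following is only one plausible reading: compute attributions for each input feature, then split their magnitudes into two clusters and keep the higher one. The toy model, the finite-difference gradients, and all names are assumptions made for illustration.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """Approximate IG with a midpoint Riemann sum and finite-difference gradients."""
    attributions = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(len(x)):
            up, down = point[:], point[:]
            up[i] += eps
            down[i] -= eps
            grad_i = (f(up) - f(down)) / (2 * eps)
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions

def two_means_1d(values, iters=20):
    """Plain 1-D k-means with k=2; True marks values in the higher cluster."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        high = [v for v in values if abs(v - hi) < abs(v - lo)]
        low = [v for v in values if abs(v - hi) >= abs(v - lo)]
        if high:
            hi = sum(high) / len(high)
        if low:
            lo = sum(low) / len(low)
    return [abs(v - hi) < abs(v - lo) for v in values]

def toy_model(x):
    # Hypothetical regression model: the third feature barely matters.
    return 3.0 * x[0] + x[1] ** 2 + 0.01 * x[2]

attr = integrated_gradients(toy_model, x=[1.0, 2.0, 1.0], baseline=[0.0, 0.0, 0.0])
selected = two_means_1d([abs(a) for a in attr])
print(attr, selected)
```

The attributions sum to the difference between the model output at the input and at the baseline, which is a quick sanity check on any Integrated Gradients implementation.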

[AI-58] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction EMNLP2024

Link: https://arxiv.org/abs/2409.16783
Authors: Jinchuan Zhang,Yan Zhou,Yaxin Liu,Ziming Li,Songlin Hu
Keywords: identifying misaligned behaviors, Automated red teaming, Holistic Automated Red, large language models, Automated red
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: EMNLP 2024 camera ready version

Abstract:Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine interactions. To overcome these limitations, we propose HARM (Holistic Automated Red teaMing), which scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance for the alignment process.

[AI-59] LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ

Link: https://arxiv.org/abs/2409.16779
Authors: Marc-Antoine Allard,Matin Ansaripour,Maria Yuffa,Paul Teiletche
Keywords: Large Language Models, Large Language, tasks requiring mathematical, Language Models, multiple-choice questions
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs). To address this issue, we developed LLaMa-SciQ, an educational chatbot designed to assist college students in solving and understanding MCQs in STEM fields. We begin by fine-tuning and aligning the models to human preferences. After comparing the performance of Mistral-7B and LLaMa-8B, we selected the latter as the base model due to its higher evaluation accuracy. To further enhance accuracy, we implement Retrieval-Augmented Generation (RAG) and apply quantization to compress the model, reducing inference time and increasing accessibility for students. For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset. However, RAG does not improve performance and even reduces it, likely due to retriever issues or the model’s unfamiliarity with context. Despite this, the quantized model shows only a 5% loss in performance, demonstrating significant efficiency improvements.

[AI-60] Super Level Sets and Exponential Decay: A Synergistic Approach to Stable Neural Network Training

Link: https://arxiv.org/abs/2409.16769
Authors: Jatin Chaudhary,Dipak Nidhi,Jukka Heikkonen,Haari Merisaari,Rajiv Kanth
Keywords: advanced anti-overfitting strategies, integrates exponential decay, effectively integrates exponential, dynamic learning rate, anti-overfitting strategies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The objective of this paper is to enhance the optimization process for neural networks by developing a dynamic learning rate algorithm that effectively integrates exponential decay and advanced anti-overfitting strategies. Our primary contribution is the establishment of a theoretical framework where we demonstrate that the optimization landscape, under the influence of our algorithm, exhibits unique stability characteristics defined by Lyapunov stability principles. Specifically, we prove that the superlevel sets of the loss function, as influenced by our adaptive learning rate, are always connected, ensuring consistent training dynamics. Furthermore, we establish the “equiconnectedness” property of these superlevel sets, which maintains uniform stability across varying training conditions and epochs. This paper contributes to the theoretical understanding of dynamic learning rate mechanisms in neural networks and also paves the way for the development of more efficient and reliable neural optimization techniques. This study intends to formalize and validate the equiconnectedness of loss function as superlevel sets in the context of neural network training, opening new avenues for future research in adaptive machine learning algorithms. We leverage previous theoretical discoveries to propose training mechanisms that can effectively handle complex and high-dimensional data landscapes, particularly in applications requiring high precision and reliability.
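
For readers unfamiliar with the exponential-decay component mentioned above, the standard schedule is lr(t) = lr0 * exp(-k * t). The sketch below shows only that textbook schedule; the paper's full dynamic learning rate algorithm is not described in the abstract, and the parameter values here are arbitrary.

```python
import math

def exponential_decay_lr(initial_lr, decay_rate, epoch):
    """Textbook exponential decay: lr0 * exp(-decay_rate * epoch)."""
    return initial_lr * math.exp(-decay_rate * epoch)

# Arbitrary illustrative values: start at 0.1 and decay by exp(-0.05) per epoch.
schedule = [exponential_decay_lr(0.1, 0.05, epoch) for epoch in range(5)]
print(schedule)
```

The rate shrinks by the same factor every epoch, so early epochs take large steps and later epochs settle with progressively smaller ones.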

[AI-61] MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features

Link: https://arxiv.org/abs/2409.16765
Authors: Katharina Anderer,Andreas Reich,Matthias Wölfel
Keywords: paper presents, presents a benchmark, benchmark dataset, dataset for aligning, multimodal algorithm leveraging
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper presents a benchmark dataset for aligning lecture videos with corresponding slides and introduces a novel multimodal algorithm leveraging features from speech, text, and images. It achieves an average accuracy of 0.82 in comparison to SIFT (0.56) while being approximately 11 times faster. Using dynamic programming the algorithm tries to determine the optimal slide sequence. The results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to a high matching accuracy, followed by image features. The findings highlight that audio transcripts alone provide valuable information for alignment and are beneficial if OCR data is lacking. Variations in matching accuracy across different lectures highlight the challenges associated with video quality and lecture style. The novel multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
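
The dynamic-programming step with a transition penalty can be sketched as a Viterbi-style search over a frame-by-slide similarity matrix. This is an illustrative reconstruction, not the authors' implementation; the function name, the penalty value, and the similarity scores are made up.

```python
def align_frames_to_slides(sim, switch_penalty):
    """Best slide per frame, paying switch_penalty on every slide change."""
    n_frames, n_slides = len(sim), len(sim[0])
    best = [sim[0][:]]        # best[t][s]: best total score ending on slide s
    back = [[0] * n_slides]   # back[t][s]: slide chosen at frame t - 1
    for t in range(1, n_frames):
        row, ptr = [], []
        for s in range(n_slides):
            # Staying on s is free; arriving from another slide costs the penalty.
            candidates = [
                best[t - 1][p] - (0.0 if p == s else switch_penalty)
                for p in range(n_slides)
            ]
            p_best = max(range(n_slides), key=lambda p: candidates[p])
            row.append(candidates[p_best] + sim[t][s])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Trace the best path backwards from the final frame.
    s = max(range(n_slides), key=lambda k: best[-1][k])
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

# Made-up similarities: rows are video frames, columns are slides.
sim = [
    [0.9, 0.1],
    [0.4, 0.5],  # a noisy frame that slightly prefers the wrong slide
    [0.8, 0.2],
    [0.1, 0.9],
]
print(align_frames_to_slides(sim, 0.0))
print(align_frames_to_slides(sim, 0.3))
```

With no penalty the noisy second frame causes a spurious jump to slide 1 and back; with a penalty of 0.3 the path stays on slide 0 until the real transition, which matches the abstract's observation that penalizing transitions improves accuracy.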

[AI-62] Offline and Distributional Reinforcement Learning for Radio Resource Management

Link: https://arxiv.org/abs/2409.16764
Authors: Eslam Eldeeb,Hirley Alves
Keywords: intelligent wireless networks, future intelligent wireless, Reinforcement learning, wireless networks, future intelligent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Reinforcement learning (RL) has proved to have a promising role in future intelligent wireless networks. Online RL has been adopted for radio resource management (RRM), taking over traditional schemes. However, due to its reliance on online interaction with the environment, its role becomes limited in practical, real-world problems where online interaction is not feasible. In addition, traditional RL falls short in the face of the uncertainties and risks in real-world stochastic environments. To address this, we propose an offline and distributional RL scheme for the RRM problem, enabling offline training using a static dataset without any interaction with the environment and considering the sources of uncertainties using the distributions of the return. Simulation results demonstrate that the proposed scheme outperforms conventional resource management models. In addition, it is the only scheme that surpasses online RL and achieves a 16% gain over online RL.

[AI-63] GB-RVFL: Fusion of Randomized Neural Network and Granular Ball Computing

Link: https://arxiv.org/abs/2409.16735
Authors: M. Sajid,A. Quadir,M. Tanveer
Keywords: vector functional link, strong generalization ability, random vector functional, prominent classification model, functional link
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The random vector functional link (RVFL) network is a prominent classification model with strong generalization ability. However, RVFL treats all samples uniformly, ignoring whether they are pure or noisy, and its scalability is limited due to the need for inverting the entire training matrix. To address these issues, we propose the granular ball RVFL (GB-RVFL) model, which uses granular balls (GBs) as inputs instead of training samples. This approach enhances scalability by requiring only the inverse of the GB center matrix and improves robustness against noise and outliers through the coarse granularity of GBs. Furthermore, RVFL overlooks the dataset’s geometric structure. To address this, we propose the graph embedding GB-RVFL (GE-GB-RVFL) model, which fuses granular computing and graph embedding (GE) to preserve the topological structure of GBs. The proposed GB-RVFL and GE-GB-RVFL models are evaluated on KEEL, UCI, NDC and biomedical datasets, demonstrating superior performance compared to baseline models.

[AI-64] Non-stationary BERT: Exploring Augmented IMU Data For Robust Human Activity Recognition

Link: https://arxiv.org/abs/2409.16730
Authors: Ning Sun,Yufei Wang,Yuwei Zhang,Jixiang Wan,Shenyue Wang,Ping Liu,Xudong Zhang
Keywords: Human Activity Recognition, gained great attention, observe users’ daily, users’ daily activity, Activity Recognition
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users’ daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the deployment of HAR systems on mobile phones and to achieve user-specific activity recognition, we propose a novel light-weight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerometer and gyroscope data from the IMU. The network achieves state-of-the-art performance when tested on various activity recognition datasets, and the data augmentation method demonstrates its wide applicability.

[AI-65] A Multi-Dataset Classification-Based Deep Learning Framework for Electronic Health Records and Predictive Analysis in Healthcare

Link: https://arxiv.org/abs/2409.16721
Authors: Syed Mohd Faisal Malik,Md Tabrez Nafis,Mohd Abdul Ahad,Safdar Tanweer
Keywords: creating vast opportunities, deep learning techniques, deep learning predictive, deep learning, leverage deep learning
Subjects: Artificial Intelligence (cs.AI)

Abstract:In contemporary healthcare, to protect patient data, electronic health records have become invaluable repositories, creating vast opportunities to leverage deep learning techniques for predictive analysis. Retinal fundus images, cirrhosis stages, and heart disease diagnostic predictions have shown promising results through the integration of deep learning techniques for classifying diverse datasets. This study proposes a novel deep learning predictive analysis framework for classifying multiple datasets by pre-processing data from three distinct sources. A hybrid deep learning model combining Residual Networks and Artificial Neural Networks is proposed to detect acute and chronic diseases such as heart diseases, cirrhosis, and retinal conditions, outperforming existing models. Dataset preparation involves aspects such as categorical data transformation, dimensionality reduction, and missing data synthesis. Feature extraction is effectively performed using scaler transformation for categorical datasets and ResNet architecture for image datasets. The resulting features are integrated into a unified classification model. Rigorous experimentation and evaluation resulted in high accuracies of 93%, 99%, and 95% for retinal fundus images, cirrhosis stages, and heart disease diagnostic predictions, respectively. The efficacy of the proposed method is demonstrated through a detailed analysis of F1-score, precision, and recall metrics. This study offers a comprehensive exploration of methodologies and experiments, providing in-depth knowledge of deep learning predictive analysis in electronic health records.

[AI-66] Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP2024

Link: https://arxiv.org/abs/2409.16718
Authors: Ming Li,Jike Zhong,Chenxin Li,Liuzhuozheng Li,Nie Lin,Masashi Sugiyama
Keywords: Recent advances, classic model fine-tuning, fine-tuning Vision-Language Models, prompt tuning, adapter tuning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: EMNLP 2024 Main Conference

Abstract:Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while the classic model fine-tuning on inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint, and propose a new perspective: fine-tuning the specific parameters instead of all will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose CLIPFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by only fine-tuning the specific bias terms and normalization layers, CLIPFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at this https URL.
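
The idea of tuning only bias terms and normalization layers amounts to a filter over parameter names. The abstract does not say which specific biases and layers are chosen, so the rule, the parameter names, and the sizes below are all hypothetical; the sketch only shows how small the trainable subset becomes.

```python
def is_trainable(name):
    """Assumed rule: train bias terms and layer-norm parameters, freeze the rest."""
    return name.endswith(".bias") or ".ln_" in name or ".norm" in name

# Hypothetical parameter names and sizes, loosely shaped like one transformer block.
param_sizes = {
    "text.block0.attn.weight": 512 * 512,
    "text.block0.attn.bias": 512,
    "text.block0.ln_1.weight": 512,
    "text.block0.ln_1.bias": 512,
    "text.block0.mlp.weight": 512 * 2048,
    "text.block0.mlp.bias": 2048,
}

trainable = {name: size for name, size in param_sizes.items() if is_trainable(name)}
fraction = sum(trainable.values()) / sum(param_sizes.values())
print(sorted(trainable), fraction)
```

In a PyTorch model the same predicate could be applied while iterating over `model.named_parameters()`, setting `requires_grad` to `False` for every parameter the rule rejects.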

[AI-67] Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation

Link: https://arxiv.org/abs/2409.16706
Authors: Youngwan Jin,Incheol Park,Hanbin Song,Hyeongjin Ju,Yagiz Nalcakan,Shiho Kim
Keywords: generating high-quality Near-Infrared, RGB inputs, Vision Foundation Model, paper proposes, high-quality Near-Infrared
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages,12 figures

Abstract:This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next’s advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.

[AI-68] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Link: https://arxiv.org/abs/2409.16694
Authors: Ruihao Gong,Yifu Ding,Zining Wang,Chengtao Lv,Xingyu Zheng,Jinyang Du,Haotong Qin,Jinyang Guo,Michele Magno,Xianglong Liu
Keywords: Large language models, natural language processing, showcasing exceptional performance, Large language, achieved remarkable advancements
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Ruihao Gong leads the overall organization of the survey, with Yifu Ding and Jinyang Du contributing to Sections 2 and 3. Xingyu Zheng is responsible for authoring Section 4, while Chengtao Lv and Zining Wang collaborate on Section 5. Haotong Qin, Jinyang Guo, Michele Magno, and Xianglong Liu provide guidance during the whole process and assist in refining the final manuscript

Abstract:Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
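
As a concrete instance of reducing bit-width, the sketch below applies symmetric uniform quantization with a single shared scale, one of the simplest schemes in this family. It is not drawn from the survey itself; the function names and the weight values are illustrative.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.05, 0.90]  # made-up values
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)
print(q8, dequantize(q8, s8))
print(q4, dequantize(q4, s4))
```

Dropping from 8 to 4 bits enlarges the step between representable values, so the rounding error per weight grows; that accuracy cost is the price paid for the smaller memory footprint.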

[AI-69] CaBRNet, an open-source library for developing and evaluating Case-Based Reasoning Models

Link: https://arxiv.org/abs/2409.16693
Authors: Romain Xu-Darme(LSL),Aymeric Varasse(LSL),Alban Grastien(LSL),Julien Girard(LSL),Zakaria Chihani(LSL)
Keywords: model opaquely makes, field of explainable, vibrant effort, effort is dedicated, design of self-explainable
Subjects: Artificial Intelligence (cs.AI)

Abstract:In the field of explainable AI, a vibrant effort is dedicated to the design of self-explainable models, as a more principled alternative to post-hoc methods that attempt to explain the decisions after a model opaquely makes them. However, this productive line of research suffers from common downsides: lack of reproducibility, unfeasible comparison, diverging standards. In this paper, we propose CaBRNet, an open-source, modular, backward-compatible framework for Case-Based Reasoning Networks: this https URL.

[AI-70] Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model ECCV2024

Link: https://arxiv.org/abs/2409.16689
Authors: Shoma Iwai,Atsuki Osanai,Shunsuke Kitada,Shinichiro Omachi
Keywords: task to synthesize, synthesize a harmonious, Layout, harmonious layout, characterized by attributes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted by ECCV2024, Project Page: this https URL

Abstract:Layout generation is a task to synthesize a harmonious layout with elements characterized by attributes such as category, position, and size. Human designers experiment with the placement and modification of elements to create aesthetic layouts; however, we observed that current discrete diffusion models (DDMs) struggle to correct inharmonious layouts after they have been generated. In this paper, we first provide novel insights into the layout sticking phenomenon in DDMs and then propose a simple yet effective layout-assessment module, Layout-Corrector, which works in conjunction with existing DDMs to address the layout sticking problem. We present a learning-based module capable of identifying inharmonious elements within layouts, considering overall layout harmony characterized by complex composition. During the generation process, Layout-Corrector evaluates the correctness of each token in the generated layout, reinitializing those with low scores to the ungenerated state. The DDM then uses the high-scored tokens as clues to regenerate the harmonized tokens. Layout-Corrector, tested on common benchmarks, consistently boosts layout-generation performance when used in conjunction with various state-of-the-art DDMs. Furthermore, our extensive analysis demonstrates that the Layout-Corrector (1) successfully identifies erroneous tokens, (2) facilitates control over the fidelity-diversity trade-off, and (3) significantly mitigates the performance drop associated with fast sampling.
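
The reinitialization step described above can be pictured as a simple thresholding pass over per-token scores. The mask id, the threshold, and the score values below are hypothetical; the learned scoring module and the diffusion model itself are not reproduced.

```python
MASK = -1  # hypothetical id for the "ungenerated" state

def remask_low_scores(tokens, scores, threshold):
    """Reset tokens whose correctness score is below the threshold to MASK."""
    return [tok if score >= threshold else MASK for tok, score in zip(tokens, scores)]

tokens = [12, 7, 33, 5]            # made-up layout tokens
scores = [0.91, 0.18, 0.77, 0.40]  # made-up corrector outputs in [0, 1]

print(remask_low_scores(tokens, scores, threshold=0.5))
```

Raising the threshold resets more tokens and gives the generator more freedom to regenerate, which is one way to read the fidelity-diversity control mentioned in the abstract.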

[AI-71] MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making

Link: https://arxiv.org/abs/2409.16686
Authors: Dayuan Fu,Biqing Qi,Yihuai Gao,Che Jiang,Guanting Dong,Bowen Zhou
Keywords: Long-term memory, insight, crucial role, memory is significant, play a crucial
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Long-term memory is significant for agents, in which insights play a crucial role. However, the emergence of irrelevant insight and the lack of general insight can greatly undermine the effectiveness of insight. To solve this problem, in this paper, we introduce Multi-Scale Insight Agent (MSI-Agent), an embodied agent designed to improve LLMs’ planning and decision-making ability by summarizing and utilizing insight effectively across different scales. MSI achieves this through the experience selector, insight generator, and insight selector. Leveraging a three-part pipeline, MSI can generate task-specific and high-level insight, store it in a database, and then use relevant insight from it to aid in decision-making. Our experiments show that MSI outperforms another insight strategy when planning by GPT3.5. Moreover, we delve into the strategies for selecting seed experience and insight, aiming to provide LLM with more useful and relevant insight for better decision-making. Our observations also indicate that MSI exhibits better robustness when facing domain-shifting scenarios.

[AI-72] Erase then Rectify: A Training-Free Parameter Editing Approach for Cost-Effective Graph Unlearning

Link: https://arxiv.org/abs/2409.16684
Authors: Zhe-Rui Yang,Jindong Han,Chang-Dong Wang,Hao Liu
Keywords: Graph Neural Network, trained Graph Neural, Neural Network, Graph Neural, Graph unlearning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Graph unlearning, which aims to eliminate the influence of specific nodes, edges, or attributes from a trained Graph Neural Network (GNN), is essential in applications where privacy, bias, or data obsolescence is a concern. However, existing graph unlearning techniques often necessitate additional training on the remaining data, leading to significant computational costs, particularly with large-scale graphs. To address these challenges, we propose a two-stage training-free approach, Erase then Rectify (ETR), designed for efficient and scalable graph unlearning while preserving the model utility. Specifically, we first build a theoretical foundation showing that masking parameters critical for unlearned samples enables effective unlearning. Building on this insight, the Erase stage strategically edits model parameters to eliminate the impact of unlearned samples and their propagated influence on intercorrelated nodes. To further ensure the GNN’s utility, the Rectify stage devises a gradient approximation method to estimate the model’s gradient on the remaining dataset, which is then used to enhance model performance. Overall, ETR achieves graph unlearning without additional training or full training data access, significantly reducing computational overhead and preserving data privacy. Extensive experiments on seven public datasets demonstrate the consistent superiority of ETR in model utility, unlearning efficiency, and unlearning effectiveness, establishing it as a promising solution for real-world graph unlearning challenges.

[AI-73] GraphLoRA: Structure-Aware Contrastive Low-Rank Adaptation for Cross-Graph Transfer Learning

Link: https://arxiv.org/abs/2409.16670
Authors: Zhe-Rui Yang,Jindong Han,Chang-Dong Wang,Hao Liu
Keywords: Graph Neural Networks, Neural Networks, demonstrated remarkable proficiency, social networks, graph analytical tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in handling a range of graph analytical tasks across various domains, such as e-commerce and social networks. Despite their versatility, GNNs face significant challenges in transferability, limiting their utility in real-world applications. Existing research in GNN transfer learning overlooks discrepancies in distribution among various graph datasets, facing challenges when transferring across different distributions. How to effectively adapt a well-trained GNN to new graphs with varying feature and structural distributions remains an under-explored problem. Taking inspiration from the success of Low-Rank Adaptation (LoRA) in adapting large language models to various domains, we propose GraphLoRA, an effective and parameter-efficient method for transferring well-trained GNNs to diverse graph domains. Specifically, we first propose a Structure-aware Maximum Mean Discrepancy (SMMD) to align divergent node feature distributions across source and target graphs. Moreover, we introduce low-rank adaptation by injecting a small trainable GNN alongside the pre-trained one, effectively bridging structural distribution gaps while mitigating catastrophic forgetting. Additionally, a structure-aware regularization objective is proposed to enhance the adaptability of the pre-trained GNN to the target graph with scarce supervision labels. Extensive experiments on six real-world datasets demonstrate the effectiveness of GraphLoRA against eleven baselines by tuning only 20% of parameters, even across disparate graph domains. The code is available at https://anonymous.4open.science/r/GraphLoRA.
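
For context on the LoRA idea that GraphLoRA borrows, the sketch below shows the standard low-rank update from the language-model setting: a frozen weight W plus a trainable product B A of rank r. It does not reproduce GraphLoRA's injected GNN or its SMMD alignment; the matrix sizes and values are illustrative.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, scale=1.0):
    """Frozen weight W plus the low-rank update scale * (B @ A)."""
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

d_out, d_in, rank = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen
b = [[0.0] for _ in range(d_out)]   # d_out x rank, zero-initialised
a = [[0.5, -0.5, 0.25, 0.0]]        # rank x d_in

full_params = d_out * d_in           # parameters touched by full fine-tuning
lora_params = rank * (d_out + d_in)  # parameters trained by the low-rank update
print(lora_effective_weight(w, b, a), full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only the two small factors are updated during transfer.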

[AI-74] Progressive Representation Learning for Real-Time UAV Tracking IROS2024

Link: https://arxiv.org/abs/2409.16652
Authors: Changhong Fu, Xiang Lei, Haobo Zuo, Liangliang Yao, Guangze Zheng, Jia Pan
Keywords: unmanned aerial vehicles, significantly promoted autonomous, promoted autonomous applications, Visual object tracking, Visual object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)


Abstract:Visual object tracking has significantly promoted autonomous applications for unmanned aerial vehicles (UAVs). However, learning robust object representations for UAV tracking is especially challenging in complex dynamic environments, when confronted with aspect ratio change and occlusion. These challenges severely alter the original information of the object. To handle the above issues, this work proposes a novel progressive representation learning framework for UAV tracking, i.e., PRL-Track. Specifically, PRL-Track is divided into coarse representation learning and fine representation learning. For coarse representation learning, two innovative regulators, which rely on appearance and semantic information, are designed to mitigate appearance interference and capture semantic information. Furthermore, for fine representation learning, a new hierarchical modeling generator is developed to intertwine coarse object representations. Exhaustive experiments demonstrate that the proposed PRL-Track delivers exceptional performance on three authoritative UAV tracking benchmarks. Real-world tests indicate that the proposed PRL-Track realizes superior tracking performance with 42.6 frames per second on the typical UAV platform equipped with an edge smart camera. The code, model, and demo videos are available at this https URL.

[AI-75] Task Addition in Multi-Task Learning by Geometrical Alignment

Link: https://arxiv.org/abs/2409.16645
Authors: Soorin Yim, Dae-Woong Jeong, Sung Moon Ko, Sumin Lee, Hyunseung Kim, Chanhui Lee, Sehui Han
Keywords: molecular property prediction, deep learning models, Training deep learning, Aligned Transfer Encoder, Geometrically Aligned Transfer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, Accepted at AI for Science Workshop at 41st International Conference on Machine Learning


Abstract:Training deep learning models on limited data while maintaining generalization is one of the fundamental challenges in molecular property prediction. One effective solution is transferring knowledge extracted from abundant datasets to those with scarce data. Recently, a novel algorithm called Geometrically Aligned Transfer Encoder (GATE) has been introduced, which uses soft parameter sharing by aligning the geometrical shapes of task-specific latent spaces. However, GATE faces limitations in scaling to multiple tasks due to computational costs. In this study, we propose a task addition approach for GATE to improve performance on target tasks with limited data while minimizing computational complexity. It is achieved through supervised multi-task pre-training on a large dataset, followed by the addition and training of task-specific modules for each target task. Our experiments demonstrate the superior performance of the task addition strategy for GATE over conventional multi-task methods, with comparable computational costs.

[AI-76] Training Language Models to Win Debates with Self-Play Improves Judge Accuracy

Link: https://arxiv.org/abs/2409.16636
Authors: Samuel Arnesen, David Rein, Julian Michael
Keywords: generated via self-play, test the robustness, method of scalable, scalable oversight, data generated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 48 pages, 12 figures; code at this https URL


Abstract:We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language model based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.

[AI-77] Judgment of Thoughts: Courtroom of the Binary Logical Reasoning in Large Language Models

Link: https://arxiv.org/abs/2409.16635
Authors: Sungjune Park, Daeseon Choi
Keywords: engineering technique called, prompt engineering technique, technique called Judgment, paper proposes
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:This paper proposes a novel prompt engineering technique called Judgment of Thought (JoT) that is specifically tailored for binary logical reasoning tasks. JoT employs three roles (lawyer, prosecutor, and judge) to facilitate more reliable and accurate reasoning by the model. In this framework, the judge utilizes a high-level model, while the lawyer and prosecutor utilize low-level models. This structure helps the judge better understand the responses from both the lawyer and prosecutor, enabling a more accurate judgment. Experimental results on large language model (LLM) benchmark datasets, such as BigBenchHard and Winogrande, demonstrate that JoT outperforms existing methods, including Chain of Thought (CoT) and Self-Consistency (SC), in binary logical reasoning tasks. Additionally, in real-world tasks, such as Fake News Detection and SMS Spam Detection, JoT shows comparable or improved performance compared to existing techniques. JoT significantly enhances the accuracy and reliability of models in binary reasoning tasks and shows potential for practical applicability across various domains. Future research should aim to further broaden the applicability of JoT and optimize its implementation for real-world problem-solving.

[AI-78] Stochastic Subsampling With Average Pooling

Link: https://arxiv.org/abs/2409.16630
Authors: Bum Jun Kim, Sang Woo Kim
Keywords: deep neural networks, deep neural, higher generalization performance, neural networks, achieve higher generalization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures


Abstract:Regularization of deep neural networks has been an important issue to achieve higher generalization performance without overfitting problems. Although the popular method of Dropout provides a regularization effect, it causes inconsistent properties in the output, which may degrade the performance of deep neural networks. In this study, we propose a new module called stochastic average pooling, which incorporates Dropout-like stochasticity in pooling. We describe the properties of stochastic subsampling and average pooling and leverage them to design a module without any inconsistency problem. The stochastic average pooling achieves a regularization effect without any potential performance degradation due to the inconsistency issue and can easily be plugged into existing architectures of deep neural networks. Experiments demonstrate that replacing existing average pooling with stochastic average pooling yields consistent improvements across a variety of tasks, datasets, and models.

[AI-79] Ascend HiFloat8 Format for Deep Learning

Link: https://arxiv.org/abs/2409.16626
Authors: Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, Tianchi Hu, Junsong Wang, Minqi Chen, Mikhaylov Dmitry, Korviakov Vladimir, Bobrin Maxim, Yuhao Hu, Guanfu Chen, Zeyi Huang
Keywords: floating-point data format, preliminary white paper, white paper proposes, floating-point data, deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 13 Pages, 4 Figures, 9 Tables


Abstract:This preliminary white paper proposes a novel 8-bit floating-point data format HiFloat8 (abbreviated as HiF8) for deep learning. HiF8 features tapered precision. For normal value encoding, it provides 7 exponents with 3-bit mantissa, 8 exponents with 2-bit mantissa, and 16 exponents with 1-bit mantissa. For denormal or subnormal value encoding, it extends the dynamic range by 7 extra powers of 2, from 31 to 38 binades (notice that FP16 covers 40 binades). Meanwhile, HiF8 encodes all the special values except that positive zero and negative zero are represented by only one bit-pattern. Thanks to the better balance between precision and dynamic range, HiF8 can be simultaneously used in both forward and backward passes of AI training. In this paper, we will describe the definition and rounding methods of HiF8, as well as the tentative training and inference solutions. To demonstrate the efficacy of HiF8 format, massive simulation results on various neural networks, including traditional neural networks and large language models (LLMs), will also be presented.
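
The binade counts quoted in the abstract can be checked with simple arithmetic. The sketch below assumes that each distinct exponent value corresponds to one binade (one power-of-two interval); that reading is ours and is not spelled out in the abstract.

```latex
\[
\underbrace{7}_{\text{3-bit mantissa}}
+ \underbrace{8}_{\text{2-bit mantissa}}
+ \underbrace{16}_{\text{1-bit mantissa}}
= 31 \ \text{binades (normal values)},
\qquad
31 + \underbrace{7}_{\text{denormal extension}} = 38 \ \text{binades}.
\]
```

Under this reading, the total of 38 binades is close to the 40 binades the abstract attributes to FP16.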

[AI-80] On Your Mark, Get Set, Predict! Modeling Continuous-Time Dynamics of Cascades for Information Popularity Prediction

Link: https://arxiv.org/abs/2409.16623
Authors: Xin Jing, Yichen Jing, Yuhuan Lu, Bangchao Deng, Sikun Yang, Dingqi Yang
Keywords: including viral marketing, Information popularity prediction, Information popularity, including viral, important yet challenging
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Information popularity prediction is important yet challenging in various domains, including viral marketing and news recommendations. The key to accurately predicting information popularity lies in subtly modeling the underlying temporal information diffusion process behind observed events of an information cascade, such as the retweets of a tweet. To this end, most existing methods either adopt recurrent networks to capture the temporal dynamics from the first to the last observed event or develop a statistical model based on self-exciting point processes to make predictions. However, information diffusion is intrinsically a complex continuous-time process with irregularly observed discrete events, which is oversimplified using recurrent networks as they fail to capture the irregular time intervals between events, or using self-exciting point processes as they lack flexibility to capture the complex diffusion process. Against this background, we propose ConCat, modeling the Continuous-time dynamics of Cascades for information popularity prediction. On the one hand, it leverages neural Ordinary Differential Equations (ODEs) to model irregular events of a cascade in continuous time based on the cascade graph and sequential event information. On the other hand, it considers cascade events as neural temporal point processes (TPPs) parameterized by a conditional intensity function which can also benefit the popularity prediction task. We conduct extensive experiments to evaluate ConCat on three real-world datasets. Results show that ConCat achieves superior performance compared to state-of-the-art baselines, yielding a 2.3%-33.2% improvement over the best-performing baselines across the three datasets.

[AI-81] Entailment-Driven Privacy Policy Classification with LLMs

Link: https://arxiv.org/abs/2409.16621
Authors: Bhanuka Silva, Dishanika Denipitiyage, Suranga Seneviratne, Anirban Mahanti, Aruna Seneviratne
Keywords: online services provide, lengthy and complicated, online services, understand what personal, privacy policies
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 3 tables


Abstract:While many online services provide privacy policies for end users to read and understand what personal data are being collected, these documents are often lengthy and complicated. As a result, the vast majority of users do not read them at all, leading to data collection under uninformed consent. Several attempts have been made to make privacy policies more user friendly by summarising them, providing automatic annotations or labels for key sections, or by offering chat interfaces to ask specific questions. With recent advances in Large Language Models (LLMs), there is an opportunity to develop more effective tools to parse privacy policies and help users make informed decisions. In this paper, we propose an entailment-driven LLM based framework to classify paragraphs of privacy policies into meaningful labels that are easily understood by users. The results demonstrate that our framework outperforms traditional LLM methods, improving the F1 score on average by 11.2%. Additionally, our framework provides inherently explainable and meaningful predictions.

[AI-82] Optimized Monte Carlo Tree Search for Enhanced Decision Making in the FrozenLake Environment

Link: https://arxiv.org/abs/2409.16620
Authors: Esteban Aldana Guerra
Keywords: Monte Carlo Tree, Carlo Tree Search, Monte Carlo, Tree Search, complex decision-making problems
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Monte Carlo Tree Search (MCTS) is a powerful algorithm for solving complex decision-making problems. This paper presents an optimized MCTS implementation applied to the FrozenLake environment, a classic reinforcement learning task characterized by stochastic transitions. The optimization leverages cumulative reward and visit count tables along with the Upper Confidence Bound for Trees (UCT) formula, resulting in efficient learning in a slippery grid world. We benchmark our implementation against other decision-making algorithms, including MCTS with Policy and Q-Learning, and perform a detailed comparison of their performance. The results demonstrate that our optimized approach effectively maximizes rewards and success rates while minimizing convergence time, outperforming baseline methods, especially in environments with inherent randomness.
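
The abstract mentions cumulative reward and visit count tables combined with the UCT formula. A minimal sketch of that selection rule follows, using the textbook UCT expression (mean reward plus an exploration bonus). The table layout, the function names `uct_score` and `select_action`, and the exploration constant `c` are illustrative assumptions, not the paper's implementation.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Textbook UCT value: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited actions are tried first
    mean = total_reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus

def select_action(reward_table, visit_table, state, actions, c=math.sqrt(2)):
    """Pick the action with the highest UCT score in `state`.

    reward_table and visit_table are hypothetical dicts keyed by
    (state, action); missing entries count as zero.
    """
    parent_visits = sum(visit_table.get((state, a), 0) for a in actions)
    parent_visits = max(parent_visits, 1)  # avoid log(0)
    return max(
        actions,
        key=lambda a: uct_score(
            reward_table.get((state, a), 0.0),
            visit_table.get((state, a), 0),
            parent_visits,
            c,
        ),
    )
```

With equal visit counts the exploration bonus is the same for every action, so the choice falls to the highest mean reward; any unvisited action is selected before a visited one.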

[AI-83] CasFT: Future Trend Modeling for Information Popularity Prediction with Dynamic Cues-Driven Diffusion Models

Link: https://arxiv.org/abs/2409.16619
Authors: Xin Jing, Yichen Jing, Yuhuan Lu, Bangchao Deng, Xueqin Chen, Dingqi Yang
Keywords: online social platforms, predicting content popularity, range of applications, strategic decision-making, rapid spread
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:The rapid spread of diverse information on online social platforms has prompted both academia and industry to realize the importance of predicting content popularity, which could benefit a wide range of applications, such as recommendation systems and strategic decision-making. Recent works mainly focused on extracting spatiotemporal patterns inherent in the information diffusion process within a given observation period so as to predict its popularity over a future period of time. However, these works often overlook the future popularity trend, as future popularity could either increase exponentially or stagnate, introducing uncertainties to the prediction performance. Additionally, how to transfer the preceding-term dynamics learned from the observed diffusion process into future-term trends remains an unexplored challenge. Against this background, we propose CasFT, which leverages observed information Cascades and dynamic cues extracted via neural ODEs as conditions to guide the generation of Future popularity-increasing Trends through a diffusion model. These generated trends are then combined with the spatiotemporal patterns in the observed information cascade to make the final popularity prediction. Extensive experiments conducted on three real-world datasets demonstrate that CasFT significantly improves the prediction accuracy, compared to state-of-the-art approaches, yielding 2.2%-19.3% improvement across different datasets.

[AI-84] Claim-Guided Textual Backdoor Attack for Practical Applications

Link: https://arxiv.org/abs/2409.16618
Authors: Minkyoo Song, Hanna Kim, Jaehan Kim, Youngjin Jin, Seungwon Shin
Keywords: natural language processing, Recent advances, large language models, security vulnerabilities, backdoor attacks
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Under Review


Abstract:Recent advances in natural language processing and the increased use of large language models have exposed new security vulnerabilities, such as backdoor attacks. Previous backdoor attacks require input manipulation after model distribution to activate the backdoor, posing limitations in real-world applicability. Addressing this gap, we introduce a novel Claim-Guided Backdoor Attack (CGBA), which eliminates the need for such manipulations by utilizing inherent textual claims as triggers. CGBA leverages claim extraction, clustering, and targeted training to trick models to misbehave on targeted claims without affecting their performance on clean data. CGBA demonstrates its effectiveness and stealthiness across various datasets and models, significantly enhancing the feasibility of practical backdoor attacks. Our code and data will be available at this https URL.

[AI-85] Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications

Link: https://arxiv.org/abs/2409.16605
Authors: Ethan Lin, Zhiyuan Peng, Yi Fang
Keywords: large language models, evaluated the creativity, semantic perspective, cognitive science, studies have evaluated
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: under review


Abstract:Recent studies have evaluated the creativity/novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing the novelty in scholarly publications is a largely unexplored area in evaluating LLMs. In this paper, we introduce a scholarly novelty benchmark (SchNovel) to evaluate LLMs’ ability to assess novelty in scholarly papers. SchNovel consists of 15000 pairs of papers across six fields sampled from the arXiv dataset with publication dates spanning 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the review process taken by human reviewers by leveraging the retrieval of similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.

[AI-86] MambaJSCC: Adaptive Deep Joint Source-Channel Coding with Generalized State Space Model

Link: https://arxiv.org/abs/2409.16592
Authors: Tong Wu, Zhiyong Chen, Meixia Tao, Yaping Sun, Xiaodong Xu, Wenjun Zhang, Ping Zhang
Keywords: joint source-channel coding, efficient neural network, deep joint source-channel, Lightweight and efficient, neural network models
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to IEEE Journal


Abstract:Lightweight and efficient neural network models for deep joint source-channel coding (JSCC) are crucial for semantic communications. In this paper, we propose a novel JSCC architecture, named MambaJSCC, that achieves state-of-the-art performance with low computational and parameter overhead. MambaJSCC utilizes the visual state space model with channel adaptation (VSSM-CA) blocks as its backbone for transmitting images over wireless channels, where the VSSM-CA primarily consists of the generalized state space models (GSSM) and the zero-parameter, zero-computational channel adaptation method (CSI-ReST). We design the GSSM module, leveraging reversible matrix transformations to express generalized scan expanding operations, and theoretically prove that two GSSM modules can effectively capture global information. We discover that GSSM inherently possesses the ability to adapt to channels, a form of endogenous intelligence. Based on this, we design the CSI-ReST method, which injects channel state information (CSI) into the initial state of GSSM to utilize its native response, and into the residual state to mitigate CSI forgetting, enabling effective channel adaptation without introducing additional computational and parameter overhead. Experimental results show that MambaJSCC not only outperforms existing JSCC methods (e.g., SwinJSCC) across various scenarios but also significantly reduces parameter size, computational overhead, and inference delay.

[AI-87] AutoSTF: Decoupled Neural Architecture Search for Cost-Effective Automated Spatio-Temporal Forecasting

Link: https://arxiv.org/abs/2409.16586
Authors: Tengfei Lyu, Weijia Zhang, Jinliang Deng, Hao Liu
Keywords: smart city applications, automated spatio-temporal forecasting, Spatio-temporal forecasting, energy management, spatio-temporal forecasting methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 13 figures


Abstract:Spatio-temporal forecasting is a critical component of various smart city applications, such as transportation optimization, energy management, and socio-economic analysis. Recently, several automated spatio-temporal forecasting methods have been proposed to automatically search the optimal neural network architecture for capturing complex spatio-temporal dependencies. However, the existing automated approaches suffer from expensive neural architecture search overhead, which hinders their practical use and the further exploration of diverse spatio-temporal operators in a finer granularity. In this paper, we propose AutoSTF, a decoupled automatic neural architecture search framework for cost-effective automated spatio-temporal forecasting. From the efficiency perspective, we first decouple the mixed search space into temporal space and spatial space and respectively devise representation compression and parameter-sharing schemes to mitigate the parameter explosion. The decoupled spatio-temporal search not only expedites the model optimization process but also leaves new room for more effective spatio-temporal dependency modeling. From the effectiveness perspective, we propose a multi-patch transfer module to jointly capture multi-granularity temporal dependencies and extend the spatial search space to enable finer-grained layer-wise spatial dependency search. Extensive experiments on eight datasets demonstrate the superiority of AutoSTF in terms of both accuracy and efficiency. Specifically, our proposed method achieves up to 13.48x speed-up compared to state-of-the-art automatic spatio-temporal forecasting methods while maintaining the best forecasting accuracy.

[AI-88] Reactive Multi-Robot Navigation in Outdoor Environments Through Uncertainty-Aware Active Learning of Human Preference Landscape

Link: https://arxiv.org/abs/2409.16577
Authors: Chao Huang, Wenshuo Zang, Carlo Pinciroli, Zhi Jane Li, Taposh Banerjee, Lili Su, Rui Liu
Keywords: Multi-Robot Systems, Compared with single, MRS, diverse capabilities, perform missions
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:


Abstract:Compared with single robots, Multi-Robot Systems (MRS) can perform missions more efficiently due to the presence of multiple members with diverse capabilities. However, deploying an MRS in wide real-world environments is still challenging due to uncertain and various obstacles (e.g., building clusters and trees). With a limited understanding of environmental uncertainty on performance, an MRS cannot flexibly adjust its behaviors (e.g., teaming, load sharing, trajectory planning) to ensure both environment adaptation and task accomplishments. In this work, a novel joint preference landscape learning and behavior adjusting framework (PLBA) is designed. PLBA efficiently integrates real-time human guidance to MRS coordination and utilizes Sparse Variational Gaussian Processes with Varying Output Noise to quickly assess human preferences by leveraging spatial correlations between environment characteristics. An optimization-based behavior-adjusting method then safely adapts MRS behaviors to environments. To validate PLBA’s effectiveness in MRS behavior adaption, a flood disaster search and rescue task was designed. 20 human users provided 1764 pieces of feedback based on human preferences obtained from MRS behaviors related to “task quality”, “task progress”, and “robot safety”. The prediction accuracy and adaptation speed results show the effectiveness of PLBA in preference learning and MRS behavior adaption.

[AI-89] Enhancing disease detection in radiology reports through fine-tuning lightweight LLM on weak labels

Link: https://arxiv.org/abs/2409.16563
Authors: Yishu Wei, Xindi Wang, Hanley Ong, Yiliang Zhou, Adam Flanders, George Shih, Yifan Peng
Keywords: applying large language, large language models, practical applications, synthetic labels, significant progress
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Despite significant progress in applying large language models (LLMs) to the medical domain, several limitations still prevent them from practical applications. Among these are the constraints on model size and the lack of cohort-specific labeled datasets. In this work, we investigated the potential of improving a lightweight LLM, such as Llama 3.1-8B, through fine-tuning with datasets using synthetic labels. Two tasks are jointly trained by combining their respective instruction datasets. When the quality of the task-specific synthetic labels is relatively high (e.g., generated by GPT-4o), Llama 3.1-8B achieves satisfactory performance on the open-ended disease detection task, with a micro F1 score of 0.91. Conversely, when the quality of the task-relevant synthetic labels is relatively low (e.g., from the MIMIC-CXR dataset), fine-tuned Llama 3.1-8B is able to surpass its noisy teacher labels (micro F1 score of 0.67 vs. 0.63) when calibrated against curated labels, indicating the strong inherent underlying capability of the model. These findings demonstrate the potential of fine-tuning LLMs with synthetic labels, offering a promising direction for future research on LLM specialization in the medical domain.

[AI-90] Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Link: https://arxiv.org/abs/2409.16560
Authors: Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
Keywords: numerous real-world tasks, shown outstanding performance, Large language models, beam sampling, real-world tasks
Categories: Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model’s distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model’s distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. Besides, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling…
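
For context on the baseline this abstract builds on, the sketch below shows one verification step of standard single-sequence speculative sampling: the drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). This is the commonly described acceptance rule, not the DSBD beam scheme proposed in the paper; the function name and the list-based probability vectors are illustrative assumptions.

```python
import random

def speculative_accept(draft_token, p_target, q_draft, rng=random):
    """One verification step of standard speculative sampling.

    p_target and q_draft are probability lists over the same vocabulary
    (large model and small draft model). draft_token is assumed to have
    been sampled from q_draft, so q_draft[draft_token] > 0.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual distribution max(0, p - q).
    # A rejection implies p < q for the drafted token, so some other token
    # has p > q and the residual weights are not all zero.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

This accept-or-resample rule is what lets the output follow the large model's distribution, the property the abstract refers to when it says speculative decoding matches multinomial sampling.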

[AI-91] Demystifying Issues, Causes and Solutions in LLM Open-Source Projects

Link: https://arxiv.org/abs/2409.16559
Authors: Yangxiao Cai, Peng Liang, Yifei Wang, Zengyang Li, Mojtaba Shahin
Keywords: Large Language Models, LLM open-source projects, Large Language, core functional component, LLM open-source
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 22 pages, 2 images, 6 tables, Manuscript submitted to a journal (2024)


Abstract:With the advancements of Large Language Models (LLMs), an increasing number of open-source software projects are using LLMs as their core functional component. Although research and practice on LLMs are capturing considerable interest, no dedicated studies explored the challenges faced by practitioners of LLM open-source projects, the causes of these challenges, and potential solutions. To fill this research gap, we conducted an empirical study to understand the issues that practitioners encounter when developing and using LLM open-source software, the possible causes of these issues, and potential solutions. We collected all closed issues from 15 LLM open-source projects and labelled issues that met our requirements. We then randomly selected 994 issues from the labelled issues as the sample for data extraction and analysis to understand the prevalent issues, their underlying causes, and potential solutions. Our study results show that (1) Model Issue is the most common issue faced by practitioners, (2) Model Problem, Configuration and Connection Problem, and Feature and Method Problem are identified as the most frequent causes of the issues, and (3) Optimize Model is the predominant solution to the issues. Based on the study results, we provide implications for practitioners and researchers of LLM open-source projects.

[AI-92] Context-aware and Style-related Incremental Decoding framework for Discourse-Level Literary Translation

Link: https://arxiv.org/abs/2409.16539
Authors: Yuanchang Luo, Jiaxin Guo, Daimeng Wei, Hengchao Shang, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Hao Yang
Keywords: Chinese-English language pair, Constrained Track, Discourse-Level Literary Translation, report outlines, Chinese-English language
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, wmt24


Abstract:This report outlines our approach for the WMT24 Discourse-Level Literary Translation Task, focusing on the Chinese-English language pair in the Constrained Track. Translating literary texts poses significant challenges due to the nuanced meanings, idiomatic expressions, and intricate narrative structures inherent in such works. To address these challenges, we leveraged the Chinese-Llama2 model, specifically enhanced for this task through a combination of Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT). Our methodology includes a novel Incremental Decoding framework, which ensures that each sentence is translated with consideration of its broader context, maintaining coherence and consistency throughout the text. This approach allows the model to capture long-range dependencies and stylistic elements, producing translations that faithfully preserve the original literary quality. Our experiments demonstrate significant improvements in both sentence-level and document-level BLEU scores, underscoring the effectiveness of our proposed framework in addressing the complexities of document-level literary translation.

[AI-93] Source-Free Domain Adaptation for YOLO Object Detection ECCV2024

Link: https://arxiv.org/abs/2409.16538
Authors: Simon Varailhon, Masih Aminbeidokhti, Marco Pedersoli, Eric Granger
Keywords: efficiency reasons, object detection, privacy and efficiency, Source-free domain adaptation, proposed SFDA method
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ECCV 2024: European Conference on Computer Vision - Workshop on Out-of-Distribution Generalization in Computer Vision Foundation Models, Milan Italy

Abstract:Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new target domain without using any source domain data for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector that is known to have high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors known for their fast baselines and practical applications. Our proposed SFDA method - Source-Free YOLO (SF-YOLO) - relies on a teacher-student framework in which the student receives images with a learned, target domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline of accuracy due to noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize the training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, even sometimes outperforming methods that use source data for adaptation.
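
The mean-teacher setup mentioned in the abstract is commonly implemented by keeping the teacher's weights as an exponential moving average (EMA) of the student's. The sketch below shows only that generic update, assuming weights stored as a plain dictionary of floats. The function name `ema_update` and the `decay` value are illustrative assumptions, not details from SF-YOLO, and the paper's learned augmentation and teacher-to-student communication mechanism are not modeled.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` map parameter names to floats (hypothetical layout).
    A decay close to 1.0 makes the teacher change slowly, which is what
    damps noisy pseudo-labels in a typical mean-teacher setup.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: one update step with an exaggerated decay of 0.9.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

In a real detector the same update would run over tensors after each student optimization step; the dictionary of floats here only keeps the example dependency-free.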

[AI-94] Graph Pruning Based Spatial and Temporal Graph Convolutional Network with Transfer Learning for Traffic Prediction

Link: https://arxiv.org/abs/2409.16532
Authors: Zihao Jing
Keywords: increasingly critical concern, Graph Convolutional Network, Convolutional Network, Recurrent Neural Network, Spatial-temporal Convolutional Network
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, accepted by ICIAAI2023, withdrawn from proceedings

Abstract:With the process of urbanization and the rapid growth of population, the issue of traffic congestion has become an increasingly critical concern. Intelligent transportation systems heavily rely on real-time and precise prediction algorithms to address this problem. While Recurrent Neural Network (RNN) and Graph Convolutional Network (GCN) methods in deep learning have demonstrated high accuracy in predicting road conditions when sufficient data is available, forecasting in road networks with limited data remains a challenging task. This study proposes a novel Spatial-temporal Convolutional Network (TL-GPSTGN) based on a graph pruning and transfer learning framework to tackle this issue. Firstly, the essential structure and information of the graph are extracted by analyzing the correlation and information entropy of the road network structure and feature data. By utilizing graph pruning techniques, the adjacency matrix of the graph and the input feature data are processed, resulting in a significant improvement in the model’s migration performance. Subsequently, the well-characterized data are fed into the spatial-temporal graph convolutional network to capture the spatial-temporal relationships and make predictions regarding the road conditions. Furthermore, this study conducts comprehensive testing and validation of the TL-GPSTGN method on real datasets, comparing its prediction performance against other commonly used models under identical conditions. The results demonstrate the exceptional predictive accuracy of TL-GPSTGN on a single dataset, as well as its robust migration performance across different datasets.

[AI-95] SynChart: Synthesizing Charts from Language Models

Link: https://arxiv.org/abs/2409.16517
Authors: Mengchen Liu,Qixiu Li,Dongdong Chen,Dong Chen,Jianmin Bao,Yunsheng Li
Keywords: gained significant popularity, generating pseudo labels, significant popularity, generating pseudo, pseudo labels
Subjects: Artificial Intelligence (cs.AI)

Abstract:With the release of GPT-4V(O), its use in generating pseudo labels for multi-modality tasks has gained significant popularity. However, it is still a secret how to build such advanced models from its base large language models (LLMs). This work explores the potential of using LLMs alone for data generation and develops competitive multi-modality models focusing on chart understanding. We construct a large-scale chart dataset, SynChart, which contains approximately 4 million diverse chart images with over 75 million dense annotations, including data tables, code, descriptions, and question-answer sets. We trained a 4.2B chart-expert model using this dataset and achieved near-GPT-4O performance on the ChartQA task, surpassing GPT-4V.

[AI-96] GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization

Link: https://arxiv.org/abs/2409.16502
Authors: Gennady Sidorov,Malik Mohrat,Ksenia Lebedeva,Ruslan Rakhimov,Sergey Kolyubin
Keywords: extensive optimization requirements, high memory consumption, localization approaches exist, visual localization approaches, approaches exist
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project website at this https URL

Abstract:Although various visual localization approaches exist, such as scene coordinate and pose regression, these methods often struggle with high memory consumption or extensive optimization requirements. To address these challenges, we utilize recent advancements in novel view synthesis, particularly 3D Gaussian Splatting (3DGS), to enhance localization. 3DGS allows for the compact encoding of both 3D geometry and scene appearance with its spatial features. Our method leverages the dense description maps produced by XFeat’s lightweight keypoint detection and description model. We propose distilling these dense keypoint descriptors into 3DGS to improve the model’s spatial understanding, leading to more accurate camera pose predictions through 2D-3D correspondences. After estimating an initial pose, we refine it using a photometric warping loss. Benchmarking on popular indoor and outdoor datasets shows that our approach surpasses state-of-the-art Neural Render Pose (NRP) methods, including NeRFMatch and PNeRFLoc.

[AI-97] Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval CIKM2024

Link: https://arxiv.org/abs/2409.16497
Authors: Qiuhai Zeng,Zimeng Qiu,Dae Yon Hwang,Xin He,William M. Campbell
Keywords: systems are commonly, synthetic queries, Dense retrieval systems, LLM, pre-trained LLM
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at DCAI24 workshop@CIKM2024

Abstract:Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique via instruction-tuning pre-trained encoder-decoder large language models (LLMs) under the dual-encoder retrieval framework. We demonstrate that the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruct-tuned LLM, founded on the Rao-Blackwell theorem. Furthermore, we effectively align the query and corpus text representation with self-instructed-tuning. Specifically, we first prompt an open-box pre-trained LLM to follow defined instructions (i.e. question generation and keyword summarization) to generate synthetic queries. Next, we fine-tune the pre-trained LLM with defined instructions and the generated queries that passed quality check. Finally, we generate synthetic queries with the instruction-tuned LLM for each corpus and represent each corpus by weighted averaging of the synthetic query and original corpus embeddings. We evaluate our proposed method under low-resource settings on three English and one German retrieval datasets measuring NDCG@10, MRR@100, Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, increasing open-box FLAN-T5 model variations by [3.34%, 3.50%] in absolute terms and exceeding three competitive dense retrievers (i.e. mDPR, T-Systems, mBART-Large), with models at least 38% smaller, by 1.96%, 4.62%, 9.52% absolute on NDCG@10.
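
The final step of the abstract, representing each corpus entry by a weighted average of its own embedding and the embeddings of its synthetic queries, can be sketched as below. The abstract does not give the weighting scheme, so the single `doc_weight` parameter, the uniform average over queries, and the function name `augment_document_embedding` are all assumptions made for illustration.

```python
def augment_document_embedding(doc_vec, query_vecs, doc_weight=0.5):
    """Blend a document embedding with the mean of its synthetic-query embeddings.

    doc_vec    : list of floats, embedding of the original document
    query_vecs : list of embeddings (same dimension) of synthetic queries
    doc_weight : assumed mixing weight for the original document embedding
    """
    if not query_vecs:
        # No synthetic queries survived: fall back to the document embedding.
        return list(doc_vec)
    dim = len(doc_vec)
    mean_query = [
        sum(q[i] for q in query_vecs) / len(query_vecs) for i in range(dim)
    ]
    return [
        doc_weight * d + (1.0 - doc_weight) * m
        for d, m in zip(doc_vec, mean_query)
    ]


# Toy usage with 2-dimensional embeddings.
doc = [1.0, 0.0]
queries = [[0.0, 1.0], [0.0, 3.0]]
print(augment_document_embedding(doc, queries, doc_weight=0.5))
```

At retrieval time, the blended vector would stand in for the document in the dual-encoder index, so a query is compared against a representation that already reflects likely questions about the document.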

[AI-98] Algorithmic Drift: A Simulation Framework to Study the Effects of Recommender Systems on User Preferences

Link: https://arxiv.org/abs/2409.16478
Authors: Erica Coppolillo,Simone Mungari,Ettore Ritacco,Francesco Fabbri,Marco Minici,Francesco Bonchi,Giuseppe Manco
Keywords: e-commerce websites adopt, websites adopt Recommender, Digital platforms, adopt Recommender Systems, media and e-commerce
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:Digital platforms such as social media and e-commerce websites adopt Recommender Systems to provide value to the user. However, the social consequences deriving from their adoption are still unclear. Many scholars argue that recommenders may lead to detrimental effects, such as bias-amplification deriving from the feedback loop between algorithmic suggestions and users’ choices. Nonetheless, the extent to which recommenders influence changes in users’ leanings remains uncertain. In this context, it is important to provide a controlled environment for evaluating the recommendation algorithm before deployment. To address this, we propose a stochastic simulation framework that mimics user-recommender system interactions in a long-term scenario. In particular, we simulate the user choices by formalizing a user model, which comprises behavioral aspects, such as the user’s resistance towards the recommendation algorithm and their inertia in relying on the received suggestions. Additionally, we introduce two novel metrics for quantifying the algorithm’s impact on user preferences, specifically in terms of drift over time. We conduct an extensive evaluation on multiple synthetic datasets, aiming at testing the robustness of our framework when considering different scenarios and hyper-parameter settings. The experimental results show that the proposed methodology is effective in detecting and quantifying the drift in users’ preferences by means of the simulation. All the code and data used to perform the experiments are publicly available.

[AI-99] Artificial Intelligence for Secured Information Systems in Smart Cities: Collaborative IoT Computing with Deep Reinforcement Learning and Blockchain

Link: https://arxiv.org/abs/2409.16444
Authors: Amin Zakaie Far,Mohammad Zakaie Far,Sonia Gharibzadeh,Shiva Zangeneh,Leila Amini,Morteza Rahimi
Keywords: Internet of Things, raised critical challenges, specifically in infrastructures, DRL, accelerated expansion
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:The accelerated expansion of the Internet of Things (IoT) has raised critical challenges associated with privacy, security, and data integrity, specifically in infrastructures such as smart cities or smart manufacturing. Blockchain technology provides immutable, scalable, and decentralized solutions to address these challenges, and integrating deep reinforcement learning (DRL) into the IoT environment offers enhanced adaptability and decision-making. This paper investigates the integration of blockchain and DRL to optimize mobile transmission and secure data exchange in IoT-assisted smart cities. Through the clustering and categorization of IoT application systems, the combination of DRL and blockchain is shown to enhance the performance of IoT networks by maintaining privacy and security. Based on the review of papers published between 2015 and 2024, we have classified the presented approaches and offered practical taxonomies, which provide researchers with critical perspectives and highlight potential areas for future exploration and research. Our investigation shows how combining blockchain’s decentralized framework with DRL can address privacy and security issues, improve mobile transmission efficiency, and guarantee robust, privacy-preserving IoT systems. Additionally, we explore blockchain integration for DRL and outline the notable applications of DRL technology. By addressing the challenges of machine learning and blockchain integration, this study proposes novel perspectives for researchers and serves as a foundational exploration from an interdisciplinary standpoint.

[AI-100] Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Transfer Learning (PETL) in Visual Recognition

Link: https://arxiv.org/abs/2409.16434
Authors: Zheda Mai,Ping Zhang,Cheng-Hao Tu,Hong-You Chen,Li Zhang,Wei-Lun Chao
Keywords: Parameter-efficient transfer learning, attracted significant attention, Parameter-efficient transfer, PETL, transfer learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:Parameter-efficient transfer learning (PETL) has attracted significant attention lately, due to the increasing size of pre-trained models and the need to fine-tune (FT) them for superior downstream performance. This community-wide enthusiasm has sparked a plethora of new methods. Nevertheless, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like when to apply PETL and which method to use largely unanswered. In this paper, we conduct a unifying empirical study of representative PETL methods in the context of Vision Transformers. We systematically tune their hyper-parameters to fairly compare their accuracy on downstream tasks. Our study not only offers a valuable user guide but also unveils several new insights. First, if tuned carefully, different PETL methods can obtain quite similar accuracy in the low-shot benchmark VTAB-1K. This includes simple methods like FT the bias terms that were reported inferior. Second, though with similar accuracy, we find that PETL methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementariness) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PETL is also useful in many-shot regimes – it achieves comparable and sometimes better accuracy than full FT, using much fewer learnable parameters. Last but not least, we investigate PETL’s ability to preserve a pre-trained model’s robustness to distribution shifts (e.g., a CLIP backbone). Perhaps not surprisingly, PETL methods outperform full FT alone. However, with weight-space ensembles, the fully FT model can achieve a better balance between downstream and out-of-distribution performance, suggesting a future research direction for PETL.

[AI-101] A Comprehensive Survey of Bias in LLMs: Current Landscape and Future Directions

Link: https://arxiv.org/abs/2409.16430
Authors: Rajesh Ranjan,Shailja Gupta,Surya Narayan Singh
Keywords: Large Language Models, natural language processing, Large Language, unprecedented text generation, providing unprecedented text
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 2 Tables, 1 Figure

Abstract:Large Language Models (LLMs) have revolutionized various applications in natural language processing (NLP) by providing unprecedented text generation, translation, and comprehension capabilities. However, their widespread deployment has brought to light significant concerns regarding biases embedded within these models. This paper presents a comprehensive survey of biases in LLMs, aiming to provide an extensive review of the types, sources, impacts, and mitigation strategies related to these biases. We systematically categorize biases into several dimensions. Our survey synthesizes current research findings and discusses the implications of biases in real-world applications. Additionally, we critically assess existing bias mitigation techniques and propose future research directions to enhance fairness and equity in LLMs. This survey serves as a foundational resource for researchers, practitioners, and policymakers concerned with addressing and understanding biases in LLMs.

[AI-102] Leveraging Local Structure for Improving Model Explanations: An Information Propagation Approach

Link: https://arxiv.org/abs/2409.16429
Authors: Ruo Yang,Binghui Wang,Mustafa Bilgic
Keywords: deep neural network, Numerous explanation methods, Numerous explanation, neural network, recently developed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Numerous explanation methods have been recently developed to interpret the decisions made by deep neural network (DNN) models. For image classifiers, these methods typically provide an attribution score to each pixel in the image to quantify its contribution to the prediction. However, most of these explanation methods appropriate attribution scores to pixels independently, even though both humans and DNNs make decisions by analyzing a set of closely related pixels simultaneously. Hence, the attribution score of a pixel should be evaluated jointly by considering itself and its structurally-similar pixels. We propose a method called IProp, which models each pixel’s individual attribution score as a source of explanatory information and explains the image prediction through the dynamic propagation of information across all pixels. To formulate the information propagation, IProp adopts the Markov Reward Process, which guarantees convergence, and the final status indicates the desired pixels’ attribution scores. Furthermore, IProp is compatible with any existing attribution-based explanation method. Extensive experiments on various explanation methods and DNN models verify that IProp significantly improves them on a variety of interpretability metrics.
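
The abstract says IProp formulates information propagation as a Markov Reward Process (MRP) whose final state gives the attribution scores. The sketch below shows only generic MRP evaluation, iterating v = r + γ·P·v until it stops changing, with per-pixel scores as rewards `r` and a row-stochastic matrix `P` linking related pixels. How IProp actually builds `P`, chooses `γ`, or reads off the final scores is not stated in the abstract, so every name and value here is an assumption.

```python
def propagate(rewards, transition, gamma=0.5, tol=1e-10, max_iter=10_000):
    """Iterate v = r + gamma * P v to a fixed point.

    rewards    : list of initial per-pixel attribution scores (r)
    transition : row-stochastic matrix as a list of rows (P); an entry
                 transition[i][j] is the assumed share of information
                 pixel i draws from pixel j
    gamma      : discount in [0, 1); values below 1 guarantee convergence
    """
    values = list(rewards)
    for _ in range(max_iter):
        new_values = [
            r + gamma * sum(p * v for p, v in zip(row, values))
            for r, row in zip(rewards, transition)
        ]
        if max(abs(a - b) for a, b in zip(new_values, values)) < tol:
            return new_values
        values = new_values
    return values


# Toy usage: two "pixels" that share information equally.
scores = propagate([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], gamma=0.5)
print(scores)
```

Because the update is a contraction for `gamma < 1`, the iteration converges regardless of the starting scores, which matches the convergence guarantee the abstract attributes to the MRP formulation.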

[AI-103] HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Link: https://arxiv.org/abs/2409.16427
Authors: Xuhui Zhou,Hyunwoo Kim,Faeze Brahman,Liwei Jiang,Hao Zhu,Ximing Lu,Frank Xu,Bill Yuchen Lin,Yejin Choi,Niloofar Mireshghallah,Ronan Le Bras,Maarten Sap
Keywords: increased interactional safety, leading to increased, agents, increasingly autonomous, increased interactional
Subjects: Artificial Intelligence (cs.AI)
Comments: Both the second and third authors contributed equally

Abstract:AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi-turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients’ profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi-dimensional evaluation framework that uses metrics covering operational, content-related, societal, and legal risks. Through running 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50% of cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.

[AI-104] Lessons for Editors of AI Incidents from the AI Incident Database

Link: https://arxiv.org/abs/2409.16425
Authors: Kevin Paeth,Daniel Atherton,Nikiforos Pittaras,Heather Frase,Sean McGregor
Keywords: increasingly deployed, artificial intelligence, incidents, events to individuals, Incident
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 0 figures

Abstract:As artificial intelligence (AI) systems become increasingly deployed across the world, they are also increasingly implicated in AI incidents - harm events to individuals and society. As a result, industry, civil society, and governments worldwide are developing best practices and regulations for monitoring and analyzing AI incidents. The AI Incident Database (AIID) is a project that catalogs AI incidents and supports further research by providing a platform to classify incidents for different operational and research-oriented goals. This study reviews the AIID’s dataset of 750+ AI incidents and two independent taxonomies applied to these incidents to identify common challenges to indexing and analyzing AI incidents. We find that certain patterns of AI incidents present structural ambiguities that challenge incident databasing and explore how epistemic uncertainty in AI incident reporting is unavoidable. We therefore report mitigations to make incident processes more robust to uncertainty related to cause, extent of harm, severity, or technical details of implicated systems. With these findings, we discuss how to develop future AI incident reporting practices.

[AI-105] Task-oriented Prompt Enhancement via Script Generation

Link: https://arxiv.org/abs/2409.16418
Authors: Chung-Yu Wang,Alireza DaghighFarsoodeh,Hung Viet Pham
Keywords: Large Language Models, Large Language, Language Models, leveraging advanced reasoning, demonstrated remarkable abilities
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 17 pages + reference

Abstract:Large Language Models (LLMs) have demonstrated remarkable abilities across various tasks, leveraging advanced reasoning. Yet, they struggle with task-oriented prompts due to a lack of specific prior knowledge of the task answers. The current state-of-the-art approach, PAL, utilizes code generation to address this issue. However, PAL depends on manually crafted prompt templates and examples while still producing inaccurate results. In this work, we present TITAN, a novel strategy designed to enhance LLMs’ performance on task-oriented prompts. TITAN achieves this by generating scripts using a universal approach and zero-shot learning. Unlike existing methods, TITAN eliminates the need for detailed task-specific instructions and extensive manual efforts. TITAN enhances LLMs’ performance on various tasks by utilizing their analytical and code-generation capabilities in a streamlined process. TITAN employs two key techniques: (1) step-back prompting to extract the task’s input specifications and (2) chain-of-thought prompting to identify required procedural steps. This information is used to improve the LLMs’ code-generation process. TITAN further refines the generated script through post-processing, and the script is executed to retrieve the final answer. Our comprehensive evaluation demonstrates TITAN’s effectiveness in a diverse set of tasks. On average, TITAN outperforms the state-of-the-art zero-shot approach by 7.6% and 3.9% when paired with GPT-3.5 and GPT-4. Overall, without human annotation, TITAN achieves state-of-the-art performance in 8 out of 11 cases while losing to few-shot approaches (which needed human intervention) by small margins in the other three. This work represents a significant advancement in addressing task-oriented prompts, offering a novel solution for effectively utilizing LLMs in everyday life tasks.

[AI-106] Selection of Prompt Engineering Techniques for Code Generation through Predicting Code Complexity

Link: https://arxiv.org/abs/2409.16416
Authors: Chung-Yu Wang,Alireza DaghighFarsoodeh,Hung Viet Pham
Keywords: Large Language Models, Large Language, demonstrated impressive performance, software engineering tasks, Language Models
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 18 pages + reference

Abstract:Large Language Models (LLMs) have demonstrated impressive performance in software engineering tasks. However, improving their accuracy in generating correct and reliable code remains challenging. Numerous prompt engineering techniques (PETs) have been developed to address this, but no single approach is universally optimal. Selecting the right PET for each query is difficult for two primary reasons: (1) interactive prompting techniques may not consistently deliver the expected benefits, especially for simpler queries, and (2) current automated prompt engineering methods lack adaptability and fail to fully utilize multi-stage responses. To overcome these challenges, we propose PET-Select, a PET-agnostic selection model that uses code complexity as a proxy to classify queries and select the most appropriate PET. By incorporating contrastive learning, PET-Select effectively distinguishes between simple and complex problems, allowing it to choose PETs that are best suited for each query’s complexity level. Our evaluations on the MBPP and HumanEval benchmarks using GPT-3.5 Turbo and GPT-4o show up to a 1.9% improvement in pass@1 accuracy, along with a 74.8% reduction in token usage. Additionally, we provide both quantitative and qualitative results to demonstrate how PET-Select effectively selects the most appropriate techniques for each code generation query, further showcasing its efficiency in optimizing PET selection.

[AI-107] Modern Hopfield Networks meet Encoded Neural Representations – Addressing Practical Considerations NEURIPS

Link: https://arxiv.org/abs/2409.16408
Authors: Satyananda Kashyap,Niharika S. D’Souza,Luyao Shi,Ken C. L. Wong,Hongzhi Wang,Tanveer Syeda-Mahmood
Keywords: Modern Hopfield Networks, storage faces challenges, Content-addressable memories, human declarative memory, Modern Hopfield
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 8 figures, workshop submission to Neurips

Abstract:Content-addressable memories such as Modern Hopfield Networks (MHN) have been studied as mathematical models of auto-association and storage/retrieval in the human declarative memory, yet their practical use for large-scale content storage faces challenges. Chief among them is the occurrence of meta-stable states, particularly when handling large amounts of high dimensional content. This paper introduces Hopfield Encoding Networks (HEN), a framework that integrates encoded neural representations into MHNs to improve pattern separability and reduce meta-stable states. We show that HEN can also be used for retrieval in the context of hetero association of images with natural language queries, thus removing the limitation of requiring access to partial content in the same domain. Experimental results demonstrate a substantial reduction in meta-stable states and increased storage capacity while still enabling perfect recall of a significantly larger number of inputs, advancing the practical utility of associative memory networks for real-world tasks.

[AI-108] Design and Evaluation of a CDSS for Drug Allergy Management Using LLMs and Pharmaceutical Data Integration

Link: https://arxiv.org/abs/2409.16395
Authors: Gabriele De Vito,Filomena Ferrucci,Athanasios Angelakis
Keywords: Medication errors significantly, substantial economic burdens, errors significantly threaten, threaten patient safety, adverse drug events
Subjects: Artificial Intelligence (cs.AI)

Abstract:Medication errors significantly threaten patient safety, leading to adverse drug events and substantial economic burdens on healthcare systems. Clinical Decision Support Systems (CDSSs) aimed at mitigating these errors often face limitations, including reliance on static databases and rule-based algorithms, which can result in high false alert rates and alert fatigue among clinicians. This paper introduces HELIOT, an innovative CDSS for drug allergy management, integrating Large Language Models (LLMs) with a comprehensive pharmaceutical data repository. HELIOT leverages advanced natural language processing capabilities to interpret complex medical texts and synthesize unstructured data, overcoming the limitations of traditional CDSSs. An empirical evaluation using a synthetic patient dataset and expert-verified ground truth demonstrates HELIOT’s high accuracy, precision, recall, and F1 score, uniformly reaching 100% across multiple experimental runs. The results underscore HELIOT’s potential to enhance decision support in clinical settings, offering a scalable, efficient, and reliable solution for managing drug allergies.

[AI-109] Rao-Blackwellized POMDP Planning

Link: https://arxiv.org/abs/2409.16392
Authors: Jiho Lee,Nisar R. Ahmed,Kyle H. Wray,Zachary N. Sunberg
Keywords: Partially Observable Markov, Markov Decision Processes, Observable Markov Decision, Partially Observable, Decision Processes
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Partially Observable Markov Decision Processes (POMDPs) provide a structured framework for decision-making under uncertainty, but their application requires efficient belief updates. Sequential Importance Resampling Particle Filters (SIRPF), also known as Bootstrap Particle Filters, are commonly used as belief updaters in large approximate POMDP solvers, but they face challenges such as particle deprivation and high computational costs as the system’s state dimension grows. To address these issues, this study introduces Rao-Blackwellized POMDP (RB-POMDP) approximate solvers and outlines generic methods to apply Rao-Blackwellization in both belief updates and online planning. We compare the performance of SIRPF and Rao-Blackwellized Particle Filters (RBPF) in a simulated localization problem where an agent navigates toward a target in a GPS-denied environment using POMCPOW and RB-POMCPOW planners. Our results not only confirm that RBPFs maintain accurate belief approximations over time with fewer particles, but, more surprisingly, RBPFs combined with quadrature-based integration improve planning quality significantly compared to SIRPF-based planning under the same computational limits.
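
For readers unfamiliar with the baseline named in the abstract, the sketch below shows one step of a Sequential Importance Resampling (bootstrap) particle filter on a one-dimensional state. It illustrates only the SIRPF baseline, not the paper's Rao-Blackwellized variant. The additive motion model, the Gaussian noise levels, and the function name `sir_step` are assumptions chosen for the toy example.

```python
import math
import random


def sir_step(particles, control, observation, rng, motion_std=0.5, obs_std=1.0):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles   : list of floats, samples of the 1-D state
    control     : float, assumed additive motion command
    observation : float, assumed noisy direct measurement of the state
    rng         : random.Random instance, passed in for reproducibility
    """
    # 1. Predict: push every particle through the (assumed) motion model.
    predicted = [p + control + rng.gauss(0.0, motion_std) for p in particles]

    # 2. Weight: Gaussian likelihood of the observation given each particle.
    weights = [
        math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in predicted
    ]
    total = sum(weights)
    if total == 0.0:
        # Particle deprivation: no particle explains the observation.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # 3. Resample with replacement in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


# Toy usage: a stationary target at 3.0 observed repeatedly.
rng = random.Random(0)
belief = [rng.uniform(-10.0, 10.0) for _ in range(500)]
for _ in range(20):
    belief = sir_step(belief, control=0.0, observation=3.0, rng=rng)
print(sum(belief) / len(belief))
```

The fallback branch in step 2 is where particle deprivation shows up in practice, which is one of the failure modes the abstract says motivates replacing part of the sampled state with an analytic update.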

[AI-110] Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling

Link: https://arxiv.org/abs/2409.16376
Authors: Ville Heilala,Roberto Araya,Raija Hämäläinen
Keywords: reshape education, multimodal, education, Generative, Abstract
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Generative artificial intelligence (GenAI) can reshape education and learning. While large language models (LLMs) like ChatGPT dominate current educational research, multimodal capabilities, such as text-to-speech and text-to-image, are less explored. This study uses topic modeling to map the research landscape of multimodal and generative AI in education. An extensive literature search using this http URL yielded 4175 articles. Employing a topic modeling approach, latent topics were extracted, resulting in 38 interpretable topics organized into 14 thematic areas. Findings indicate a predominant focus on text-to-text models in educational contexts, with other modalities underexplored, overlooking the broader potential of multimodal approaches. The results suggest a research gap, stressing the importance of more balanced attention across different AI modalities and educational levels. In summary, this research provides an overview of current trends in generative AI for education, underlining opportunities for future exploration of multimodal technologies to fully realize the transformative potential of artificial intelligence in education.

[AI-111] Exploring the traditional NMT model and Large Language Model for chat translation

Link: https://arxiv.org/abs/2409.16331
Authors: Jinlong Yang, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Yuhao Xie, Yuanchang Luo, Jiawei Zheng, Bin Wei, Hao Yang
Keywords: Translation Services Center, Huawei Translation Services, Services Center, translation shared task, Minimum Bayesian Risk
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 Tables, WMT24

Abstract: This paper describes the submissions of Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task on English↔German (en-de) in both directions. The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayesian Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The paper also discusses the challenges and potential avenues for further research in the field of chat translation.
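
To make the MBR decoding mentioned in this abstract concrete, here is a minimal Python sketch. The abstract does not say which utility metric or candidate pool the authors used, so the unigram-F1 utility and the names `unigram_f1` and `mbr_select` are assumptions chosen only for illustration. The idea shown is the general one: among several candidate translations, pick the one that agrees most, on average, with the others.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): unigram F1 between two whitespace-tokenized strings."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(unigram_f1(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical candidate translations sampled from a model.
candidates = [
    "see you tomorrow at noon",
    "see you tomorrow at midday",
    "I will see you tomorrow at noon",
    "goodbye",
]
best = mbr_select(candidates)
print(best)
```

In practice a learned or n-gram based translation metric would likely replace the toy utility; the selection logic stays the same.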

[AI-112] Automated Spatio-Temporal Weather Modeling for Load Forecasting

Link: https://arxiv.org/abs/2409.16326
Authors: Julie Keisler (CRIStAL, EDF R&D OSIRIS, EDF R&D), Margaux Bregere (EDF R&D, EDF R&D OSIRIS, LPSM (UMR_8001))
Keywords: prohibitive cost, balance between generation, difficult to store, Electricity, load
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Electricity is difficult to store, except at prohibitive cost, and therefore the balance between generation and load must be maintained at all times. Electricity is traditionally managed by anticipating demand and intermittent production (wind, solar) and matching flexible production (hydro, nuclear, coal and gas). Accurate forecasting of electricity load and renewable production is therefore essential to ensure grid performance and stability. Both are highly dependent on meteorological variables (temperature, wind, sunshine). These dependencies are complex and difficult to model. On the one hand, spatial variations do not have a uniform impact because population, industry, and wind and solar farms are not evenly distributed across the territory. On the other hand, temporal variations can have delayed effects on load (due to the thermal inertia of buildings). With access to observations from different weather stations and simulated data from meteorological models, we believe that both phenomena can be modeled together. In today’s state-of-the-art load forecasting models, the spatio-temporal modeling of the weather is fixed. In this work, we aim to take advantage of the automated representation and spatio-temporal feature extraction capabilities of deep neural networks to improve spatio-temporal weather modeling for load forecasting. We compare our deep learning-based methodology with the state-of-the-art on French national load. This methodology could also be fully adapted to forecasting renewable energy production.

[AI-113] WeatherFormer: Empowering Global Numerical Weather Forecasting with Space-Time Transformer

Link: https://arxiv.org/abs/2409.16321
Authors: Junchao Gong, Tao Han, Kang Chen, Lei Bai
Keywords: Numerical Weather Prediction, huge computing cluster, Numerical Weather, Weather Prediction, Traditional NWP system
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract: Numerical Weather Prediction (NWP) systems are infrastructure with considerable impact on modern society. Traditional NWP systems, however, produce forecasts by solving complex partial differential equations on huge computing clusters, resulting in tons of carbon emissions. Exploring efficient and eco-friendly solutions for NWP attracts interest from the Artificial Intelligence (AI) and earth science communities. To narrow the performance gap between AI-based methods and physics-based predictors, this work proposes a new transformer-based NWP framework, termed WeatherFormer, to model complex spatio-temporal atmospheric dynamics and strengthen the capability of data-driven NWP. WeatherFormer introduces space-time factorized transformer blocks to decrease parameter count and memory consumption, in which a Position-aware Adaptive Fourier Neural Operator (PAFNO) is proposed for location-sensitive token mixing. In addition, two data augmentation strategies are used to boost performance and decrease training consumption. Extensive experiments on the WeatherBench dataset show that WeatherFormer achieves superior performance over existing deep learning methods and further approaches the most advanced physical model.

[AI-114] DeepScore: A Comprehensive Approach to Measuring Quality in AI-Generated Clinical Documentation

Link: https://arxiv.org/abs/2409.16307
Authors: Jon Oleson
Keywords: rapidly adopting generative, significant time savings, Medical practitioners, leading to significant, reduced stress
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 9 pages, 5 figures, 6 tables

Abstract:Medical practitioners are rapidly adopting generative AI solutions for clinical documentation, leading to significant time savings and reduced stress. However, evaluating the quality of AI-generated documentation is a complex and ongoing challenge. This paper presents an overview of DeepScribe’s methodologies for assessing and managing note quality, focusing on various metrics and the composite “DeepScore”, an overall index of quality and accuracy. These methodologies aim to enhance the quality of patient care documentation through accountability and continuous improvement.

[AI-115] HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale

Link: https://arxiv.org/abs/2409.16299
Authors: Huy Nhat Phan, Phong X. Nguyen, Nghi D. Q. Bui
Keywords: Large Language Models, demonstrating remarkable capabilities, revolutionized software engineering, Large Language, Language Models
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract: Large Language Models (LLMs) have revolutionized software engineering (SE), demonstrating remarkable capabilities in various coding tasks. While recent efforts have produced autonomous software agents based on LLMs for end-to-end development tasks, these systems are typically designed for specific SE tasks. We introduce HyperAgent, a novel generalist multi-agent system designed to address a wide spectrum of SE tasks across different programming languages by mimicking human developers’ workflows. Comprising four specialized agents (Planner, Navigator, Code Editor, and Executor), HyperAgent manages the full lifecycle of SE tasks, from initial conception to final verification. Through extensive evaluations, HyperAgent achieves state-of-the-art performance across diverse SE tasks: it attains a 25.01% success rate on SWE-Bench-Lite and 31.40% on SWE-Bench-Verified for GitHub issue resolution, surpassing existing methods. Furthermore, HyperAgent demonstrates SOTA performance in repository-level code generation (RepoExec), and in fault localization and program repair (Defects4J), often outperforming specialized systems. This work represents a significant advancement towards versatile, autonomous agents capable of handling complex, multi-step SE tasks across various domains and languages, potentially transforming AI-assisted software development practices.

[AI-116] Explaining Human Comparisons using Alignment-Importance Heatmaps

Link: https://arxiv.org/abs/2409.16292
Authors: Nhut Truong, Dario Pesenti, Uri Hasson
Keywords: computational explainability approach, Deep Neural Network, Alignment Importance Score, Alignment Importance, AIS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: We present a computational explainability approach for human comparison tasks, using Alignment Importance Score (AIS) heatmaps derived from deep-vision models. The AIS reflects a feature-map’s unique contribution to the alignment between a Deep Neural Network’s (DNN) representational geometry and that of humans. We first validate the AIS by showing that prediction of out-of-sample human similarity judgments is improved when constructing representations using only higher-scoring AIS feature maps identified from a training set. We then compute image-specific heatmaps that visually indicate the areas that correspond to feature-maps with higher AIS scores. These maps provide an intuitive explanation of which image areas are more important when an image is compared to other images in a cohort. We observe a correspondence between these heatmaps and saliency maps produced by a gaze-prediction model. However, in some cases, meaningful differences emerge, as the dimensions relevant for comparison are not necessarily the most visually salient. To conclude, Alignment Importance improves prediction of human similarity judgments from DNN embeddings, and provides interpretable insights into the relevant information in image space.

[AI-117] Beyond Following: Mixing Active Initiative into Computational Creativity

Link: https://arxiv.org/abs/2409.16291
Authors: Zhiyu Lin, Upol Ehsan, Rohan Agarwal, Samihan Dani, Vidushi Vashishth, Mark Riedl
Keywords: Procedural Content Generation, Generative Artificial Intelligence, Artificial Intelligence, Content Generation, Procedural Content
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Abstract: Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process. Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, have shown promise. Still, the potential of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the influence of the adaptive ability of an active and learning AI agent on creators’ expectancy of creative responsibilities in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop a multi-armed-bandit agent that learns from the human creator, updates its collaborative decision-making belief, and switches between its capabilities during an MI-CC experience. In a human-subject study with 39 participants, our system’s learning capabilities were well recognized compared to the non-learning ablation, corresponding to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and deepened understanding among all participants.
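
For readers unfamiliar with multi-armed bandits, the following Python sketch shows one common variant, epsilon-greedy. The abstract does not say which bandit algorithm, arms, or reward signal the authors used, so the class `EpsilonGreedyBandit`, the three "collaboration modes", and their satisfaction rates are all hypothetical. The sketch only illustrates how an agent can learn a user's preference from repeated feedback and then switch to the preferred behavior.

```python
import random


class EpsilonGreedyBandit:
    """Keeps a running mean reward per arm and mostly picks the best-looking arm."""

    def __init__(self, n_arms: int, epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self) -> int:
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore a random arm
        return max(range(len(self.values)), key=self.values.__getitem__)  # exploit

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical arms: three collaboration modes with made-up user satisfaction rates.
true_rates = [0.2, 0.5, 0.8]
agent = EpsilonGreedyBandit(n_arms=len(true_rates))
feedback_rng = random.Random(1)  # simulates the human's binary feedback

for _ in range(2000):
    arm = agent.select_arm()
    reward = 1.0 if feedback_rng.random() < true_rates[arm] else 0.0
    agent.update(arm, reward)

preferred_arm = max(range(len(agent.values)), key=agent.values.__getitem__)
print(preferred_arm, [round(v, 2) for v in agent.values])
```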

[AI-118] Identify As A Human Does: A Pathfinder of Next-Generation Anti-Cheat Framework for First-Person Shooter Games

Link: https://arxiv.org/abs/2409.14830
Authors: Jiayi Zhang, Chenxin Sun, Yue Gu, Qingyu Zhang, Jiayi Lin, Xiaojiang Du, Chenxiong Qian
Keywords: experienced substantial growth, online games poses, gaming experience, gaming industry, substantial growth
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract: The gaming industry has experienced substantial growth, but cheating in online games poses a significant threat to the integrity of the gaming experience. Cheating, particularly in first-person shooter (FPS) games, can lead to substantial losses for the game industry. Existing anti-cheat solutions have limitations: client-side approaches face hardware constraints and security risks, server-side methods are unreliable, and both suffer from a lack of comprehensive real-world datasets. To address these limitations, the paper proposes HAWK, a server-side FPS anti-cheat framework for the popular game CS:GO. HAWK uses machine learning techniques to mimic human experts’ identification process, leverages novel multi-view features, and is equipped with a well-defined workflow. The authors evaluate HAWK on the first large, real-world datasets containing multiple cheat types and levels of cheating sophistication. HAWK exhibits promising efficiency and acceptable overheads, shorter ban times compared to the in-use anti-cheat, a significant reduction in manual labor, and the ability to capture cheaters who evaded official inspections.

[AI-119] Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement EMNLP2024

Link: https://arxiv.org/abs/2406.11176
Authors: Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li
Keywords: Large language model, Large language, exhibited exceptional performance, exhibited exceptional, step-level Process Refinement
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to EMNLP 2024 (Main Conference)

Abstract:Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of expert trajectory using step-level rewards. Such comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
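
The Monte Carlo estimate of step-level rewards described in this abstract can be sketched roughly as follows. This is not the authors' implementation: the toy 1-D task, the random rollout policy, and the names `mc_step_reward`, `step_fn`, `random_policy`, and `is_success` are assumptions made for illustration. The sketch scores an action by the fraction of rollouts that succeed after taking it, then orders an expert action and an explored action into a contrastive pair.

```python
import random


def mc_step_reward(state, action, step_fn, policy, is_success,
                   n_rollouts=500, horizon=10, seed=0):
    """Estimate the reward of taking `action` in `state` as the fraction of
    rollouts (continued with `policy`) that end in success."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_rollouts):
        s = step_fn(state, action)          # take the action being scored
        for _ in range(horizon):            # then roll out to a final outcome
            if is_success(s):
                break
            s = step_fn(s, policy(s, rng))
        successes += int(is_success(s))
    return successes / n_rollouts


# Toy task (hypothetical): start at position 0, succeed on reaching position 3.
def step_fn(s, a):
    return s + a


def random_policy(s, rng):
    return rng.choice([-1, 1])


def is_success(s):
    return s >= 3


expert_reward = mc_step_reward(0, +1, step_fn, random_policy, is_success)
explored_reward = mc_step_reward(0, -1, step_fn, random_policy, is_success)

# The higher-reward action becomes "chosen", the other "rejected".
chosen, rejected = (+1, -1) if expert_reward >= explored_reward else (-1, +1)
print(expert_reward, explored_reward, chosen, rejected)
```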

[AI-120] SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking

Link: https://arxiv.org/abs/2409.17087
Authors: Luigi Russo, Francesco Mauro, Alessandro Sebastianelli, Paolo Gamba, Silvia Liberata Ullo
Keywords: increasing droughts pose, pose significant challenges, droughts pose significant, water resource management, water
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Geoscience and Remote Sensing. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Climate change and increasing droughts pose significant challenges to water resource management around the world. These problems lead to severe water shortages that threaten ecosystems, agriculture, and human communities. To advance the fight against these challenges, we present a new dataset, SEN12-WATER, along with a benchmark using a novel end-to-end Deep Learning (DL) framework for proactive drought-related analysis. The dataset, identified as a spatiotemporal datacube, integrates SAR polarization, elevation, slope, and multispectral optical bands. Our DL framework enables the analysis and estimation of water losses over time in reservoirs of interest, revealing significant insights into water dynamics for drought analysis by examining temporal changes in physical quantities such as water volume. Our methodology takes advantage of the multitemporal and multimodal characteristics of the proposed dataset, enabling robust generalization and advancing understanding of drought, contributing to climate change resilience and sustainable water resource management. The proposed framework involves, among the several components, speckle noise removal from SAR data, a water body segmentation through a U-Net architecture, the time series analysis, and the predictive capability of a Time-Distributed-Convolutional Neural Network (TD-CNN). Results are validated through ground truth data acquired on-ground via dedicated sensors and (tailored) metrics, such as Precision, Recall, Intersection over Union, Mean Squared Error, Structural Similarity Index Measure and Peak Signal-to-Noise Ratio.

[AI-121] Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling

Link: https://arxiv.org/abs/2409.16937
Authors: Yuanchao Li, Zixing Zhang, Jing Han, Peter Bell, Catherine Lai
Keywords: extensive subjective assessment, requiring extensive subjective, cognitive state classification, subjective assessment, common challenge
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)

Abstract:The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.
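
The agreement rule in this abstract (pseudo-labels from the acoustic and linguistic views must match to count as high-confidence) can be sketched in a few lines of Python. The function name `split_by_agreement`, the utterance IDs, and the labels are hypothetical; how each view produces its labels (Frechet audio distance, LLM prompting) is outside this sketch.

```python
def split_by_agreement(acoustic: dict[str, str], linguistic: dict[str, str]):
    """Split unlabeled items into high-confidence (both views agree)
    and low-confidence (views disagree or one view is missing)."""
    high_conf: dict[str, str] = {}
    low_conf: list[str] = []
    for utt_id, label in acoustic.items():
        if linguistic.get(utt_id) == label:
            high_conf[utt_id] = label      # agreed pseudo-label is kept
        else:
            low_conf.append(utt_id)        # left for the iterative classifier
    return high_conf, low_conf


# Hypothetical pseudo-labels from the two views.
acoustic_labels = {"utt1": "happy", "utt2": "sad", "utt3": "angry"}
linguistic_labels = {"utt1": "happy", "utt2": "neutral", "utt3": "angry"}

high_conf, low_conf = split_by_agreement(acoustic_labels, linguistic_labels)
print(high_conf, low_conf)
```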

[AI-122] Cross-lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models

Link: https://arxiv.org/abs/2409.16920
Authors: Zhichen Han, Tianqi Geng, Hui Feng, Jiahong Yuan, Korin Richmond, Yuanchao Li
Keywords: Speech Emotion Recognition, Utilizing Self-Supervised Learning, Utilizing Self-Supervised, explored cross-lingual scenarios, Emotion Recognition
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD)

Abstract:Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background. Moreover, both humans and models exhibit distinct behaviors across different emotions. These results offer new insights into the cross-lingual SER capabilities of SSL models, underscoring both their similarities to and differences from human emotion perception.

[AI-123] TSBP: Improving Object Detection in Histology Images via Test-time Self-guided Bounding-box Propagation MICCAI2024

Link: https://arxiv.org/abs/2409.16678
Authors: Tingting Yang, Liang Xiao, Yizhe Zhang
Keywords: Earth Mover Distance, detection, TSBP, Test-time Self-guided Bounding-box, Self-guided Bounding-box Propagation
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI 2024

Abstract:A global threshold (e.g., 0.5) is often applied to determine which bounding boxes should be included in the final results for an object detection task. A higher threshold reduces false positives but may result in missing a significant portion of true positives. A lower threshold can increase detection recall but may also result in more false positives. Because of this, using a preset global threshold (e.g., 0.5) applied to all the bounding box candidates may lead to suboptimal solutions. In this paper, we propose a Test-time Self-guided Bounding-box Propagation (TSBP) method, leveraging Earth Mover’s Distance (EMD) to enhance object detection in histology images. TSBP utilizes bounding boxes with high confidence to influence those with low confidence, leveraging visual similarities between them. This propagation mechanism enables bounding boxes to be selected in a controllable, explainable, and robust manner, which surpasses the effectiveness of using simple thresholds and uncertainty calibration methods. Importantly, TSBP does not necessitate additional labeled samples for model training or parameter estimation, unlike calibration methods. We conduct experiments on gland detection and cell detection tasks in histology images. The results show that our proposed TSBP significantly improves detection outcomes when working in conjunction with state-of-the-art deep learning-based detection networks. Compared to other methods such as uncertainty calibration, TSBP yields more robust and accurate object detection predictions while using no additional labeled samples. The code is available at this https URL.
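
The threshold trade-off that motivates this paper is easy to reproduce on toy data. The Python sketch below is not TSBP itself; it only illustrates the baseline problem. The scores and true/false labels are made up, and the sketch assumes each candidate box has already been matched against ground truth (and that every true object has a candidate), which real detection evaluation would have to compute via IoU matching.

```python
def precision_recall(scores, is_true, threshold):
    """Precision and recall of the boxes kept by a single global score threshold."""
    kept = [t for s, t in zip(scores, is_true) if s >= threshold]
    true_positives = sum(kept)
    total_positives = sum(is_true)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_positives if total_positives else 1.0
    return precision, recall


# Hypothetical candidate boxes: confidence score and whether the box is a true object.
scores = [0.95, 0.90, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20]
is_true = [True, True, True, False, True, False, True, False]

for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall(scores, is_true, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold improves precision and lowers recall, and no single value is best for both, which is the gap a per-box selection scheme tries to close.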

[AI-124] ECG-Image-Database: A Dataset of ECG Images with Real-World Imaging and Scanning Artifacts; A Foundation for Computerized ECG Image Digitization and Analysis

Link: https://arxiv.org/abs/2409.16612
Authors: Matthew A. Reyna, Deepanshi, James Weigle, Zuzana Koscova, Kiersten Campbell, Kshama Kodthalu Shivashankara, Soheil Saghafi, Sepideh Nikookar, Mohsen Motie-Shirazi, Yashar Kiarashi, Salman Seyedi, Gari D. Clifford, Reza Sameni
Keywords: ECG, ECG images, images, collection of electrocardiogram, ECG time-series
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Abstract:We introduce the ECG-Image-Database, a large and diverse collection of electrocardiogram (ECG) images generated from ECG time-series data, with real-world scanning, imaging, and physical artifacts. We used ECG-Image-Kit, an open-source Python toolkit, to generate realistic images of 12-lead ECG printouts from raw ECG time-series. The images include realistic distortions such as noise, wrinkles, stains, and perspective shifts, generated both digitally and physically. The toolkit was applied to 977 12-lead ECG records from the PTB-XL database and 1,000 from Emory Healthcare to create high-fidelity synthetic ECG images. These unique images were subjected to both programmatic distortions using ECG-Image-Kit and physical effects like soaking, staining, and mold growth, followed by scanning and photography under various lighting conditions to create real-world artifacts. The resulting dataset includes 35,595 software-labeled ECG images with a wide range of imaging artifacts and distortions. The dataset provides ground truth time-series data alongside the images, offering a reference for developing machine and deep learning models for ECG digitization and classification. The images vary in quality, from clear scans of clean papers to noisy photographs of degraded papers, enabling the development of more generalizable digitization algorithms. ECG-Image-Database addresses a critical need for digitizing paper-based and non-digital ECGs for computerized analysis, providing a foundation for developing robust machine and deep learning models capable of converting ECG images into time-series. The dataset aims to serve as a reference for ECG digitization and computerized annotation efforts. ECG-Image-Database was used in the PhysioNet Challenge 2024 on ECG image digitization and classification. 

[AI-125] A Hybrid Quantum Neural Network for Split Learning

Link: https://arxiv.org/abs/2409.16593
Authors: Hevish Cowlessur, Chandra Thapa, Tansu Alpcan, Seyit Camtepe
Keywords: Quantum Machine Learning, distributed collaborative learning, Machine Learning, Quantum Split Learning, Split Learning
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 47 pages

Abstract:Quantum Machine Learning (QML) is an emerging field of research with potential applications to distributed collaborative learning, such as Split Learning (SL). SL allows resource-constrained clients to collaboratively train ML models with a server, reduce their computational overhead, and enable data privacy by avoiding raw data sharing. Although QML with SL has been studied, the problem remains open in resource-constrained environments where clients lack quantum computing capabilities. Additionally, data privacy leakage between client and server in SL poses risks of reconstruction attacks on the server side. To address these issues, we propose Hybrid Quantum Split Learning (HQSL), an application of Hybrid QML in SL. HQSL enables classical clients to train models with a hybrid quantum server and curtails reconstruction attacks. In addition, we introduce a novel qubit-efficient data-loading technique for designing a quantum layer in HQSL, minimizing both the number of qubits and circuit depth. Experiments on five datasets demonstrate HQSL’s feasibility and ability to enhance classification performance compared to its classical models. Notably, HQSL achieves mean improvements of over 3% in both accuracy and F1-score for the Fashion-MNIST dataset, and over 1.5% in both metrics for the Speech Commands dataset. We expand these studies to include up to 100 clients, confirming HQSL’s scalability. Moreover, we introduce a noise-based defense mechanism to tackle reconstruction attacks on the server side. Overall, HQSL enables classical clients to collaboratively train their models with a hybrid quantum server, leveraging quantum advantages while improving model performance and security against data privacy leakage-related reconstruction attacks.

[AI-126] Center-fixing of tropical cyclones using uncertainty-aware deep learning applied to high-temporal-resolution geostationary satellite imagery

Link: https://arxiv.org/abs/2409.16507
Authors: Ryan Lagerquist, Galina Chirokova, Robert DeMaria, Mark DeMaria, Imme Ebert-Uphoff
Keywords: surface circulation center, surface circulation, circulation center, TC-forecasting process, affecting current
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: Submitted to AMS journal Weather and Forecasting. Main body is 52 pages and 14 figures; supplement is another 33 pages and 28 figures

Abstract: Determining the location of a tropical cyclone’s (TC) surface circulation center – “center-fixing” – is a critical first step in the TC-forecasting process, affecting current and future estimates of track, intensity, and structure. Despite a recent increase in the number of automated center-fixing methods, only one such method (ARCHER-2) is operational, and its best performance is achieved when using microwave or scatterometer data, which are not available at every forecast cycle. We develop a deep-learning algorithm called GeoCenter; it relies only on geostationary IR satellite imagery, which is available for all TC basins at high frequency (10-15 min) and low latency (<10 min) during both day and night. GeoCenter ingests an animation (time series) of IR images, including 10 channels at lag times up to 3 hours. The animation is centered at a “first guess” location, offset from the true TC-center location by 48 km on average and sometimes 100 km; GeoCenter is tasked with correcting this offset. On an independent testing dataset, GeoCenter achieves a mean/median/RMS (root mean square) error of 26.9/23.3/32.0 km for all systems, 25.7/22.3/30.5 km for tropical systems, and 15.7/13.6/18.6 km for category-2–5 hurricanes. These values are similar to ARCHER-2 errors when microwave or scatterometer data are available, and better than ARCHER-2 errors when only IR data are available. GeoCenter also performs skillful uncertainty quantification (UQ), producing a well calibrated ensemble of 200 TC-center locations. Furthermore, all predictors used by GeoCenter are available in real time, which would make GeoCenter easy to implement operationally every 10-15 min.
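
The abstract reports errors as mean/median/RMS triples. As a reminder of how those three summaries relate, here is a small Python sketch on made-up center-fix offsets. The offsets and the function name `error_summary` are hypothetical, and distances are computed on a flat plane in km rather than on the sphere.

```python
import math
import statistics


def error_summary(offsets_km):
    """Mean, median, and root-mean-square of Euclidean errors from (dx, dy) offsets in km."""
    errors = [math.hypot(dx, dy) for dx, dy in offsets_km]
    mean = statistics.fmean(errors)
    median = statistics.median(errors)
    rms = math.sqrt(statistics.fmean([e * e for e in errors]))
    return mean, median, rms


# Hypothetical (east, north) offsets between predicted and true TC centers, in km.
offsets_km = [(3.0, 0.0), (0.0, 4.0), (3.0, 4.0), (0.0, 12.0)]
mean_err, median_err, rms_err = error_summary(offsets_km)
print(f"mean={mean_err:.2f} km, median={median_err:.2f} km, RMS={rms_err:.2f} km")
```

RMS is never smaller than the mean and is pulled up by large misses, which is consistent with the reported triples where RMS exceeds the mean and the mean exceeds the median.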

[AI-127] To Explore the Potential Inhibitors against Multitarget Proteins of COVID 19 using In Silico Study

Link: https://arxiv.org/abs/2409.16486
Authors: Imra Aqeel
Keywords: global pandemic due, potential inhibitors, learning regression approaches, machine learning regression, due to emergence
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract: The global pandemic caused by the emergence of COVID-19 has created an unrivaled public health crisis, with a morbidity rate not seen in recent decades. Researchers have made many efforts to find an optimal solution to this pandemic. Drug repurposing is an emerging and powerful strategy that saves cost, time, and labor. The lack of identified repurposed drug candidates against COVID-19 demands more effort to explore potential inhibitors for an effective cure. In this study, we used a combination of molecular docking and machine learning regression approaches to explore potential inhibitors for the treatment of COVID-19. We calculated the binding affinities of these drugs to multitarget proteins using a molecular docking process. We performed QSAR modeling by employing various machine learning regression approaches to identify potential inhibitors against COVID-19. Our findings, with the best R² and RMSE scores, demonstrated that our proposed Decision Tree Regression (DTR) model is the most appropriate model for exploring potential inhibitors. We proposed five novel promising inhibitors, with ZINC IDs 3873365, 85432544, 8214470, 85536956, and 261494640, within the range of -19.7 kcal/mol to -12.6 kcal/mol. We further analyzed the physicochemical and pharmacokinetic properties of these most potent inhibitors to examine their behavior. The analysis of these properties is a key factor in promoting an effective cure for public health. Our work constructs an efficient framework with which to probe potential inhibitors against COVID-19 by combining molecular docking with machine learning regression approaches.
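
The abstract ranks regression models by R2 and RMSE. For reference, both scores can be computed from predictions in a few lines of Python. The binding-affinity values below are made up for illustration and are not from the paper; `r2_and_rmse` is a hypothetical helper name.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Coefficient of determination (R^2) and root-mean-square error of predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical docking scores (arbitrary units) and a model's predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2, rmse = r2_and_rmse(y_true, y_pred)
print(f"R^2={r2:.3f}, RMSE={rmse:.3f}")
```

A higher R2 (closer to 1) and a lower RMSE both indicate a better fit, so a model comparison would prefer the regressor that does best on both for held-out data.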

[AI-128] Future-Proofing Medical Imaging with Privacy-Preserving Federated Learning and Uncertainty Quantification: A Review

Link: https://arxiv.org/abs/2409.16340
Authors: Nikolas Koutsoubis, Asim Waqas, Yasin Yilmaz, Ravi P. Ramachandran, Matthew Schabath, Ghulam Rasool
Keywords: Artificial Intelligence, robust Artificial intelligence, Artificial intelligence models, medical imaging tasks, demonstrated significant potential
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, 4 tables, Review paper, preprint to Radiology AI. arXiv admin note: text overlap with arXiv:2406.12815

Abstract:Artificial Intelligence (AI) has demonstrated significant potential in automating various medical imaging tasks, which could soon become routine in clinical practice for disease diagnosis, prognosis, treatment planning, and post-treatment surveillance. However, the privacy concerns surrounding patient data present a major barrier to the widespread adoption of AI in medical imaging, as large, diverse training datasets are essential for developing accurate, generalizable, and robust AI models. Federated Learning (FL) offers a solution that enables organizations to train AI models collaboratively without sharing sensitive data. FL exchanges model training information, such as gradients, between the participating sites. Despite its promise, FL is still in its developmental stages and faces several challenges. Notably, sensitive information can still be inferred from the gradients shared during model training. Quantifying AI models’ uncertainty is vital due to potential data distribution shifts post-deployment, which can affect model performance. Uncertainty quantification (UQ) in FL is particularly challenging due to data heterogeneity across participating sites. This review provides a comprehensive examination of FL, privacy-preserving FL (PPFL), and UQ in FL. We identify key gaps in current FL methodologies and propose future research directions to enhance data privacy and trustworthiness in medical imaging applications.
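
The gradient exchange mentioned in the abstract can be illustrated with a toy aggregation step. This is a minimal sketch, assuming a central server that averages per-site gradients weighted by local dataset size (a FedAvg-style rule that the abstract itself does not specify). The function name `average_gradients` and all numbers are hypothetical.

```python
def average_gradients(site_gradients, site_sizes):
    """Combine per-site gradient vectors into one, weighting by local sample count.

    site_gradients: list of equal-length lists, one gradient vector per site
    site_sizes: number of training samples held by each site (assumed known)
    """
    total = sum(site_sizes)
    n_params = len(site_gradients[0])
    return [
        sum(grad[i] * size for grad, size in zip(site_gradients, site_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical hospitals; raw images never leave either site, only gradients do.
gradients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]
aggregated = average_gradients(gradients, sizes)
print(aggregated)
```

As the abstract notes, sharing gradients is not automatically private, which is why privacy-preserving variants add further protection on top of a step like this.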

[AI-129] MRI Radiomics for IDH Genotype Prediction in Glioblastoma Diagnosis

Link: https://arxiv.org/abs/2409.16329
Authors: Stanislav Kozák
Keywords: utilises automatically identified, automatically identified features, radiological scans, field which utilises, utilises automatically
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 8 pages, 1 figure

Abstract:Radiomics is a relatively new field which utilises automatically identified features from radiological scans. It has found a widespread application, particularly in oncology because many of the important oncological biomarkers are not visible to the naked eye. The recent advent of big data, including in medical imaging, and the development of new ML techniques brought the possibility of faster and more accurate oncological diagnosis. Furthermore, standardised mathematical feature extraction based on radiomics helps to eliminate possible radiologist bias. This paper reviews the recent development in the oncological use of MRI radiomic features. It focuses on the identification of the isocitrate dehydrogenase (IDH) mutation status, which is an important biomarker for the diagnosis of glioblastoma and grade IV astrocytoma.

[AI-130] Towards Within-Class Variation in Alzheimer's Disease Detection from Spontaneous Speech

Link: https://arxiv.org/abs/2409.16322
Authors: Jiawen Kang,Dongrui Han,Lingwei Meng,Jingyan Zhou,Jinchao Li,Xixin Wu,Helen Meng
Keywords: Alzheimer Disease, promising research area, employs machine learning, machine learning classification, promising research
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Neurons and Cognition (q-bio.NC)
*Comments:

Abstract:Alzheimer’s Disease (AD) detection has emerged as a promising research area that employs machine learning classification models to distinguish between individuals with AD and those without. Unlike conventional classification tasks, we identify within-class variation as a critical challenge in AD detection: individuals with AD exhibit a spectrum of cognitive impairments. Given that many AD detection tasks lack fine-grained labels, simplistic binary classification may overlook two crucial aspects: within-class differences and instance-level imbalance. The former compels the model to map AD samples with varying degrees of impairment to a single diagnostic label, disregarding certain changes in cognitive function, while the latter biases the model towards overrepresented severity levels. This work presents early efforts to address these challenges. We propose two novel methods, Soft Target Distillation (SoTD) and Instance-level Re-balancing (InRe), targeting these two problems, respectively. Experiments on the ADReSS and ADReSSo datasets demonstrate that the proposed methods significantly improve detection accuracy. Further analysis reveals that SoTD effectively harnesses the strengths of multiple component models, while InRe substantially alleviates model over-fitting. These findings provide insights for developing more robust and reliable AD detection models.

[AI-131] Developing a Thailand solar irradiance map using Himawari-8 satellite imageries and deep learning models

Link: https://arxiv.org/abs/2409.16320
Authors: Suwichaya Suwanwimolkul,Natanon Tongamrak,Nuttamon Thungka,Naebboon Hoonchareon,Jitkomut Songsiri
Keywords: Thailand solar irradiance, shows Thailand solar, GHI, presents an online, online platform
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 23 pages, 14 figures

Abstract:This paper presents an online platform that shows Thailand’s solar irradiance map every 30 minutes. It is available at this https URL. The methodology for estimating global horizontal irradiance (GHI) across Thailand relies on cloud index extracted from Himawari-8 satellite imagery, Ineichen clear-sky model with locally-tuned Linke turbidity, and machine learning models. The methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature data from the MERRA-2 database, and date-time as inputs for GHI estimation models, including LightGBM, LSTM, Informer, and Transformer. These are benchmarked with the estimate from the SolCast service by evaluation of 15-minute ground GHI data from 53 ground stations over 1.5 years during 2022-2023. The results show that the four models have competitive performances and outperform the SolCast service. The best model is LightGBM, with an MAE of 78.58 W/sqm and RMSE of 118.97 W/sqm. Obtaining re-analyzed MERRA-2 data for Thailand is not economically feasible for deployment. When removing these features, the Informer model has a winning performance of 78.67 W/sqm in MAE. The obtained performance aligns with existing literature by taking the climate zone and time granularity of data into consideration. As the map shows an estimate of GHI over 93,000 grids with a frequent update, the paper also describes a computational framework for displaying the entire map. It tests the runtime performance of deep learning models in the GHI estimation process.
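
The two error metrics quoted in the abstract, MAE and RMSE, can be computed as follows. This is only an illustrative sketch: the GHI values below are made up and are not taken from the paper, and the helper names `mae` and `rmse` are hypothetical.

```python
import math


def mae(predicted, observed):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Hypothetical GHI values in W/sqm, for illustration only.
estimated_ghi = [500.0, 620.0, 300.0, 0.0]
measured_ghi = [480.0, 650.0, 310.0, 0.0]
print(mae(estimated_ghi, measured_ghi), rmse(estimated_ghi, measured_ghi))
```

RMSE penalises large individual errors more heavily than MAE, which is one reason reports like this one quote both.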

[AI-132] A Literature Review of Keyword Spotting Technologies for Urdu

Link: https://arxiv.org/abs/2409.16317
Authors: Syed Muhammad Aqdas Rizvi
Keywords: Pakistan low-resource language, Pakistan low-resource, literature review surveys, keyword spotting, specifically focusing
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*Comments:

Abstract:This literature review surveys the advancements of keyword spotting (KWS) technologies, specifically focusing on Urdu, Pakistan’s low-resource language (LRL), which has complex phonetics. Despite the global strides in speech technology, Urdu presents unique challenges requiring more tailored solutions. The review traces the evolution from foundational Gaussian Mixture Models to sophisticated neural architectures like deep neural networks and transformers, highlighting significant milestones such as integrating multi-task learning and self-supervised approaches that leverage unlabeled data. It examines emerging technologies’ role in enhancing KWS systems’ performance within multilingual and resource-constrained settings, emphasizing the need for innovations that cater to languages like Urdu. Thus, this review underscores the need for context-specific research addressing the inherent complexities of Urdu and similar LRLs and the means of regions communicating through such languages for a more inclusive approach to speech technology.

[AI-133] Surface solar radiation: AI satellite retrieval can outperform Heliosat and generalizes well to other climate zones

Link: https://arxiv.org/abs/2409.16316
Authors: K. R. Schuurman,A. Meyer
Keywords: building control applications, SSI, SSI estimates, SSI retrieval model, solar resource assessments
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 19 pages, 11 figures

Abstract:Accurate estimates of surface solar irradiance (SSI) are essential for solar resource assessments and solar energy forecasts in grid integration and building control applications. SSI estimates for spatially extended regions can be retrieved from geostationary satellites such as Meteosat. Traditional SSI satellite retrievals like Heliosat rely on physical radiative transfer modelling. We introduce the first machine-learning-based satellite retrieval for instantaneous SSI and demonstrate its capability to provide accurate and generalizable SSI estimates across Europe. Our deep learning retrieval provides near real-time SSI estimates based on data-driven emulation of Heliosat and fine-tuning on pyranometer networks. By including SSI from ground stations, our SSI retrieval model can outperform Heliosat accuracy and generalize well to regions with other climates and surface albedos in cloudy conditions (clear-sky index < 0.8). We also show that the SSI retrieved from Heliosat exhibits large biases in mountain regions, and that training and fine-tuning our retrieval models on SSI data from ground stations strongly reduces these biases, outperforming Heliosat. Furthermore, we quantify the relative importance of the Meteosat channels and other predictor variables like solar zenith angle for the accuracy of our deep learning SSI retrieval model in different cloud conditions. We find that in cloudy conditions multiple near-infrared and infrared channels enhance the performance. Our results can facilitate the development of more accurate satellite retrieval models of surface solar irradiance.

[AI-134] SEE: Semantically Aligned EEG-to-Text Translation

Link: https://arxiv.org/abs/2409.16312
Authors: Yitian Tao,Yan Liang,Luoyu Wang,Yongqing Li,Qing Yang,Han Zhang
Keywords: great research interest, Decoding neurophysiological signals, brain-computer interface, neurophysiological signals, great research
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*Comments: 4 pages

Abstract:Decoding neurophysiological signals into language is of great research interest within brain-computer interface (BCI) applications. Electroencephalography (EEG), known for its non-invasiveness, ease of use, and cost-effectiveness, has been a popular method in this field. However, current EEG-to-Text decoding approaches face challenges due to the huge domain gap between EEG recordings and raw texts, inherent data bias, and small closed vocabularies. In this paper, we propose SEE: Semantically Aligned EEG-to-Text Translation, a novel method aimed at improving EEG-to-Text decoding by seamlessly integrating two modules into a pre-trained BART language model. These two modules include (1) a Cross-Modal Codebook that learns cross-modal representations to enhance feature consolidation and mitigate domain gap, and (2) a Semantic Matching Module that fully utilizes pre-trained text representations to align multi-modal features extracted from EEG-Text pairs while considering noise caused by false negatives, i.e., data from different EEG-Text pairs that have similar semantic meanings. Experimental results on the Zurich Cognitive Language Processing Corpus (ZuCo) demonstrate the effectiveness of SEE, which enhances the feasibility of accurate EEG-to-Text decoding.

Computer Vision

[CV-0] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Link: https://arxiv.org/abs/2409.17146
Authors: Matt Deitke,Christopher Clark,Sangho Lee,Rohun Tripathi,Yue Yang,Jae Sung Park,Mohammadreza Salehi,Niklas Muennighoff,Kyle Lo,Luca Soldaini,Jiasen Lu,Taira Anderson,Erin Bransom,Kiana Ehsani,Huong Ngo,YenSung Chen,Ajay Patel,Mark Yatskar,Chris Callison-Burch,Andrew Head,Rose Hendrix,Favyen Bastani,Eli VanderBilt,Nathan Lambert,Yvonne Chou,Arnavi Chheda,Jenna Sparks,Sam Skjonsberg,Michael Schmitz,Aaron Sarnat,Byron Bischoff,Pete Walsh,Chris Newell,Piper Wolters,Tanmay Gupta,Kuo-Hao Zeng,Jon Borchardt,Dirk Groeneveld,Jen Dumas,Crystal Nam,Sophie Lebrecht,Caitlin Wittlif,Carissa Schoenick,Oscar Michel,Ranjay Krishna,Luca Weihs,Noah A. Smith,Hannaneh Hajishirzi,Ross Girshick,Ali Farhadi,Aniruddha Kembhavi
Keywords: Today most advanced, advanced multimodal models, multimodal models remain, advanced multimodal, models remain proprietary
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Abstract:Today’s most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild QA and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at this https URL.

[CV-1] DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion ALT

Link: https://arxiv.org/abs/2409.17145
Authors: Yukun Huang,Jianan Wang,Ailing Zeng,Zheng-Jun Zha,Lei Zhang,Xihui Liu
Keywords: shown promising results, Leveraging pretrained, score distillation sampling, Skeleton-guided Score Distillation, Gaussian Avatar representation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*Comments: Project page: this https URL

Abstract:Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging. In this work, we present DreamWaltz-G, a novel learning framework for animatable 3D avatar generation from text. The core of this framework lies in Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision in terms of view and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on the efficient 3D Gaussians, combining neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation. Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.

[CV-2] Attention Prompting on Image for Large Vision-Language Models

Link: https://arxiv.org/abs/2409.17143
Authors: Runpeng Yu,Weihao Yu,Xinchao Wang
Keywords: Large Language Models, Large Vision-Language Models, Large Language, demonstrating impressive performance, Compared with Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Website, see this https URL

Abstract:Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, thus showcasing more interesting emergent capabilities and demonstrating impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored to enhance LVLMs’ capabilities of perceiving visual information. However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models’ ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. Specifically, we generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP. Then the heatmap simply multiplies the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on MM-Vet and LLaVA-Wild benchmarks, respectively.
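
The overlay step described in the abstract (the heatmap multiplies the pixel values) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the heatmap has already been produced by some auxiliary model and resized to the image resolution, and the function name `apply_attention_prompt` is hypothetical.

```python
def apply_attention_prompt(image, heatmap):
    """Scale every pixel by its attention weight.

    image: H x W x C nested lists with channel values in [0, 255]
    heatmap: H x W nested lists with weights in [0, 1] (assumed precomputed)
    """
    same_height = len(image) == len(heatmap)
    same_width = all(len(r) == len(h) for r, h in zip(image, heatmap))
    if not (same_height and same_width):
        raise ValueError("image and heatmap must share height and width")
    return [
        [[int(round(channel * weight)) for channel in pixel]
         for pixel, weight in zip(image_row, heat_row)]
        for image_row, heat_row in zip(image, heatmap)
    ]


# A 1 x 2 RGB image: the second pixel gets half the attention of the first.
image = [[[200, 100, 50], [200, 100, 50]]]
heatmap = [[1.0, 0.5]]
prompted = apply_attention_prompt(image, heatmap)
print(prompted)
```

Regions with low attention weight are darkened, so the prompted image visually emphasises the areas relevant to the text query.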

[CV-3] PACE: marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization NEURIPS2024

Link: https://arxiv.org/abs/2409.17137
Authors: Yao Ni,Shan Zhang,Piotr Koniusz
Keywords: effectively adapts pre-trained, adapts pre-trained vision, pre-trained vision transformers, effectively adapts, vision transformers
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by NeurIPS 2024 as a spotlight. This preliminary version will soon be extended with the experiments and analyses from the rebuttal

Abstract:Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained vision transformers to downstream tasks. However, the optimization for task performance often comes at the cost of generalizability in fine-tuned models. To address this issue, we theoretically connect smaller weight gradient norms during training and larger datasets to improved model generalization. Motivated by this connection, we propose reducing gradient norms for enhanced generalization and aligning the fine-tuned model with its pre-trained counterpart to retain knowledge from large-scale pre-training data. Yet, naive alignment does not guarantee gradient reduction and can potentially cause gradient explosion, complicating efforts to manage gradients. To address such issues, we propose PACE, marrying generalization of PArameter-efficient fine-tuning with Consistency rEgularization. We perturb features learned from the adapter with multiplicative noise and ensure the fine-tuned model remains consistent for the same sample under different perturbations. Theoretical analysis shows that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. Experimental evidence supports our theories. PACE outperforms existing PEFT methods in four visual adaptation tasks: VTAB-1k, FGVC, few-shot learning and domain adaptation. Code will be available at this https URL

[CV-4] Streaming Neural Images ICIP

Link: https://arxiv.org/abs/2409.17134
Authors: Marcos V. Conde,Andy Bigos,Radu Timofte
Keywords: attracted considerable interest, Implicit Neural Representations, Neural Representations, signal representation, attracted considerable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*Comments: IEEE International Conference on Image Processing (ICIP) 2024

Abstract:Implicit Neural Representations (INRs) are a novel paradigm for signal representation that have attracted considerable interest for image compression. INRs offer unprecedented advantages in signal resolution and memory efficiency, enabling new possibilities for compression techniques. However, the existing limitations of INRs for image compression have not been sufficiently addressed in the literature. In this work, we explore the critical yet overlooked limiting factors of INRs, such as computational cost, unstable performance, and robustness. Through extensive experiments and empirical analysis, we provide a deeper and more nuanced understanding of implicit neural image compression methods such as Fourier Feature Networks and Siren. Our work also offers valuable insights for future research in this area.

[CV-5] Small data deep learning methodology for in-field disease detection

Link: https://arxiv.org/abs/2409.17119
Authors: David Herrera-Poyato,Jacinto Domínguez-Rull,Rosana Montes,Inés Hernánde,Ignacio Barrio,Carlos Poblete-Echeverria,Javier Tardaguila,Francisco Herrera,Andrés Herrera-Poyatos
Keywords: prevent harvest losses, machine learning, final product, essential to prevent, prevent harvest
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 9 pages

Abstract:Early detection of diseases in crops is essential to prevent harvest losses and improve the quality of the final product. In this context, the combination of machine learning and proximity sensors is emerging as a technique capable of achieving this detection efficiently and effectively. For example, this machine learning approach has been applied to potato crops – to detect late blight (Phytophthora infestans) – and grapevine crops – to detect downy mildew. However, most of these AI models found in the specialised literature have been developed using leaf-by-leaf images taken in the lab, which does not represent field conditions and limits their applicability. In this study, we present the first machine learning model capable of detecting mild symptoms of late blight in potato crops through the analysis of high-resolution RGB images captured directly in the field, overcoming the limitations of other publications in the literature and presenting real-world applicability. Our proposal exploits the availability of high-resolution images via the concept of patching, and is based on deep convolutional neural networks with a focal loss function, which makes the model focus on the complex patterns that arise in field conditions. Additionally, we present a data augmentation scheme that facilitates the training of these neural networks with few high-resolution images, which allows for the development of models under the small data paradigm. Our model correctly detects all cases of late blight in the test dataset, demonstrating a high level of accuracy and effectiveness in identifying early symptoms. These promising results reinforce the potential use of machine learning for the early detection of diseases and pests in agriculture, enabling better treatment and reducing their impact on crops.
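
The focal loss mentioned in the abstract down-weights easy examples so that training concentrates on hard ones. The sketch below shows the standard binary form; the defaults `gamma=2.0` and `alpha=0.25` are common choices assumed here for illustration, since the abstract does not state the paper's settings, and `binary_focal_loss` is a hypothetical helper name.

```python
import math


def binary_focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Focal loss for one example.

    prob: predicted probability of the positive class, in (0, 1)
    label: ground-truth class, 0 or 1
    gamma: focusing parameter (assumed default); gamma=0 gives weighted cross-entropy
    alpha: weight of the positive class (assumed default)
    """
    eps = 1e-12  # guards against log(0)
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = binary_focal_loss(0.95, 1)  # confidently correct prediction
hard = binary_focal_loss(0.10, 1)  # badly wrong prediction
print(easy, hard)
```

With these settings the confidently correct example contributes far less loss than the misclassified one, which is the behaviour that lets a model focus on difficult patterns.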

[CV-6] MorphoSeg: An Uncertainty-Aware Deep Learning Method for Biomedical Segmentation of Complex Cellular Morphologies

Link: https://arxiv.org/abs/2409.17110
Authors: Tianhao Zhang,Heather J. McCourty,Berardo M. Sanchez-Tafolla,Anton Nikolaev,Lyudmila S. Mihaylova
Keywords: revolutionized medical, biological imaging, segmentation, Dice Similarity Coefficient, Deep learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Deep learning has revolutionized medical and biological imaging, particularly in segmentation tasks. However, segmenting biological cells remains challenging due to the high variability and complexity of cell shapes. Addressing this challenge requires high-quality datasets that accurately represent the diverse morphologies found in biological cells. Existing cell segmentation datasets are often limited by their focus on regular and uniform shapes. In this paper, we introduce a novel benchmark dataset of Ntera-2 (NT2) cells, a pluripotent carcinoma cell line, exhibiting diverse morphologies across multiple stages of differentiation, capturing the intricate and heterogeneous cellular structures that complicate segmentation tasks. To address these challenges, we propose an uncertainty-aware deep learning framework for complex cellular morphology segmentation (MorphoSeg) by incorporating sampling of virtual outliers from low-likelihood regions during training. Our comprehensive experimental evaluations against state-of-the-art baselines demonstrate that MorphoSeg significantly enhances segmentation accuracy, achieving up to a 7.74% increase in the Dice Similarity Coefficient (DSC) and a 28.36% reduction in the Hausdorff Distance. These findings highlight the effectiveness of our dataset and methodology in advancing cell segmentation capabilities, especially for complex and variable cell morphologies. The dataset and source code are publicly available at this https URL.
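
The Dice Similarity Coefficient reported in the abstract measures the overlap between a predicted mask and the ground truth, DSC = 2|A ∩ B| / (|A| + |B|). Below is a minimal sketch on flattened binary masks; the masks are invented for illustration, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity between two flat binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical 5-pixel masks: two foreground pixels overlap out of three in each mask.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(dice_coefficient(predicted, reference))
```

A value of 1.0 means perfect overlap and 0.0 means no overlap, so an increase in DSC indicates better segmentation.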

[CV-7] Unveiling Ontological Commitment in Multi-Modal Foundation Models ECAI2024

Link: https://arxiv.org/abs/2409.17109
Authors: Mert Keser,Gesina Schwalbe,Niki Amini-Naieni,Matthias Rottmann,Alois Knoll
Keywords: corner stone, models, Ontological commitment, foundation models, concepts
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments: Qualitative Reasoning Workshop 2024 (QR2024) colocated with ECAI2024, camera-ready submission; first two authors contributed equally; 10 pages, 4 figures, 3 tables

Abstract:Ontological commitment, i.e., the concepts, relations, and assumptions used, is a cornerstone of qualitative reasoning (QR) models. The state of the art for processing raw inputs, though, is deep neural networks (DNNs), nowadays often based on multimodal foundation models. These automatically learn rich representations of concepts and respective reasoning. Unfortunately, the learned qualitative knowledge is opaque, preventing easy inspection, validation, or adaptation against available QR models. So far, it is possible to associate pre-defined concepts with latent representations of DNNs, but extractable relations are mostly limited to semantic similarity. As a next step towards QR for validation and verification of DNNs, we propose a method that extracts the learned superclass hierarchy from a multimodal DNN for a given set of leaf concepts. Under the hood we (1) obtain leaf concept embeddings using the DNN’s textual input modality; (2) apply hierarchical clustering to them, exploiting the fact that DNNs encode semantic similarities via vector distances; and (3) label the resulting parent concepts using search in available ontologies from QR. An initial evaluation study shows that meaningful ontological class hierarchies can be extracted from state-of-the-art foundation models. Furthermore, we demonstrate how to validate and verify a DNN’s learned representations against given ontologies. Lastly, we discuss potential future applications in the context of QR.

[CV-8] Text2CAD: Generating Sequential CAD Models from Beginner-to-Expert Level Text Prompts NEURIPS2024

Link: https://arxiv.org/abs/2409.17106
Authors: Mohammad Sadil Khan,Sankalp Sinha,Talha Uddin Sheikh,Didier Stricker,Sk Aziz Ali,Muhammad Zeshan Afzal
Keywords: Prototyping complex computer-aided, Prototyping complex, complex computer-aided design, complex computer-aided, modern softwares
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*Comments: Accepted in NeurIPS 2024 (Spotlight)

Abstract:Prototyping complex computer-aided design (CAD) models in modern software can be very time-consuming. This is due to the lack of intelligent systems that can quickly generate simpler intermediate parts. We propose Text2CAD, the first AI framework for generating text-to-parametric CAD models using designer-friendly instructions for all skill levels. Furthermore, we introduce a data annotation pipeline for generating text prompts based on natural language instructions for the DeepCAD dataset using Mistral and LLaVA-NeXT. The dataset contains ~170K models and ~660K text annotations, from abstract CAD descriptions (e.g., generate two concentric cylinders) to detailed specifications (e.g., draw two circles with center (x, y) and radius r_1, r_2, and extrude along the normal by d …). Within the Text2CAD framework, we propose an end-to-end transformer-based auto-regressive network to generate parametric CAD models from input texts. We evaluate the performance of our model through a mixture of metrics, including visual quality, parametric precision, and geometrical accuracy. Our proposed framework shows great potential in AI-aided design applications. Our source code and annotations will be publicly available.

[CV-9] General Detection-based Text Line Recognition

Link: https://arxiv.org/abs/2409.17095
Authors: Raphael Baena,Syrine Kalleli,Mathieu Aubry
Keywords: general detection-based approach, introduce a general, text line recognition, OCR, general detection-based
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered characters. Detection-based approaches have until now been largely discarded for HTR because reading characters separately is often challenging, and character-level annotation is difficult and expensive. We overcome these challenges thanks to three main insights: (i) synthetic pre-training with sufficiently diverse data enables learning reasonable character localization for any script; (ii) modern transformer-based detectors can jointly detect a large number of instances, and, if trained with an adequate masking strategy, leverage consistency between the different detections; (iii) once a pre-trained detection model with approximate character localization is available, it is possible to fine-tune it with line-level annotation on real data, even with a different alphabet. Our approach, dubbed DTLR, builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding, predicting character values one by one, while we treat a complete line in parallel. Remarkably, we demonstrate good performance on a large range of scripts, usually tackled with specialized approaches. In particular, we improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets. Our code and models are available at this https URL.

[CV-10] BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Link: https://arxiv.org/abs/2409.17093
Authors: Yongqi Xu, Yujian Lee, Gao Yi, Bosheng Liu, Yucong Chen, Peng Liu, Jigang Wu, Xiaoming Chen, Yinhe Han
Keywords: Deep neural networks, Deep neural, object detection, neural networks, image classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks (DNNs) are powerful for cognitive tasks such as image classification, object detection, and scene segmentation. One drawback, however, is their significantly high computational complexity and memory consumption, which makes them infeasible to run in real time on embedded platforms because of the limited hardware resources. Block floating point (BFP) quantization is one of the representative compression approaches for reducing the memory and computational burden, owing to its capability to effectively capture the broad data distribution of DNN models. Unfortunately, prior works on BFP-based quantization choose the block size and the precision empirically to preserve accuracy. In this paper, we develop a BFP-based bitwidth-aware analytical modeling framework (called "BitQ") for the best BFP implementation of DNN inference on embedded platforms. We formulate and solve an optimization problem to identify the optimal BFP block size and bitwidth distribution by trading off accuracy and performance loss. Experimental results show that, compared with an equal bitwidth setting, the BFP DNNs with optimized bitwidth allocation provide efficient computation while preserving accuracy on well-known benchmarks. The source code and data are available at this https URL.
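
For readers unfamiliar with block floating point, the following is a minimal sketch of the general idea only, not BitQ itself: each block of values shares one power-of-two exponent, and each value keeps a short signed integer mantissa. The function name `bfp_quantize`, the rounding rule, and the default block size and bitwidth are assumptions chosen for illustration; the abstract does not specify these details, and the paper's contribution (choosing the block size and bitwidth distribution by optimization) is not reproduced here.

```python
import math


def bfp_quantize(values, block_size=4, mantissa_bits=4):
    """Quantize a flat list of floats to block floating point, then dequantize.

    Illustrative assumption: every block of `block_size` values shares one
    power-of-two exponent, and each value keeps a signed integer mantissa
    of `mantissa_bits` bits (sign included).
    """
    q_max = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa magnitude
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        if peak == 0.0:
            out.extend(0.0 for _ in block)
            continue
        # frexp gives peak = m * 2**shared_exp with 0.5 <= m < 1,
        # so 2**shared_exp is a power of two strictly above the peak.
        _, shared_exp = math.frexp(peak)
        step = 2.0 ** shared_exp / (q_max + 1)  # quantization step of the block
        for v in block:
            mantissa = round(v / step)
            mantissa = max(-q_max, min(q_max, mantissa))  # clamp to the range
            out.append(mantissa * step)
    return out


if __name__ == "__main__":
    weights = [0.3, -0.7, 0.11, 0.9]
    for bits in (3, 8):
        approx = bfp_quantize(weights, block_size=4, mantissa_bits=bits)
        worst = max(abs(a - b) for a, b in zip(weights, approx))
        print(bits, "mantissa bits, worst absolute error:", worst)
```

Running the script shows the trade-off the abstract refers to: fewer mantissa bits shrink storage but increase the worst-case error within a block, and small values that share a block with a large one lose the most precision.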

[CV-11] Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification

Link: https://arxiv.org/abs/2409.17091
Authors: Xinrui Zhou, Yuhao Huang, Haoran Dou, Shijing Chen, Ao Chang, Jia Liu, Weiran Long, Jian Zheng, Erjiao Xu, Jie Ren, Ruobing Huang, Jun Cheng, Wufeng Xue, Dong Ni
Keywords: labor-intensive annotation processes, annotation processes hinder, deep models, limited availability, availability of large-scale
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 7 tables

Abstract:In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue, having been proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation, and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases and severely limiting the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantic- and sequential-customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained on 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-domain conditions.

[CV-12] Parameter-efficient Bayesian Neural Networks for Uncertainty-aware Depth Estimation ECCV’24

Link: https://arxiv.org/abs/2409.17085
Authors: Richard D. Paul, Alessio Quercia, Vincent Fortuin, Katharina Nöh, Hanno Scharr
Keywords: monocular depth estimation, modern Transformer-based architectures, depth estimation, rely heavily, heavily on large
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Presented at UnCV Workshop at ECCV’24

Abstract:State-of-the-art computer vision tasks, like monocular depth estimation (MDE), rely heavily on large, modern Transformer-based architectures. However, their application in safety-critical domains demands reliable predictive performance and uncertainty quantification. While Bayesian neural networks provide a conceptually simple approach to serve those requirements, they suffer from the high dimensionality of the parameter space. Parameter-efficient fine-tuning (PEFT) methods, in particular low-rank adaptations (LoRA), have emerged as a popular strategy for adapting large-scale models to down-stream tasks by performing parameter inference on lower-dimensional subspaces. In this work, we investigate the suitability of PEFT methods for subspace Bayesian inference in large-scale Transformer-based vision models. We show that, indeed, combining BitFit, DiffFit, LoRA, and CoLoRA, a novel LoRA-inspired PEFT method, with Bayesian inference enables more robust and reliable predictive performance in MDE.
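
To make the "lower-dimensional subspaces" remark concrete, here is a small sketch of the standard LoRA parameterization, in which a frozen weight matrix W0 is adapted as W0 + (alpha / r) · B · A with a rank-r pair of matrices. It only illustrates why the trainable (and hence inferable) parameter count is small. It does not implement the Bayesian inference, BitFit, DiffFit, or the CoLoRA method from the abstract, and the helper names and the 768-dimensional example are assumptions chosen for illustration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora): parameters of a dense layer vs. its LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha=1.0):
    """Compute y = W0 x + (alpha / r) * B (A x) without forming B A explicitly."""
    rank = len(a)
    low_rank = matvec(a, x)        # project the input down to r dimensions
    delta = matvec(b, low_rank)    # project back up to d_out dimensions
    base = matvec(w0, x)           # frozen pre-trained path
    return [y + (alpha / rank) * d for y, d in zip(base, delta)]


if __name__ == "__main__":
    full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
    print(f"dense: {full} parameters, LoRA adapter: {lora} parameters")
```

With these example sizes the adapter has roughly 2% of the parameters of the dense layer, which is the kind of reduction that makes placing a posterior over the adapted parameters more tractable than over the full weight matrix.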

[CV-13] Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Link: https://arxiv.org/abs/2409.17080
Authors: Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya
Keywords: Large vision-language models, popular adaptation strategy, computer vision tasks, Large vision-language, vision-language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 13 pages, 4 figures. Code released at this https URL

Abstract:Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.

[CV-14] The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification

Link: https://arxiv.org/abs/2409.17069
Authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Keywords: objective perceptual metrics, perceptual metrics, subjective quality, approximated with objective, natural signals
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: text overlap with arXiv:2312.03455

Abstract:The subjective quality of natural signals can be approximated with objective perceptual metrics. Designed to approximate the perceptual behaviour of human observers, perceptual metrics often reflect structures found in natural signals and neurological pathways. Models trained with perceptual metrics as loss functions can capture perceptually meaningful features from the structures held within these metrics. We demonstrate that using features extracted from autoencoders trained with perceptual losses can improve performance on music understanding tasks, i.e. genre classification, over using these metrics directly as distances when learning a classifier. This result suggests improved generalisation to novel signals when using perceptual metrics as loss functions for representation learning.

[CV-15] Benchmarking Domain Generalization Algorithms in Computational Pathology

Link: https://arxiv.org/abs/2409.17063
Authors: Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot
Keywords: shown immense promise, unseen data due, computational pathology, Deep learning, Deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.

[CV-16] Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors

Link: https://arxiv.org/abs/2409.17058
Authors: Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, Xiaochun Cao
Keywords: achieved remarkable success, leveraging large pre-trained, achieved remarkable, remarkable success, success by leveraging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:Diffusion-based image super-resolution (SR) methods have achieved remarkable success by leveraging large pre-trained text-to-image diffusion models as priors. However, these methods still face two challenges: the requirement for dozens of sampling steps to achieve satisfactory results, which limits efficiency in real scenarios, and the neglect of degradation models, which are critical auxiliary information in solving the SR problem. In this work, we introduced a novel one-step SR model, which significantly addresses the efficiency issue of diffusion-based SR methods. Unlike existing fine-tuning strategies, we designed a degradation-guided Low-Rank Adaptation (LoRA) module specifically for SR, which corrects the model parameters based on the pre-estimated degradation information from low-resolution images. This module not only facilitates a powerful data-dependent or degradation-dependent SR model but also preserves the generative prior of the pre-trained diffusion model as much as possible. Furthermore, we tailor a novel training pipeline by introducing an online negative sample generation strategy. Combined with the classifier-free guidance strategy during inference, it largely improves the perceptual quality of the super-resolution results. Extensive experiments have demonstrated the superior efficiency and effectiveness of the proposed model compared to recent state-of-the-art methods.

[CV-17] ControlCity: A Multimodal Diffusion Model Based Approach for Accurate Geospatial Data Generation and Urban Morphology Analysis

Link: https://arxiv.org/abs/2409.17049
Authors: Fangshuo Zhou, Huaxia Li, Rui Hu, Sensen Wu, Hailin Feng, Zhenhong Du, Liuchang Xu
Keywords: Volunteer Geographic Information, building footprint data, Volunteer Geographic, Geographic Information, data
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Volunteer Geographic Information (VGI), with its rich variety, large volume, rapid updates, and diverse sources, has become a critical source of geospatial data. However, VGI data from platforms like OSM exhibit significant quality heterogeneity across different data types, particularly with urban building data. To address this, we propose a multi-source geographic data transformation solution, utilizing accessible and complete VGI data to assist in generating urban building footprint data. We also employ a multimodal data generation framework to improve accuracy. First, we introduce a pipeline for constructing an ‘image-text-metadata-building footprint’ dataset, primarily based on road network data and supplemented by other multimodal data. We then present ControlCity, a geographic data transformation method based on a multimodal diffusion model. This method first uses a pre-trained text-to-image model to align text, metadata, and building footprint data. An improved ControlNet further integrates road network and land-use imagery, producing refined building footprint data. Experiments across 22 global cities demonstrate that ControlCity successfully simulates real urban building patterns, achieving state-of-the-art performance. Specifically, our method achieves an average FID score of 50.94, reducing error by 71.01% compared to leading methods, and a MIoU score of 0.36, an improvement of 38.46%. Additionally, our model excels in tasks like urban morphology transfer, zero-shot city generation, and spatial data completeness assessment. In the zero-shot city task, our method accurately predicts and generates similar urban structures, demonstrating strong generalization. This study confirms the effectiveness of our approach in generating urban building footprint data and capturing complex city characteristics.

[CV-18] GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design

Link: https://arxiv.org/abs/2409.17045
Authors: Phillip Mueller, Sebastian Mueller, Lars Mikelsons
Keywords: enabling Deep Generative, Deep Generative Models, Deep Generative, enabling Deep, automate data labeling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We provide a dataset for enabling Deep Generative Models (DGMs) in engineering design and propose methods to automate data labeling by utilizing large-scale foundation models. GeoBiked is curated to contain 4,355 bicycle images, annotated with structural and technical features, and is used to investigate two automated labeling techniques: the utilization of consolidated latent features (Hyperfeatures) from image-generation models to detect geometric correspondences (e.g., the position of the wheel center) in structural images, and the generation of diverse text descriptions for structural images. GPT-4o, a vision-language model (VLM), is instructed to analyze images and produce diverse descriptions aligned with the system prompt. By representing technical images as Diffusion-Hyperfeatures, drawing geometric correspondences between them is possible. The detection accuracy of geometric points in unseen samples is improved by presenting multiple annotated source images. GPT-4o has sufficient capabilities to generate accurate descriptions of technical images. Grounding the generation only on images leads to diverse descriptions but causes hallucinations, while grounding it on categorical labels restricts the diversity. Using both as input balances creativity and accuracy. Successfully using Hyperfeatures for geometric correspondence suggests that this approach can be used for general point-detection and annotation tasks in technical images. Labeling such images with text descriptions using VLMs is possible, but depends on the model's detection capabilities, careful prompt engineering, and the selection of input information. Applying foundation models in engineering design is largely unexplored. We aim to bridge this gap with a dataset to explore training, finetuning, and conditioning DGMs in this field, and by suggesting approaches to bootstrap foundation models to process technical images.

[CV-19] EventHDR: from Event to High-Speed HDR Videos and Beyond

Link: https://arxiv.org/abs/2409.17029
Authors: Yunhao Zou, Ying Fu, Tsuyoshi Takatani, Yinqiang Zheng
Keywords: innovative neuromorphic sensors, high-speed HDR videos, HDR videos, high-speed HDR, innovative neuromorphic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TPAMI 2024

Abstract:Event cameras are innovative neuromorphic sensors that asynchronously capture the scene dynamics. Due to the event-triggering mechanism, such cameras record event streams with much shorter response latency and higher intensity sensitivity compared to conventional cameras. On the basis of these features, previous works have attempted to reconstruct high dynamic range (HDR) videos from events, but have either suffered from unrealistic artifacts or failed to provide sufficiently high frame rates. In this paper, we present a recurrent convolutional neural network that reconstructs high-speed HDR videos from event sequences, with key frame guidance to prevent potential error accumulation caused by the sparse event data. Additionally, to address the problem of severely limited real datasets, we develop a new optical system to collect a real-world dataset with paired high-speed HDR videos and event streams, facilitating future research in this field. Our dataset provides the first real paired dataset for event-to-HDR reconstruction, avoiding potential inaccuracies from simulation strategies. Experimental results demonstrate that our method can generate high-quality, high-speed HDR videos. We further explore the potential of our work in cross-camera reconstruction and downstream computer vision tasks, including object detection, panoramic segmentation, optical flow estimation, and monocular depth estimation under HDR scenarios.

[CV-20] Enhanced Wavelet Scattering Network for image inpainting detection

Link: https://arxiv.org/abs/2409.17023
Authors: Barglazan Adrian-Alin, Brad Remus
Keywords: made digital image, Complex Wavelet Transform, image inpainting tools, manipulation alarmingly accessible, digital image manipulation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of image inpainting tools, especially those aimed at removing artifacts, has made digital image manipulation alarmingly accessible. This paper proposes several innovative ideas for detecting inpainting forgeries based on low-level noise analysis: combining the Dual-Tree Complex Wavelet Transform (DT-CWT) for feature extraction with convolutional neural networks (CNNs) for forged area detection and localization, and employing an innovative combination of texture segmentation with noise variance estimation. The DT-CWT offers significant advantages due to its shift-invariance, enhancing its robustness against subtle manipulations during the inpainting process. Furthermore, its directional selectivity allows for the detection of subtle artifacts introduced by inpainting within specific frequency bands and orientations. Various neural network architectures were evaluated and proposed. Lastly, we propose a fusion detection module that combines texture analysis with noise variance estimation to locate the forged area. Our approach was benchmarked against state-of-the-art methods and demonstrated superior performance over all cited alternatives. The training code (with pretrained model weights) as well as the dataset will be available at this https URL

[CV-21] PTQ4RIS: Post-Training Quantization for Referring Image Segmentation

Link: https://arxiv.org/abs/2409.17020
Authors: Xiaoyan Jiang, Hang Yang, Kaiying Zhu, Xihe Qiu, Shibo Zhao, Sifan Zhou
Keywords: Referring Image Segmentation, Referring Image, Image Segmentation, aims to segment, linguistic information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Referring Image Segmentation (RIS) aims to segment the object referred to by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods tend to explore top-performance models, disregarding considerations for practical applications on resource-limited edge devices. This oversight poses a significant challenge for on-device RIS inference. To this end, we propose an effective and efficient post-training quantization framework termed PTQ4RIS. Specifically, we first conduct an in-depth analysis of the root causes of performance degradation in RIS model quantization and propose dual-region quantization (DRQ) and reorder-based outlier-retained quantization (RORQ) to address the quantization difficulties in visual and text encoders. Extensive experiments on three benchmarks with different bit settings (from 8 to 4 bits) demonstrate its superior performance. Importantly, ours is the first PTQ method specifically designed for the RIS task, highlighting the feasibility of PTQ in RIS applications. Code will be available at this https URL.

[CV-22] CNN Mixture-of-Depths ACCV

Link: https://arxiv.org/abs/2409.17016
Authors: Rinor Cakaj, Jens Mehnert, Bin Yang
Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, selectively processing channels, processing channels based
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Conference Paper of the Asian Conference on Computer Vision (ACCV) 2024

Abstract:We introduce Mixture-of-Depths (MoD) for Convolutional Neural Networks (CNNs), a novel approach that enhances the computational efficiency of CNNs by selectively processing channels based on their relevance to the current prediction. This method optimizes computational resources by dynamically selecting key channels in feature maps for focused processing within the convolutional blocks (Conv-Blocks), while skipping less relevant channels. Unlike conditional computation methods that require dynamic computation graphs, CNN MoD uses a static computation graph with fixed tensor sizes, which improves hardware efficiency. It speeds up the training and inference processes without the need for customized CUDA kernels, unique loss functions, or finetuning. CNN MoD either matches the performance of traditional CNNs with reduced inference times, GMACs, and parameters, or exceeds their performance while maintaining similar inference times, GMACs, and parameters. For example, on ImageNet, ResNet86-MoD exceeds the performance of the standard ResNet50 by 0.45% with a 6% speedup on CPU and 5% on GPU. Moreover, ResNet75-MoD achieves the same performance as ResNet50 with a 25% speedup on CPU and 15% on GPU.
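
The selection idea in this abstract can be sketched as follows: score every channel, keep a fixed number k of the highest-scoring channels for processing, and pass the rest through unchanged. Because k is fixed, the tensor handed to the processing step always has the same size, which is what permits a static computation graph. This is a toy illustration under stated assumptions, not the paper's implementation: scoring by mean absolute activation, the residual-style fusion, and the names `select_top_k_channels` and `mod_block` are all hypothetical, and the abstract does not describe how the authors score or fuse channels.

```python
def select_top_k_channels(feature_map, k):
    """Return the sorted indices of the k channels with the largest score.

    Assumption for illustration: a channel's score is its mean absolute
    activation. `feature_map` is a list of channels, each a flat list of floats.
    """
    scores = [sum(abs(v) for v in ch) / len(ch) for ch in feature_map]
    ranked = sorted(range(len(feature_map)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_block(feature_map, k, process):
    """Apply `process` to the k selected channels only; skip the others.

    Processed channels are added back onto their inputs (a residual-style
    fusion, assumed here), so the output keeps the input's shape.
    """
    selected = select_top_k_channels(feature_map, k)
    output = [list(ch) for ch in feature_map]  # skipped channels pass through
    for idx in selected:
        processed = process(feature_map[idx])
        output[idx] = [a + b for a, b in zip(feature_map[idx], processed)]
    return output, selected


if __name__ == "__main__":
    fmap = [[0.1, 0.1], [2.0, -2.0], [0.5, 0.5], [1.0, 1.0]]
    out, chosen = mod_block(fmap, k=2, process=lambda ch: [2.0 * v for v in ch])
    print("processed channels:", chosen)
    print("output:", out)
```

In a real network `process` would be the convolutional block, and its cost would scale with k rather than with the total channel count.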

[CV-23] Adverse Weather Optical Flow: Cumulative Homogeneous-Heterogeneous Adaptation

Link: https://arxiv.org/abs/2409.17001
Authors: Hanyu Zhou, Yi Chang, Zhiwei Shi, Wending Yan, Gang Chen, Yonghong Tian, Luxin Yan
Keywords: made great progress, gradient continuity assumptions, real degraded domains, Optical flow, real adverse weather
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical flow has made great progress in clean scenes, but suffers degradation under adverse weather due to the violation of the brightness constancy and gradient continuity assumptions of optical flow. Existing methods mainly adopt domain adaptation to transfer motion knowledge from the clean to the degraded domain through one-stage adaptation. However, this direct adaptation is ineffective, since there exists a large gap due to adverse weather and scene style between clean and real degraded domains. Moreover, even within the degraded domain itself, static weather (e.g., fog) and dynamic weather (e.g., rain) have different impacts on optical flow. To address the above issues, we explore the synthetic degraded domain as an intermediate bridge between clean and real degraded domains, and propose a cumulative homogeneous-heterogeneous adaptation framework for real adverse weather optical flow. Specifically, for clean-degraded transfer, our key insight is that static weather possesses the depth-association homogeneous feature, which does not change the intrinsic motion of the scene, while dynamic weather additionally introduces the heterogeneous feature, which results in a significant boundary discrepancy in warp errors between clean and degraded domains. For synthetic-real transfer, we find that cost volume correlation shares a similar statistical histogram between synthetic and real degraded domains, which helps to holistically align the homogeneous correlation distribution for synthetic-real knowledge distillation. Under this unified framework, the proposed method can progressively and explicitly transfer knowledge from clean scenes to real adverse weather. In addition, we collect a real adverse weather dataset with manually annotated optical flow labels and perform extensive experiments to verify the superiority of the proposed method.

[CV-24] WasteGAN: Data Augmentation for Robotic Waste Sorting through Generative Adversarial Networks IROS2024

Link: https://arxiv.org/abs/2409.16999
Authors: Alberto Bacchin, Leonardo Barcellona, Matteo Terreran, Stefano Ghidoni, Emanuele Menegatti, Takuya Kiyokawa
Keywords: cluttered conveyor belt, poses significant challenges, Robotic waste sorting, perception and manipulation, conveyor belt
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Abstract:Robotic waste sorting poses significant challenges in both perception and manipulation, given the extreme variability of objects that should be recognized on a cluttered conveyor belt. While deep learning has proven effective in solving complex tasks, the necessity for extensive data collection and labeling limits its applicability in real-world scenarios like waste sorting. To tackle this issue, we introduce a data augmentation method based on a novel GAN architecture called wasteGAN. The proposed method makes it possible to increase the performance of semantic segmentation models starting from a very limited set of labeled examples, as few as 100. The key innovations of wasteGAN include a novel loss function, a novel activation function, and a larger generator block. Overall, these innovations help the network to learn from a limited number of examples and synthesize data that better mirrors real-world distributions. We then leverage the higher-quality segmentation masks predicted by models trained on the wasteGAN synthetic data to compute semantic-aware grasp poses, enabling a robotic arm to effectively recognize contaminants and separate waste in a real-world scenario. Through comprehensive evaluation encompassing dataset-based assessments and real-world experiments, our methodology demonstrated promising potential for robotic waste sorting, yielding performance gains of up to 5.8% in picking contaminants. The project page is available at this https URL

[CV-25] Single Image Any Face: Generalisable 3D Face Generation

Link: https://arxiv.org/abs/2409.16990
Authors: Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu
Keywords: underlies numerous real-world, numerous real-world vision, graphics applications, fundamental task, task that underlies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The creation of 3D human face avatars from a single unconstrained image is a fundamental task that underlies numerous real-world vision and graphics applications. Despite the significant progress made in generative models, existing methods are either less suited in design for human faces or fail to generalise from the restrictive training domain to unconstrained facial images. To address these limitations, we propose a novel model, Gen3D-Face, which generates 3D human faces with unconstrained single image input within a multi-view consistent diffusion framework. Given a specific input image, our model first produces multi-view images, followed by neural surface construction. To incorporate face geometry information in a generalisable manner, we utilise input-conditioned mesh estimation instead of ground-truth mesh along with synthetic multi-view training data. Importantly, we introduce a multi-view joint generation scheme to enhance appearance consistency among different views. To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images for generic human subjects across domains. Extensive experiments demonstrate the superiority of our method over previous alternatives for out-of-domain single-image 3D face generation and top competition for the in-domain setting.

[CV-26] Multi-Robot Informative Path Planning for Efficient Target Mapping using Deep Reinforcement Learning

Link: https://arxiv.org/abs/2409.16967
Authors: Apoorva Vashisth, Dipam Patel, Damon Conover, Aniket Bera
Keywords: low labor costs, collection tasks due, data collection tasks, Autonomous robots, labor costs
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2402.04894

Abstract:Autonomous robots are being employed in several mapping and data collection tasks due to their efficiency and low labor costs. In these tasks, the robots are required to map targets-of-interest in an unknown environment while constrained to a given resource budget such as path length or mission time. This is a challenging problem as each robot has to not only detect and avoid collisions from static obstacles in the environment but also has to model other robots’ trajectories to avoid inter-robot collisions. We propose a novel deep reinforcement learning approach for multi-robot informative path planning to map targets-of-interest in an unknown 3D environment. A key aspect of our approach is an augmented graph that models other robots’ trajectories to enable planning for communication and inter-robot collision avoidance. We train our decentralized reinforcement learning policy via the centralized training and decentralized execution paradigm. Once trained, our policy is also scalable to varying number of robots and does not require re-training. Our approach outperforms other state-of-the-art multi-robot target mapping approaches by 33.75% in terms of the number of discovered targets-of-interest. We open-source our code and model at: this https URL

[CV-27] Path-adaptive Spatio-Temporal State Space Model for Event-based Recognition with Arbitrary Duration

Link: https://arxiv.org/abs/2409.16953
Authors: Jiazhou Zhou, Kanghao Chen, Lei Zhang, Lin Wang
Keywords: output event streams, high temporal resolution, distinct advantages, bio-inspired sensors, sensors that capture
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: First version

Abstract:Event cameras are bio-inspired sensors that capture the intensity changes asynchronously and output event streams with distinct advantages, such as high temporal resolution. To exploit event cameras for object/action recognition, existing methods predominantly sample and aggregate events in a second-level duration at every fixed temporal interval (or frequency). However, they often face difficulties in capturing the spatiotemporal relationships for longer, e.g., minute-level, events and generalizing across varying temporal frequencies. To fill the gap, we present a novel framework, dubbed PAST-SSM, exhibiting superior capacity in recognizing events with arbitrary duration (e.g., 0.1s to 4.5s) and generalizing to varying inference frequencies. Our key insight is to learn the spatiotemporal relationships from the encoded event features via the state space model (SSM) – whose linear complexity makes it ideal for modeling high temporal resolution events with longer sequences. To achieve this goal, we first propose a Path-Adaptive Event Aggregation and Scan (PEAS) module to encode events of varying duration into features with fixed dimensions by adaptively scanning and selecting aggregated event frames. On top of PEAS, we introduce a novel Multi-faceted Selection Guiding (MSG) loss to minimize the randomness and redundancy of the encoded features. This subtly enhances the model generalization across different inference frequencies. Lastly, the SSM is employed to better learn the spatiotemporal properties from the encoded features. Moreover, we build a minute-level event-based recognition dataset, named ArDVS100, with arbitrary duration for the benefit of the community. Extensive experiments prove that our method outperforms prior arts by +3.45%, +0.38% and +8.31% on the DVS Action, SeAct and HARDVS datasets, respectively.

[CV-28] DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling ECCV

Link: https://arxiv.org/abs/2409.16949
Authors: Kyuheon Jung,Yongdeuk Seo,Seongwoo Cho,Jaeyoung Kim,Hyun-seok Min,Sungchul Choi
Keywords (EN): Large Language Model, Language Model, Diffusion Model, Large Language, effective data augmentation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV Synthetic Data for Computer Vision Workshop (Oral)

Abstract:In this paper, we present an effective data augmentation framework leveraging the Large Language Model (LLM) and Diffusion Model (DM) to tackle the challenges inherent in data-scarce scenarios. Recently, DMs have opened up the possibility of generating synthetic images to complement a few training images. However, increasing the diversity of synthetic images also raises the risk of generating samples outside the target distribution. Our approach addresses this issue by embedding novel semantic information into text prompts via LLM and utilizing real images as visual prompts, thus generating semantically rich images. To ensure that the generated images remain within the target distribution, we dynamically adjust the guidance weight based on each image’s CLIPScore to control the diversity. Experimental results show that our method produces synthetic images with enhanced diversity while maintaining adherence to the target distribution. Consequently, our approach proves to be more efficient in the few-shot setting on several benchmarks. Our code is available at this https URL .

[CV-29] NTIRE 2024 Challenge on Stereo Image Super-Resolution: Methods and Results

Link: https://arxiv.org/abs/2409.16947
Authors: Longguang Wang,Yulan Guo,Juncheng Li,Hongda Liu,Yang Zhao,Yingqian Wang,Zhi Jin,Shuhang Gu,Radu Timofte
Keywords (EN): NTIRE challenge, stereo image super-resolution, paper summarizes, stereo image, NTIRE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper summarizes the 3rd NTIRE challenge on stereo image super-resolution (SR) with a focus on new solutions and results. The task of this challenge is to super-resolve a low-resolution stereo image pair to a high-resolution one with a magnification factor of x4 under a limited computational budget. Compared with single image SR, the major challenge of this challenge lies in how to exploit additional information in another viewpoint and how to maintain stereo consistency in the results. This challenge has 2 tracks, including one track on bicubic degradation and one track on real degradations. In total, 108 and 70 participants were successfully registered for each track, respectively. In the test phase, 14 and 13 teams successfully submitted valid results with PSNR (RGB) scores better than the baseline. This challenge establishes a new benchmark for stereo image SR.
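
The ranking metric named in this abstract, PSNR, follows the standard definition PSNR = 10·log10(MAX² / MSE). A minimal sketch of that formula is given below, assuming 8-bit RGB values flattened into one list. The function name `psnr` and the sample pixel values are hypothetical, and the challenge's exact evaluation protocol (for example any border cropping or averaging over the two views) is not stated here and is not reproduced.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized value lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all channel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Flattened RGB values of a hypothetical 2-pixel image (illustrative numbers only).
ground_truth = [10, 20, 30, 200, 210, 220]
super_resolved = [12, 18, 30, 198, 212, 220]
score = psnr(ground_truth, super_resolved)
print(f"PSNR: {score:.2f} dB")
```

A higher PSNR means a lower mean squared error against the ground-truth image, which is why the abstract counts only submissions that beat the baseline's PSNR.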

[CV-30] Face Forgery Detection with Elaborate Backbone

Link: https://arxiv.org/abs/2409.16945
Authors: Zonghui Guo,Yingjie Liu,Jie Zhang,Haiyong Zheng,Shiguang Shan
Keywords (EN): FFD, Deepfake detection, Face Forgery Detection, FFD models, aims to determine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Due to different face synthesis algorithms with diverse forgery patterns, FFD models often overfit specific patterns in training datasets, resulting in poor generalization to other unseen forgeries. This severe challenge requires FFD models to possess strong capabilities in representing complex facial features and extracting subtle forgery cues. Although previous FFD models directly employ existing backbones to represent and extract facial forgery cues, the critical role of backbones is often overlooked, particularly as their knowledge and capabilities are insufficient to address FFD challenges, inevitably limiting generalization. Therefore, it is essential to integrate the backbone pre-training configurations and seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. Specifically, we analyze the crucial contributions of backbones with different configurations in FFD task and propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone, equipping it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone’s ability to extract diverse forgery cues within a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve the inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and extra face-related tasks, i.e., presentation attack detection. Code and models are available at this https URL.

[CV-31] Go-SLAM: Grounded Object Segmentation and Localization with Gaussian Splatting SLAM

Link: https://arxiv.org/abs/2409.16944
Authors: Phu Pham,Dipam Patel,Damon Conover,Aniket Bera
Keywords (EN): Gaussian Splatting SLAM, Splatting SLAM, embedding object-level information, Gaussian Splatting, SLAM to reconstruct
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:We introduce Go-SLAM, a novel framework that utilizes 3D Gaussian Splatting SLAM to reconstruct dynamic environments while embedding object-level information within the scene representations. This framework employs advanced object segmentation techniques, assigning a unique identifier to each Gaussian splat that corresponds to the object it represents. Consequently, our system facilitates open-vocabulary querying, allowing users to locate objects using natural language descriptions. Furthermore, the framework features an optimal path generation module that calculates efficient navigation paths for robots toward queried objects, considering obstacles and environmental uncertainties. Comprehensive evaluations in various scene settings demonstrate the effectiveness of our approach in delivering high-fidelity scene reconstructions, precise object segmentation, flexible object querying, and efficient robot path planning. This work represents an additional step forward in bridging the gap between 3D scene reconstruction, semantic object understanding, and real-time environment interactions.

[CV-32] Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model

Link: https://arxiv.org/abs/2409.16938
Authors: Hongliang Zhong,Can Wang,Jingbo Zhang,Jing Liao
Keywords (EN): versatile scene recreation, achieving versatile scene, scene recreation, achieving versatile, versatile scene
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project Page: this https URL

Abstract:Generating and inserting new objects into 3D content is a compelling approach for achieving versatile scene recreation. Existing methods, which rely on SDS optimization or single-view inpainting, often struggle to produce high-quality results. To address this, we propose a novel method for object insertion in 3D content represented by Gaussian Splatting. Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained stable video diffusion model to facilitate view-consistent object inpainting. Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation. After generating the multi-view inpainted results, we further propose a mask-aware 3D reconstruction technique to refine Gaussian Splatting reconstruction from these sparse inpainted views. By leveraging these techniques, our approach yields diverse results, ensures view-consistent and harmonious insertions, and produces better object quality. Extensive experiments demonstrate that our approach outperforms existing methods.

[CV-33] Game4Loc: A UAV Geo-Localization Benchmark from Game Data

Link: https://arxiv.org/abs/2409.16925
Authors: Yuxiang Ji,Boyong He,Zhuoyue Tan,Liaoni Wu
Keywords (EN): navigation satellite systems, source of GPS, global navigation satellite, vision-based geo-localization technology, GPS information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:The vision-based geo-localization technology for UAV, serving as a secondary source of GPS information in addition to the global navigation satellite systems (GNSS), can still operate independently in the GPS-denied environment. Recent deep learning based methods attribute this as the task of image matching and retrieval. By retrieving drone-view images in geo-tagged satellite image database, approximate localization information can be obtained. However, due to high costs and privacy concerns, it is usually difficult to obtain large quantities of drone-view images from a continuous area. Existing drone-view datasets are mostly composed of small-scale aerial photography with a strong assumption that there exists a perfect one-to-one aligned reference image for any query, leaving a significant gap from the practical localization scenario. In this work, we construct a large-range contiguous area UAV geo-localization dataset named GTA-UAV, featuring multiple flight altitudes, attitudes, scenes, and targets using modern computer games. Based on this dataset, we introduce a more practical UAV geo-localization task including partial matches of cross-view paired data, and expand the image-level retrieval to the actual localization in terms of distance (meters). For the construction of drone-view and satellite-view pairs, we adopt a weight-based contrastive learning approach, which allows for effective learning while avoiding additional post-processing matching steps. Experiments demonstrate the effectiveness of our data and training method for UAV geo-localization, as well as the generalization capabilities to real-world scenarios.

[CV-34] An Adaptive Screen-Space Meshing Approach for Normal Integration

Link: https://arxiv.org/abs/2409.16907
Authors: Moritz Heep,Eduard Zell
Keywords (EN): Reconstructing surfaces, photometric stereo, component of photometric, Reconstructing, key component
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing surfaces from normals is a key component of photometric stereo. This work introduces an adaptive surface triangulation in the image domain and afterwards performs the normal integration on a triangle mesh. Our key insight is that surface curvature can be computed from normals. Based on the curvature, we identify flat areas and aggregate pixels into triangles. The approximation quality is controlled by a single user parameter facilitating a seamless generation of low- to high-resolution meshes. Compared to pixel grids, our triangle meshes adapt locally to surface details and allow for a sparser representation. Our new mesh-based formulation of the normal integration problem is strictly derived from discrete differential geometry and leads to well-conditioned linear systems. Results on real and synthetic data show that 10 to 100 times fewer vertices than pixels are required. Experiments suggest that this sparsity translates into a sublinear runtime in the number of pixels. For 64 MP normal maps, our meshing-first approach generates and integrates meshes in minutes while pixel-based approaches require hours just for the integration.

[CV-35] Towards Underwater Camouflaged Object Tracking: An Experimental Evaluation of SAM and SAM 2

Link: https://arxiv.org/abs/2409.16902
Authors: Chunhui Zhang,Li Liu,Guanjie Huang,Hao Wen,Xi Zhou,Yanfeng Wang
Keywords (EN): visual object tracking, object tracking, object tracking methods, object tracking dataset, visual object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. Work in Progress

Abstract:Over the past decade, significant progress has been made in visual object tracking, largely due to the availability of large-scale training datasets. However, existing tracking datasets are primarily focused on open-air scenarios, which greatly limits the development of object tracking in underwater environments. To address this issue, we take a step forward by proposing the first large-scale underwater camouflaged object tracking dataset, namely UW-COT. Based on the proposed dataset, this paper presents an experimental evaluation of several advanced visual object tracking methods and the latest advancements in image and video segmentation. Specifically, we compare the performance of the Segment Anything Model (SAM) and its updated version, SAM 2, in challenging underwater environments. Our findings highlight the improvements in SAM 2 over SAM, demonstrating its enhanced capability to handle the complexities of underwater camouflaged objects. Compared to current advanced visual object tracking methods, the latest video segmentation foundation model SAM 2 also exhibits significant advantages, providing valuable insights into the development of more effective tracking technologies for underwater scenarios. The dataset will be accessible at this https URL.

[CV-36] HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space

Link: https://arxiv.org/abs/2409.16897
Authors: Jacob Fein-Ashley,Ethan Feng,Minh Pham
Keywords (EN): Vision Transformer, Hyperbolic Vision Transformer, representation in non-Euclidean, complex relationships, relationships in real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations. This enables more effective modeling of hierarchical and relational dependencies in image data. We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization. We offer improved performance for image classification using the ImageNet dataset.
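
The two hyperbolic building blocks this abstract names, Möbius operations and hyperbolic distance, have standard closed forms on the Poincaré ball. The sketch below shows those textbook formulas only. It assumes the Poincaré ball model with curvature parameter `c`, and the function names `mobius_add` and `poincare_distance` are hypothetical. How HVT wires these into attention layers is not described in the abstract and is not reproduced here.

```python
import math


def mobius_add(x, y, c=1.0):
    """Möbius addition x (+)_c y on the Poincaré ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def poincare_distance(x, y, c=1.0):
    """d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    return 2 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)


# Two points near the boundary of the unit ball (illustrative values only).
u, v = [0.9, 0.0], [0.0, 0.9]
hyperbolic = poincare_distance(u, v)
euclidean = math.dist(u, v)
print(f"hyperbolic: {hyperbolic:.3f}, Euclidean: {euclidean:.3f}")
```

Distances grow rapidly toward the boundary of the ball, which is the property that makes hyperbolic space a good fit for tree-like hierarchies.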

[CV-37] Linking in Style: Understanding learned features in deep learning models

Link: https://arxiv.org/abs/2409.16865
Authors: Maren H. Wehrheim,Pamela Osuna-Vargas,Matthias Kaschube
Keywords (EN): Convolutional neural networks, perform object classification, high computational costs, remains challenging due, features remains challenging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Convolutional neural networks (CNNs) learn abstract features to perform object classification, but understanding these features remains challenging due to difficult-to-interpret results or high computational costs. We propose an automatic method to visualize and systematically analyze learned features in CNNs. Specifically, we introduce a linking network that maps the penultimate layer of a pre-trained classifier to the latent space of a generative model (StyleGAN-XL), thereby enabling an interpretable, human-friendly visualization of the classifier’s representations. Our findings indicate a congruent semantic order in both spaces, enabling a direct linear mapping between them. Training the linking network is computationally inexpensive and decoupled from training both the GAN and the classifier. We introduce an automatic pipeline that utilizes such GAN-based visualizations to quantify learned representations by analyzing activation changes in the classifier in the image domain. This quantification allows us to systematically study the learned representations in several thousand units simultaneously and to extract and visualize units selective for specific semantic concepts. Further, we illustrate how our method can be used to quantify and interpret the classifier’s decision boundary using counterfactual examples. Overall, our method offers systematic and objective perspectives on learned abstract representations in CNNs. this https URL

[CV-38] Towards Unified 3D Hair Reconstruction from Single-View Portraits SIGGRAPH

Link: https://arxiv.org/abs/2409.16863
Authors: Yujian Zheng,Yuda Qiu,Leyang Jin,Chongyang Ma,Haibin Huang,Di Zhang,Pengfei Wan,Xiaoguang Han
Keywords (EN): wide range, range of shape, shape variations, hair, complex hairstyles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH Asia 2024, project page: this https URL

Abstract:Single-view 3D hair reconstruction is challenging, due to the wide range of shape variations among diverse hairstyles. Current state-of-the-art methods are specialized in recovering un-braided 3D hairs and often take braided styles as their failure cases, because of the inherent difficulty to define priors for complex hairstyles, whether rule-based or data-based. We propose a novel strategy to enable single-view 3D reconstruction for a variety of hair types via a unified pipeline. To achieve this, we first collect a large-scale synthetic multi-view hair dataset SynMvHair with diverse 3D hair in both braided and un-braided styles, and learn two diffusion priors specialized on hair. Then we optimize 3D Gaussian-based hair from the priors with two specially designed modules, i.e. view-wise and pixel-wise Gaussian refinement. Our experiments demonstrate that reconstructing braided and un-braided 3D hair from single-view images via a unified approach is possible and our method achieves the state-of-the-art performance in recovering complex hairstyles. It is worth mentioning that our method shows good generalization ability to real images, although it learns hair priors from synthetic data.

[CV-39] Limitations of (Procrustes) Alignment in Assessing Multi-Person Human Pose and Shape Estimation

Link: https://arxiv.org/abs/2409.16861
Authors: Drazic Martin,Pierre Perrault
Keywords (EN): video surveillance scenarios, accurately estimating, human pose, surveillance scenarios, challenges of accurately
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:We delve into the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios. Beginning with the advocacy for metrics like W-MPJPE and W-PVE, which omit the (Procrustes) realignment step, to improve model evaluation, we then introduce RotAvat. This technique aims to enhance these metrics by refining the alignment of 3D meshes with the ground plane. Through qualitative comparisons, we demonstrate RotAvat’s effectiveness in addressing the limitations of existing approaches.
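
The reason a Procrustes step can hide errors is easy to see on a toy case: a pose that is internally correct but tilted relative to the ground scores perfectly after alignment. The sketch below is a 2D simplification under stated assumptions. Joints are stored as complex numbers so the best similarity transform has a closed form, the skeleton and the 30-degree tilt are made-up values, and the helper names `mpjpe` and `procrustes_align` are hypothetical. It is not the paper's 3D evaluation code.

```python
import cmath
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of pred onto gt.

    With joints as complex numbers x + yj, rotation plus scale is one complex
    factor z minimizing sum |z * p - g|^2 over the centered joints.
    """
    mu_p = sum(pred) / len(pred)
    mu_g = sum(gt) / len(gt)
    p0 = [p - mu_p for p in pred]
    g0 = [g - mu_g for g in gt]
    z = sum(p.conjugate() * g for p, g in zip(p0, g0)) / sum(abs(p) ** 2 for p in p0)
    return [z * p + mu_g for p in p0]


# Hypothetical 2D skeleton standing on the ground (foot at the origin).
gt = [0 + 0j, 0 + 1j, 0 + 2j, -0.5 + 1.5j, 0.5 + 1.5j]
# Prediction: the same pose, tilted 30 degrees about the foot.
tilt = cmath.exp(1j * math.radians(30))
pred = [tilt * g for g in gt]

world_error = mpjpe(pred, gt)                           # no alignment step
aligned_error = mpjpe(procrustes_align(pred, gt), gt)   # after Procrustes
print(f"without alignment: {world_error:.3f}, with alignment: {aligned_error:.3f}")
```

The aligned error collapses to numerical zero while the unaligned error stays large, so a metric that skips the realignment step still penalizes a person who is leaning into the floor.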

[CV-40] The Role of Language Models in Modern Healthcare: A Comprehensive Review

Link: https://arxiv.org/abs/2409.16860
Authors: Amna Khalid,Ayma Khalid,Umar Khalid
Keywords (EN): gained significant attention, significant attention due, process complex medical, large language models, clinical decision-making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The application of large language models (LLMs) in healthcare has gained significant attention due to their ability to process complex medical data and provide insights for clinical decision-making. These models have demonstrated substantial capabilities in understanding and generating natural language, which is crucial for medical documentation, diagnostics, and patient interaction. This review examines the trajectory of language models from their early stages to the current state-of-the-art LLMs, highlighting their strengths in healthcare applications and discussing challenges such as data privacy, bias, and ethical considerations. The potential of LLMs to enhance healthcare delivery is explored, alongside the necessary steps to ensure their ethical and effective integration into medical practice.

[CV-41] A Versatile and Differentiable Hand-Object Interaction Representation

Link: https://arxiv.org/abs/2409.16855
Authors: Théo Morales,Omid Taheri,Gerard Lacey
Keywords (EN): Synthesizing accurate hands-object, Augmented Reality, Mixed Reality, Computer Vision, Synthesizing accurate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the Winter Applications in Computer Vision 2025 conference. 9 pages, 6 figures

Abstract:Synthesizing accurate hands-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR we design JointDiffusion, a diffusion model to learn a grasp distribution conditioned on noisy hand-object interactions or only object geometries, for both refinement and synthesis applications. We demonstrate JointDiffusion’s improvements over the SOTA in both applications: it increases the contact F1 score by 5% for refinement and decreases the sim. displacement by 46% for synthesis. Our experiments show that JointDiffusion with CHOIR yield superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks. Our models and code will be publicly available to the research community.

[CV-42] Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms

Link: https://arxiv.org/abs/2409.16850
Authors: Chun-Jung Lin,Sourav Garg,Tat-Jun Chin,Feras Dayoub
Keywords (EN): visual foundational model, address key challenges, robust feature extraction, integrates full-image cross-attention, scene change detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) “freeze” the backbone in order to retain the generality of dense foundation features, and b) employ “full-image” cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate our method’s superior generalization capabilities over existing state-of-the-art approaches, showing robustness against photometric and geometric variations as well as better overall generalization when fine-tuned to adapt to new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Source code will be made publicly available upon acceptance.

[CV-43] IRASNet: Improved Feature-Level Clutter Reduction for Domain Generalized SAR-ATR

Link: https://arxiv.org/abs/2409.16845
Authors: Oh-Tae Jang,Hae-Kang Song,Min-Jun Kim,Kyung-Hwan Lee,Geon Lee,Sung-Ho Kim,Kyung-Tae Kim
Keywords (EN): computer-aided design models, augment synthetic aperture, computer-aided design, electromagnetic simulations, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures

Abstract:Recently, computer-aided design models and electromagnetic simulations have been used to augment synthetic aperture radar (SAR) data for deep learning. However, an automatic target recognition (ATR) model struggles with domain shift when using synthetic data because the model learns specific clutter patterns present in such data, which disturbs performance when applied to measured data with different clutter distributions. This study proposes a framework particularly designed for domain-generalized SAR-ATR called IRASNet, enabling effective feature-level clutter reduction and domain-invariant feature learning. First, we propose a clutter reduction module (CRM) that maximizes the signal-to-clutter ratio on feature maps. The module reduces the impact of clutter at the feature level while preserving target and shadow information, thereby improving ATR performance. Second, we integrate adversarial learning with CRM to extract clutter-reduced domain-invariant features. The integration bridges the gap between synthetic and measured datasets without requiring measured data during training. Third, we improve feature extraction from target and shadow regions by implementing a positional supervision task using mask ground truth encoding. The improvement enhances the ability of the model to discriminate between classes. Our proposed IRASNet sets a new state of the art on public SAR datasets, utilizing target and shadow information to achieve superior performance across various test conditions. IRASNet not only enhances generalization performance but also significantly improves feature-level clutter reduction, making it a valuable advancement in the field of radar image pattern recognition.

[CV-44] Explicitly Modeling Pre-Cortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

Link: https://arxiv.org/abs/2409.16838
Authors: Lucas Piper,Arlindo L. Oliveira,Tiago Marques
Keywords (EN): convolutional neural networks, classify images corrupted, neural networks, limiting their real-world, real-world applicability
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:While convolutional neural networks (CNNs) excel at clean image classification, they struggle to classify images corrupted with different common corruptions, limiting their real-world applicability. Recent work has shown that incorporating a CNN front-end block that simulates some features of the primate primary visual cortex (V1) can improve overall model robustness. Here, we expand on this approach by introducing two novel biologically-inspired CNN model families that incorporate a new front-end block designed to simulate pre-cortical visual processing. RetinaNet, a hybrid architecture containing the novel front-end followed by a standard CNN back-end, shows a relative robustness improvement of 12.3% when compared to the standard model; and EVNet, which further adds a V1 block after the pre-cortical front-end, shows a relative gain of 18.5%. The improvement in robustness was observed for all the different corruption categories, though accompanied by a small decrease in clean image accuracy, and generalized to a different back-end architecture. These findings show that simulating multiple stages of early visual processing in CNN early layers provides cumulative benefits for model robustness.

[CV-45] Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

Link: https://arxiv.org/abs/2409.16827
Authors: Xu Han,Junyu Gao,Chuang Yang,Yuan Yuan,Qi Wang
Keywords (EN): formidable challenge, diversity of scene, efficiently detecting text, model, Due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM’s ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.

[CV-46] XAI-guided Insulator Anomaly Detection for Imbalanced Datasets ECCV2024

Link: https://arxiv.org/abs/2409.16821
Authors: Maximilian Andreas Hoefler,Karsten Mueller,Wojciech Samek
Keywords (EN): Power grids serve, seamlessly delivering electrical, reliable operation indispensable, delivering electrical energy, Power grids
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as a workshop paper at ECCV 2024

Abstract:Power grids serve as a vital component in numerous industries, seamlessly delivering electrical energy to industrial processes and technologies, making their safe and reliable operation indispensable. However, powerlines can be hard to inspect due to difficult terrain or harsh climatic conditions. Therefore, unmanned aerial vehicles are increasingly deployed to inspect powerlines, resulting in a substantial stream of visual data which requires swift and accurate processing. Deep learning methods have become widely popular for this task, proving to be a valuable asset in fault detection. In particular, the detection of insulator defects is crucial for predicting powerline failures, since their malfunction can lead to transmission disruptions. It is therefore of great interest to continuously maintain and rigorously inspect insulator components. In this work we propose a novel pipeline to tackle this task. We utilize state-of-the-art object detection to detect and subsequently classify individual insulator anomalies. Our approach addresses dataset challenges such as imbalance and motion-blurred images through a fine-tuning methodology which allows us to alter the classification focus of the model by increasing the classification accuracy of anomalous insulators. In addition, we employ explainable-AI tools for precise localization and explanation of anomalies. This proposed method contributes to the field of anomaly detection, particularly vision-based industrial inspection and predictive maintenance. We significantly improve defect detection accuracy by up to 13%, while also offering a detailed analysis of model mis-classifications and localization quality, showcasing the potential of our method on real-world data.

[CV-47] Spotlight Text Detector: Spotlight on Candidate Regions Like a Camera

Link: https://arxiv.org/abs/2409.16820
Authors: Xu Han,Junyu Gao,Chuang Yang,Yuan Yuan,Qi Wang
Keywords: irregular contour representation, irregular contour, contour representation, tough challenges, scene text detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of geographically close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to restructure texts. However, the text kernel is an artificial object with incomplete semantic features, which makes it prone to incorrect or missing detection. In addition, unlike general objects, the geometric features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes it difficult to detect them accurately. To address the above problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features of scene texts. It helps extract various spatial relationships to improve the model’s ability to recognize kernel regions. Ablation studies demonstrate the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.

[CV-48] Inline Photometrically Calibrated Hybrid Visual SLAM

Link: https://arxiv.org/abs/2409.16810
Authors: Nicolas Abboud,Malak Sayour,Imad H. Elhajj,John Zelek,Daniel Asmar
Keywords: direct-indirect visual SLAM, sequential photometric calibration, merging online sequential, Hybrid direct-indirect visual, Visual SLAM
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:This paper presents an integrated approach to Visual SLAM that merges online sequential photometric calibration into a Hybrid direct-indirect visual SLAM (H-SLAM). Photometric calibration helps normalize pixel intensity values under different lighting conditions, and thereby improves the direct component of our H-SLAM. The indirect component of H-SLAM also benefits, since the detected features are more stable across variable lighting conditions. Our proposed photometrically calibrated H-SLAM is tested on several datasets, including TUM monoVO as well as a dataset we created. Calibrated H-SLAM outperforms other state-of-the-art direct, indirect, and hybrid Visual SLAM systems in all the experiments. Furthermore, in online SLAM tested at our site, it also significantly outperformed the other SLAM systems.

[CV-49] Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices

Link: https://arxiv.org/abs/2409.16808
Authors: Daghash K. Alqahtani,Aamir Cheema,Adel N. Toosi
Keywords: Jetson Orin Nano, Modern applications, resource-constrained edge devices, object detection models, autonomous vehicles
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Comments:

Abstract:Modern applications, such as autonomous vehicles, require deploying deep learning algorithms on resource-constrained edge devices for real-time image and video processing. However, there is limited understanding of the efficiency and performance of various object detection models on these devices. In this paper, we evaluate state-of-the-art object detection models, including YOLOv8 (Nano, Small, Medium), EfficientDet Lite (Lite0, Lite1, Lite2), and SSD (SSD MobileNet V1, SSDLite MobileDet). We deployed these models on popular edge devices like the Raspberry Pi 3, 4, and 5 with/without TPU accelerators, and Jetson Orin Nano, collecting key performance metrics such as energy consumption, inference time, and Mean Average Precision (mAP). Our findings highlight that lower mAP models such as SSD MobileNet V1 are more energy-efficient and faster in inference, whereas higher mAP models like YOLOv8 Medium generally consume more energy and have slower inference, though with exceptions when accelerators like TPUs are used. Among the edge devices, Jetson Orin Nano stands out as the fastest and most energy-efficient option for request handling, despite having the highest idle energy consumption. These results emphasize the need to balance accuracy, speed, and energy efficiency when deploying deep learning models on edge devices, offering valuable guidance for practitioners and researchers selecting models and devices for their applications.

[CV-50] Topological SLAM in colonoscopies leveraging deep features and topological priors MICCAI2024

Link: https://arxiv.org/abs/2409.16806
Authors: Javier Morlana,Juan D. Tardós,José M. M. Montiel
Keywords: multiple-map metric SLAM, classical multiple-map metric, combines classical multiple-map, create topological maps, classical multiple-map
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024

Abstract:We introduce ColonSLAM, a system that combines classical multiple-map metric SLAM with deep features and topological priors to create topological maps of the whole colon. The SLAM pipeline by itself is able to create disconnected individual metric submaps representing locations from short video subsections of the colon, but is not able to merge covisible submaps due to deformations and the limited performance of the SIFT descriptor in the medical domain. ColonSLAM is guided by topological priors and combines a deep localization network trained to distinguish if two images come from the same place or not and the soft verification of a transformer-based matching network, being able to relate far-in-time submaps during an exploration, grouping them in nodes imaging the same colon place, building more complex maps than any other approach in the literature. We demonstrate our approach in the Endomapper dataset, showing its potential for producing maps of the whole colon in real human explorations. Code and models are available at: this https URL.

[CV-51] Scalable Ensemble Diversification for OOD Generalization and Detection

Link: https://arxiv.org/abs/2409.16797
Authors: Alexander Rubinstein,Luca Scimeca,Damien Teney,Seong Joon Oh
Keywords: Bayesian principles, OOD, OOD samples, practical applications, providing candidates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Training a diverse ensemble of models has several practical applications such as providing candidates for model selection with better out-of-distribution (OOD) generalization, and enabling the detection of OOD samples via Bayesian principles. An existing approach to diverse ensemble training encourages the models to disagree on provided OOD samples. However, the approach is computationally expensive and it requires well-separated ID and OOD examples, such that it has only been demonstrated in small-scale settings. Method. This work presents a method for Scalable Ensemble Diversification (SED) applicable to large-scale settings (e.g. ImageNet) that does not require OOD samples. Instead, SED identifies hard training samples on the fly and encourages the ensemble members to disagree on these. To improve scaling, we show how to avoid the expensive computations in existing methods of exhaustive pairwise disagreements across models. Results. We evaluate the benefits of diversification with experiments on ImageNet. First, for OOD generalization, we observe large benefits from the diversification in multiple settings including output-space (classical) ensembles and weight-space ensembles (model soups). Second, for OOD detection, we turn the diversity of ensemble hypotheses into a novel uncertainty score estimator that surpasses a large number of OOD detection baselines. Code is available here: this https URL.

[CV-52] Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data

Link: https://arxiv.org/abs/2409.16793
Authors: Lukas Heine,Fabian Hörst,Jana Fragemann,Gijs Luijten,Miriam Balzer,Jan Egger,Fin Bahnsen,M. Saquib Sarfraz,Jens Kleesiek,Constantin Seibold
Keywords: manufacturing presents significant, presents significant challenges, manufacturing presents, presents significant, significant challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments:

Abstract:Unstructured data in industries such as healthcare, finance, and manufacturing presents significant challenges for efficient analysis and decision making. Detecting patterns within this data and understanding their impact is critical but complex without the right tools. Traditionally, these tasks relied on the expertise of data analysts or labor-intensive manual reviews. In response, we introduce Spacewalker, an interactive tool designed to explore and annotate data across multiple modalities. Spacewalker allows users to extract data representations and visualize them in low-dimensional spaces, enabling the detection of semantic similarities. Through extensive user studies, we assess Spacewalker’s effectiveness in data annotation and integrity verification. Results show that the tool’s ability to traverse latent spaces and perform multi-modal queries significantly enhances the user’s capacity to quickly identify relevant data. Moreover, Spacewalker allows for annotation speed-ups far superior to conventional methods, making it a promising tool for efficiently navigating unstructured data and improving decision making processes. The code of this work is open-source and can be found at: this https URL

[CV-53] MixPolyp: Integrating Mask, Box and Scribble Supervision for Enhanced Polyp Segmentation

Link: https://arxiv.org/abs/2409.16774
Authors: Yiwen Hu,Jun Wei,Yuncheng Jiang,Haoyang Li,Shuguang Cui,Zhen Li,Song Wu
Keywords: polyp segmentation, polyp segmentation paradigm, polyp segmentation models, supervised polyp segmentation, Subspace Projection loss
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE BIBM 2024

Abstract:Limited by the expensive labeling, polyp segmentation models are plagued by data shortages. To tackle this, we propose the mixed supervised polyp segmentation paradigm (MixPolyp). Unlike traditional models relying on a single type of annotation, MixPolyp combines diverse annotation types (mask, box, and scribble) within a single model, thereby expanding the range of available data and reducing labeling costs. To achieve this, MixPolyp introduces three novel supervision losses to handle various annotations: Subspace Projection loss (L_SP), Binary Minimum Entropy loss (L_BME), and Linear Regularization loss (L_LR). For box annotations, L_SP eliminates shape inconsistencies between the prediction and the supervision. For scribble annotations, L_BME provides supervision for unlabeled pixels through minimum entropy constraint, thereby alleviating supervision sparsity. Furthermore, L_LR provides dense supervision by enforcing consistency among the predictions, thus reducing the non-uniqueness. These losses are independent of the model structure, making them generally applicable. They are used only during training, adding no computational cost during inference. Extensive experiments on five datasets demonstrate MixPolyp’s effectiveness.
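
The minimum-entropy idea behind L_BME can be illustrated with a small sketch. This is not the paper's formulation, which the abstract does not spell out; it only assumes that "minimum entropy constraint" means penalizing the binary entropy of the predicted foreground probability at pixels the scribble leaves unlabeled. The function name `binary_entropy_loss` and its arguments are hypothetical.

```python
import math


def binary_entropy_loss(probs, labeled_mask, eps=1e-7):
    """Mean binary entropy over pixels that carry no scribble label.

    probs        -- flat list of predicted foreground probabilities in [0, 1]
    labeled_mask -- flat list of bools, True where the scribble labels the pixel
    Both arguments are hypothetical stand-ins for a model's output.
    """
    total, count = 0.0, 0
    for p, labeled in zip(probs, labeled_mask):
        if labeled:
            continue  # labeled pixels would be handled by an ordinary supervised loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0
```

Under this assumption the loss is largest when an unlabeled pixel is predicted at 0.5 and approaches zero as the prediction becomes confident in either direction.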

[CV-54] MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features

Link: https://arxiv.org/abs/2409.16765
Authors: Katharina Anderer,Andreas Reich,Matthias Wölfel
Keywords: paper presents, presents a benchmark, benchmark dataset, dataset for aligning, multimodal algorithm leveraging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:This paper presents a benchmark dataset for aligning lecture videos with corresponding slides and introduces a novel multimodal algorithm leveraging features from speech, text, and images. It achieves an average accuracy of 0.82 in comparison to SIFT (0.56) while being approximately 11 times faster. Using dynamic programming the algorithm tries to determine the optimal slide sequence. The results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to a high matching accuracy, followed by image features. The findings highlight that audio transcripts alone provide valuable information for alignment and are beneficial if OCR data is lacking. Variations in matching accuracy across different lectures highlight the challenges associated with video quality and lecture style. The novel multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
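
The dynamic-programming step with a slide-transition penalty can be sketched as follows. This is a minimal illustration rather than the authors' algorithm: the function name `align_slides`, the precomputed similarity matrix `sim`, and the single fixed `jump_penalty` are all assumptions made for the example.

```python
def align_slides(sim, jump_penalty=0.5):
    """Choose one slide per video frame by Viterbi-style dynamic programming.

    sim[t][s] is a hypothetical similarity score between frame t and slide s.
    The returned path maximizes the summed similarity minus `jump_penalty`
    for every change of slide between consecutive frames.
    """
    n_slides = len(sim[0])
    best = list(sim[0])  # best[s]: score of the best path ending in slide s
    back = []            # back[t-1][s]: slide chosen at frame t-1 on that path
    for t in range(1, len(sim)):
        # The best slide to switch away from is the overall best previous one.
        top = max(range(n_slides), key=lambda s: best[s])
        new_best, pointers = [], []
        for s in range(n_slides):
            stay = best[s]
            switch = best[top] - jump_penalty
            if top != s and switch > stay:
                prev, score = top, switch
            else:
                prev, score = s, stay
            new_best.append(score + sim[t][s])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Backtrack from the best final slide.
    s = max(range(n_slides), key=lambda k: best[k])
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return path
```

With a penalty of zero the path follows the per-frame best match, so a single noisy frame causes a spurious slide change; a positive penalty suppresses such changes, which is one way to read the reported finding that penalizing transitions increases accuracy.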

[CV-55] Statewide Visual Geolocalization in the Wild

Link: https://arxiv.org/abs/2409.16763
Authors: Florian Fervers,Sebastian Bullinger,Christoph Bodensteiner,Michael Arens,Rainer Stiefelhagen
Keywords: aerial reference imagery, state-sized search region, search region, reference imagery, work presents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work presents a method that is able to predict the geolocation of a street-view photo taken in the wild within a state-sized search region by matching against a database of aerial reference imagery. We partition the search region into geographical cells and train a model to map cells and corresponding photos into a joint embedding space that is used to perform retrieval at test time. The model utilizes aerial images for each cell at multiple levels-of-detail to provide sufficient information about the surrounding scene. We propose a novel layout of the search region with consistent cell resolutions that allows scaling to large geographical regions. Experiments demonstrate that the method successfully localizes 60.6% of all non-panoramic street-view photos uploaded to the crowd-sourcing platform Mapillary in the state of Massachusetts to within 50m of their ground-truth location. Source code is available at this https URL.
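
Retrieval in a joint embedding space can be sketched as a nearest-neighbor search over cell embeddings. The sketch below assumes cosine similarity as the matching score and a plain dictionary as the cell database; the abstract specifies neither, and the names `cosine` and `nearest_cell` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_cell(photo_embedding, cell_embeddings):
    """Return the id of the cell whose embedding best matches the photo.

    cell_embeddings maps a hypothetical cell id to its embedding vector;
    in practice both embeddings would come from the trained model.
    """
    return max(
        cell_embeddings,
        key=lambda cell_id: cosine(photo_embedding, cell_embeddings[cell_id]),
    )
```

A linear scan like this is only practical for small databases; a state-sized search region would presumably need an approximate nearest-neighbor index, which is outside the scope of this sketch.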

[CV-56] Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics

Link: https://arxiv.org/abs/2409.16756
Authors: Lukas Klein,Carsten T. Lüth,Udo Schlegel,Till J. Bungert,Mennatallah El-Assady,Paul F. Jäger
Keywords: rapidly growing domain, rapidly growing, growing domain, myriad of proposed, XAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Explainable AI (XAI) is a rapidly growing domain with a myriad of proposed methods as well as metrics aiming to evaluate their efficacy. However, current studies are often of limited scope, examining only a handful of XAI methods and ignoring underlying design parameters for performance, such as the model architecture or the nature of input data. Moreover, they often rely on one or a few metrics and neglect thorough validation, increasing the risk of selection bias and ignoring discrepancies among metrics. These shortcomings leave practitioners confused about which method to choose for their problem. In response, we introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics. We systematically incorporate vital design parameters like varied architectures and diverse input modalities, resulting in 7,560 examined combinations. Through LATEC, we showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme. Further, we comprehensively evaluate various XAI methods to assist practitioners in selecting appropriate methods aligning with their needs. Curiously, the emerging top-performing method, Expected Gradients, is not examined in any relevant related study. LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-)evaluation dataset.

[CV-57] Commonly Interesting Images ECCV2024

Link: https://arxiv.org/abs/2409.16736
Authors: Fitim Abdullahu,Helmut Grabner
Keywords: trigger emotions, image, recall memories, Abstract, interesting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:Images tell stories, trigger emotions, and let us recall memories – they make us think. Thus, they have the ability to attract and hold one’s attention, which is the definition of being “interesting”. Yet, the appeal of an image is highly subjective. Looking at the image of my son taking his first steps will always bring me back to this emotional moment, while it is just a blurry, quickly taken snapshot to most others. Preferences vary widely: some adore cats, others are dog enthusiasts, and a third group may not be fond of either. We argue that every image can be interesting to a particular observer under certain circumstances. This work particularly emphasizes subjective preferences. However, our analysis of 2.5k image collections from diverse users of the photo-sharing platform Flickr reveals that specific image characteristics make them commonly more interesting. For instance, images, including professionally taken landscapes, appeal broadly due to their aesthetic qualities. In contrast, subjectively interesting images, such as those depicting personal or niche community events, resonate on a more individual level, often evoking personal memories and emotions.

[CV-58] Non-stationary BERT: Exploring Augmented IMU Data For Robust Human Activity Recognition

Link: https://arxiv.org/abs/2409.16730
Authors: Ning Sun,Yufei Wang,Yuwei Zhang,Jixiang Wan,Shenyue Wang,Ping Liu,Xudong Zhang
Keywords: Human Activity Recognition, gained great attention, observe users’ daily, users’ daily activity, Activity Recognition
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users’ daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the deployment of HAR systems on mobile phones and to achieve user-specific activity recognition, we propose a novel lightweight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerometer and gyroscope data from the IMU. The network achieves state-of-the-art performance on various activity recognition datasets, and the data augmentation method demonstrates wide applicability.

[CV-59] EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models

Link: https://arxiv.org/abs/2409.16723
Authors: Jiacheng Zhang,Yang Jiao,Shaoxiang Chen,Jingjing Chen,Yu-Gang Jiang
Keywords: Large Language Models, Multimodal Large Language, referring visual prompts, sparked great research, great research interests
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Multimodal Large Language Models (MLLMs) have sparked great research interest owing to their exceptional content-reasoning and instruction-following capabilities. To effectively instruct an MLLM, in addition to conventional language expressions, the practice of referring to objects by painting with brushes on images has emerged as a prevalent tool (referred to as “referring visual prompts”) due to its efficacy in aligning the user’s intention with specific image regions. To accommodate the most common referring visual prompts, namely points, boxes, and masks, existing approaches first use specialized feature encoding modules to capture the semantics of the highlighted areas indicated by these prompts. These encoded region features are then adapted to MLLMs through fine-tuning on a meticulously curated multimodal instruction dataset. However, such designs suffer from architectural redundancy. Moreover, they struggle to generalize when encountering a diverse range of arbitrary referring visual prompts in real-life scenarios. To address these issues, we propose EAGLE, a novel MLLM that enables comprehension of arbitrary referring visual prompts with less training effort than existing approaches. Specifically, EAGLE maintains the innate format of the referring visual prompts as colored patches rendered on the given image when conducting instruction tuning. Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM, with the semantic comprehension of these regions originating from the MLLM itself. In addition, we propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM’s region-level comprehension from the specific formats of referring visual prompts. Extensive experiments demonstrate the effectiveness of our proposed method.

[CV-60] Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP2024

Link: https://arxiv.org/abs/2409.16718
Authors: Ming Li,Jike Zhong,Chenxin Li,Liuzhuozheng Li,Nie Lin,Masashi Sugiyama
Keywords: Recent advances, classic model fine-tuning, fine-tuning Vision-Language Models, prompt tuning, adapter tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: EMNLP 2024 Main Conference

Abstract:Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while classic fine-tuning of the model’s inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge, since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint and propose a new perspective: fine-tuning specific parameters instead of all of them will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose CLIPFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by fine-tuning only specific bias terms and normalization layers, CLIPFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at this https URL.
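
The idea of updating only bias terms and normalization layers can be sketched as a filter over parameter names. The abstract does not say which specific biases and layers the method selects, so the substring rule below is an assumption, and the function name `select_trainable` and the parameter names in the usage note are hypothetical.

```python
def select_trainable(param_names, keywords=("bias", "norm", "ln_")):
    """Return the parameter names that would stay trainable.

    A name is kept if it contains any keyword (case-insensitive); every other
    parameter would be frozen. The keyword list is an illustrative guess, not
    the selection rule used in the paper.
    """
    return [
        name
        for name in param_names
        if any(keyword in name.lower() for keyword in keywords)
    ]
```

For example, given the made-up names `text.layers.0.mlp.fc1.weight`, `text.layers.0.mlp.fc1.bias`, and `visual.ln_post.weight`, the filter keeps the last two. In PyTorch the same rule would typically be applied by iterating over `model.named_parameters()` and setting `requires_grad` accordingly.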

[CV-61] Pose-Guided Fine-Grained Sign Language Video Generation ECCV2024

Link: https://arxiv.org/abs/2409.16709
Authors: Tongkai Shi,Lianyu Hu,Fanhua Shang,Jichao Feng,Peidong Liu,Wei Feng
Keywords: Sign language, learning sign language, produce sign language, Sign language videos, sign language images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:Sign language videos are an important medium for spreading and learning sign language. However, most existing human image synthesis methods produce sign language images with details that are distorted, blurred, or structurally incorrect. They also produce sign language video frames with poor temporal consistency, with anomalies such as flickering and abrupt detail changes between the previous and next frames. To address these limitations, we propose a novel Pose-Guided Motion Model (PGMM) for generating fine-grained and motion-consistent sign language videos. First, we propose a new Coarse Motion Module (CMM), which completes the deformation of features by optical flow warping, thus transferring the motion of coarse-grained structures without changing the appearance. Second, we propose a new Pose Fusion Module (PFM), which guides the modal fusion of RGB and pose features, thus completing the fine-grained generation. Finally, we design a new metric, Temporal Consistency Difference (TCD), to quantitatively assess the degree of temporal consistency of a video by comparing the difference between the frames of the reconstructed video and the previous and next frames of the target video. Extensive qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in most benchmark tests, with visible improvements in details and temporal consistency.

[CV-62] Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation

Link: https://arxiv.org/abs/2409.16706
Authors: Youngwan Jin,Incheol Park,Hanbin Song,Hyeongjin Ju,Yagiz Nalcakan,Shiho Kim
Keywords: generating high-quality Near-Infrared, RGB inputs, Vision Foundation Model, paper proposes, high-quality Near-Infrared
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages,12 figures

Abstract:This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next’s advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.

[CV-63] Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model ECCV2024

Link: https://arxiv.org/abs/2409.16689
Authors: Shoma Iwai,Atsuki Osanai,Shunsuke Kitada,Shinichiro Omachi
Keywords: task to synthesize, synthesize a harmonious, Layout, harmonious layout, characterized by attributes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted by ECCV2024, Project Page: this https URL

Abstract:Layout generation is the task of synthesizing a harmonious layout with elements characterized by attributes such as category, position, and size. Human designers experiment with the placement and modification of elements to create aesthetic layouts. However, we observed that current discrete diffusion models (DDMs) struggle to correct inharmonious layouts after they have been generated. In this paper, we first provide novel insights into the layout sticking phenomenon in DDMs and then propose a simple yet effective layout-assessment module, Layout-Corrector, which works in conjunction with existing DDMs to address the layout sticking problem. We present a learning-based module capable of identifying inharmonious elements within layouts, considering overall layout harmony characterized by complex composition. During the generation process, Layout-Corrector evaluates the correctness of each token in the generated layout, reinitializing those with low scores to the ungenerated state. The DDM then uses the high-scored tokens as clues to regenerate the harmonized tokens. Layout-Corrector, tested on common benchmarks, consistently boosts layout-generation performance when used in conjunction with various state-of-the-art DDMs. Furthermore, our extensive analysis demonstrates that Layout-Corrector (1) successfully identifies erroneous tokens, (2) facilitates control over the fidelity-diversity trade-off, and (3) significantly mitigates the performance drop associated with fast sampling.
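
The correction step described above, resetting low-scoring tokens to the ungenerated state, can be sketched in a few lines. This is an illustration only: the sentinel `MASK`, the fixed `threshold`, and the function name `reset_low_score_tokens` are assumptions, and in the paper the scores come from a learned module rather than being supplied by hand.

```python
MASK = -1  # hypothetical id standing for the "ungenerated" (masked) token state


def reset_low_score_tokens(tokens, scores, threshold=0.5):
    """Return a copy of `tokens` with low-scoring entries reset to MASK.

    tokens -- list of discrete layout tokens (category, position, size, ...)
    scores -- per-token correctness scores, here assumed to lie in [0, 1]
    Tokens scoring at or above `threshold` are kept as clues for regeneration.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [
        token if score >= threshold else MASK
        for token, score in zip(tokens, scores)
    ]
```

The diffusion model would then regenerate only the `MASK` positions, conditioned on the tokens that were kept.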

[CV-64] Skyeyes: Ground Roaming using Aerial View Images

Link: https://arxiv.org/abs/2409.16685
Authors: Zhiyuan Gao,Wenbin Teng,Gonglin Chen,Jinsen Wu,Ningli Xu,Rongjun Qin,Andrew Feng,Yajie Zhao
Keywords: Integrating aerial imagery-based, creating detailed content, gaming enhances realism, Integrating aerial, ensuring real-time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view inputs, thereby creating a ground roaming experience. More specifically, we combine a 3D representation with a view consistent generation model, which ensures coherence between generated images. This method allows for the creation of geometrically consistent ground view images, even with large view gaps. The images maintain improved spatial-temporal coherence and realism, enhancing scene comprehension and visualization from aerial perspectives. To the best of our knowledge, there are no publicly available datasets that contain pairwise geo-aligned aerial and ground view imagery. Therefore, we build a large, synthetic, and geo-aligned dataset using Unreal Engine. Both qualitative and quantitative analyses on this synthetic dataset display superior results compared to other leading synthesis approaches. See the project page for more results: this https URL.

[CV-65] TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans ECCV

Link: https://arxiv.org/abs/2409.16666
Authors: Aggelina Chatziagapi,Bindita Chaudhuri,Amit Kumar,Rakesh Ranjan,Dimitris Samaras,Nikolaos Sarafianos
Keywords: dynamic neural radiance, neural radiance field, dynamic neural, neural radiance, body
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCVW 2024. Project page: this https URL

Abstract:We introduce a novel framework that learns a dynamic neural radiance field (NeRF) for full-body talking humans from monocular videos. Prior work represents only the body pose or the face. However, humans communicate with their full body, combining body pose, hand gestures, as well as facial expressions. In this work, we propose TalkinNeRF, a unified NeRF-based network that represents the holistic 4D human motion. Given a monocular video of a subject, we learn corresponding modules for the body, face, and hands, that are combined together to generate the final result. To capture complex finger articulation, we learn an additional deformation field for the hands. Our multi-identity representation enables simultaneous training for multiple subjects, as well as robust animation under completely unseen poses. It can also generalize to novel identities, given only a short video as input. We demonstrate state-of-the-art performance for animating full-body talking humans, with fine-grained hand articulation and facial expressions.

[CV-66] Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models ICRA2025

Link: https://arxiv.org/abs/2409.16663
Authors: Alexander Popov,Alperen Degirmenci,David Wehr,Shashank Hegde,Ryan Oldja,Alexey Kamenev,Bertrand Douillard,David Nistér,Urs Muller,Ruchi Bhargava,Stan Birchfield,Nikolai Smolyanskiy
Keywords: latent space generative, space generative world, generative world models, covariate shift problem, latent space
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 6 figures, for ICRA 2025 conference, for associated video file, see this https URL

Abstract:We propose the use of latent space generative world models to address the covariate shift problem in autonomous driving. A world model is a neural network capable of predicting an agent’s next state given past states and actions. By leveraging a world model during training, the driving policy effectively mitigates covariate shift without requiring an excessive amount of training data. During end-to-end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations, so that at runtime it can recover from perturbations outside the training distribution. Additionally, we introduce a novel transformer-based perception encoder that employs multi-view cross-attention and a learned scene query. We present qualitative and quantitative results, demonstrating significant improvements upon prior state of the art in closed-loop testing in the CARLA simulator, as well as showing the ability to handle perturbations in both CARLA and NVIDIA’s DRIVE Sim.

[CV-67] Progressive Representation Learning for Real-Time UAV Tracking IROS2024

Link: https://arxiv.org/abs/2409.16652
Authors: Changhong Fu,Xiang Lei,Haobo Zuo,Liangliang Yao,Guangze Zheng,Jia Pan
Keywords: unmanned aerial vehicles, significantly promoted autonomous, promoted autonomous applications, Visual object tracking, Visual object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Abstract:Visual object tracking has significantly promoted autonomous applications for unmanned aerial vehicles (UAVs). However, learning robust object representations for UAV tracking is especially challenging in complex dynamic environments, when confronted with aspect ratio change and occlusion. These challenges severely alter the original information of the object. To handle the above issues, this work proposes a novel progressive representation learning framework for UAV tracking, i.e., PRL-Track. Specifically, PRL-Track is divided into coarse representation learning and fine representation learning. For coarse representation learning, two innovative regulators, which rely on appearance and semantic information, are designed to mitigate appearance interference and capture semantic information. Furthermore, for fine representation learning, a new hierarchical modeling generator is developed to intertwine coarse object representations. Exhaustive experiments demonstrate that the proposed PRL-Track delivers exceptional performance on three authoritative UAV tracking benchmarks. Real-world tests indicate that the proposed PRL-Track realizes superior tracking performance with 42.6 frames per second on the typical UAV platform equipped with an edge smart camera. The code, model, and demo videos are available at this https URL.

[CV-68] Enhancing Nighttime UAV Tracking with Light Distribution Suppression

Link: https://arxiv.org/abs/2409.16631
Authors: Liangliang Yao,Changhong Fu,Yiheng Wang,Haobo Zuo,Kunhan Lu
Keywords: Visual object tracking, nighttime UAV tracking, unmanned aerial vehicles, boosted extensive intelligent, extensive intelligent applications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual object tracking has boosted extensive intelligent applications for unmanned aerial vehicles (UAVs). However, the state-of-the-art (SOTA) enhancers for nighttime UAV tracking always neglect the uneven light distribution in low-light images, inevitably leading to excessive enhancement in scenarios with complex illumination. To address these issues, this work proposes a novel enhancer, i.e., LDEnhancer, enhancing nighttime UAV tracking with light distribution suppression. Specifically, a novel image content refinement module is developed to decompose the light distribution information and image content information in the feature space, allowing for the targeted enhancement of the image content information. Then this work designs a new light distribution generation module to capture light distribution effectively. The features with light distribution information and image content information are fed into the different parameter estimation modules, respectively, for the parameter map prediction. Finally, leveraging two parameter maps, an innovative interweave iteration adjustment is proposed for the collaborative pixel-wise adjustment of low-light images. Additionally, a challenging nighttime UAV tracking dataset with uneven light distribution, namely NAT2024-2, is constructed to provide a comprehensive evaluation, which contains 40 challenging sequences with over 74K frames in total. Experimental results on the authoritative UAV benchmarks and the proposed NAT2024-2 demonstrate that LDEnhancer outperforms other SOTA low-light enhancers for nighttime UAV tracking. Furthermore, real-world tests on a typical UAV platform with an NVIDIA Orin NX confirm the practicality and efficiency of LDEnhancer. The code is available at this https URL.

[CV-69] Stochastic Subsampling With Average Pooling

Link: https://arxiv.org/abs/2409.16630
Authors: Bum Jun Kim,Sang Woo Kim
Keywords: deep neural networks, deep neural, higher generalization performance, neural networks, achieve higher generalization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures

Abstract:Regularization of deep neural networks has been an important issue to achieve higher generalization performance without overfitting problems. Although the popular method of Dropout provides a regularization effect, it causes inconsistent properties in the output, which may degrade the performance of deep neural networks. In this study, we propose a new module called stochastic average pooling, which incorporates Dropout-like stochasticity in pooling. We describe the properties of stochastic subsampling and average pooling and leverage them to design a module without any inconsistency problem. The stochastic average pooling achieves a regularization effect without any potential performance degradation due to the inconsistency issue and can easily be plugged into existing architectures of deep neural networks. Experiments demonstrate that replacing existing average pooling with stochastic average pooling yields consistent improvements across a variety of tasks, datasets, and models.
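
The abstract does not spell out how the module works, so the following is only a simplified illustration of the general idea of combining stochastic subsampling with average pooling, not the authors' implementation. It assumes a 1D input, non-overlapping windows, and a hypothetical `keep` parameter for how many elements are sampled per window during training. It demonstrates one property only: the mean of a uniformly sampled subset matches the full window mean in expectation, so training and evaluation outputs agree on average.

```python
import numpy as np

def stochastic_average_pool_1d(x, window, keep, rng=None, training=True):
    """Average-pool non-overlapping windows of a 1D array.

    In training mode, each window is averaged over `keep` randomly chosen
    elements (sampled without replacement). In evaluation mode, the plain
    average over the full window is returned.
    NOTE: `keep` and this whole formulation are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    windows = x.reshape(-1, window)  # len(x) must be divisible by window
    if not training:
        return windows.mean(axis=1)
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(len(windows))
    for i, w in enumerate(windows):
        idx = rng.choice(window, size=keep, replace=False)
        out[i] = w[idx].mean()
    return out

signal = np.arange(1.0, 9.0)  # [1, 2, ..., 8]
eval_out = stochastic_average_pool_1d(signal, window=4, keep=2, training=False)

# Averaging many stochastic passes approaches the deterministic output.
rng = np.random.default_rng(0)
train_mean = np.mean(
    [stochastic_average_pool_1d(signal, 4, 2, rng=rng) for _ in range(20000)],
    axis=0,
)
```

The paper's notion of consistency may cover more than the mean (for example, higher moments of the output), which this sketch does not address.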

[CV-70] DeformStream: Deformation-based Adaptive Volumetric Video Streaming

Link: https://arxiv.org/abs/2409.16615
Authors: Boyan Li,Yongting Chen,Dayou Zhang,Fangxin Wang
Keywords: Volumetric video streaming, faces significant challenges, significant challenges due, Adaptive Volumetric Video, streaming offers immersive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Volumetric video streaming offers immersive 3D experiences but faces significant challenges due to high bandwidth requirements and latency issues in transmitting detailed content in real time. Traditional methods like point cloud streaming compromise visual quality when zoomed in, and neural rendering techniques are too computationally intensive for real-time use. Though mesh-based streaming stands out by preserving surface detail and connectivity, offering a more refined representation for 3D content, traditional mesh streaming methods typically transmit data on a per-frame basis, failing to take full advantage of temporal redundancies across frames. This results in inefficient bandwidth usage and poor adaptability to fluctuating network conditions. We introduce Deformation-based Adaptive Volumetric Video Streaming, a novel framework that enhances volumetric video streaming performance by leveraging the inherent deformability of mesh-based representations. DeformStream uses embedded deformation to reconstruct subsequent frames from inter-frame motion, significantly reducing bandwidth usage while ensuring visual coherence between frames. To address frame reconstruction overhead and network adaptability, we formulate a new QoE model that accounts for client-side deformation latency and design a dynamic programming algorithm to optimize the trade-off between visual quality and bandwidth consumption under varying network conditions. Our evaluation demonstrates that Deformation-based Adaptive Volumetric Video Streaming outperforms existing mesh-based streaming systems in both bandwidth efficiency and visual quality, offering a robust solution for real-time volumetric video applications.

[CV-71] Semi-LLIE: Semi-supervised Contrastive Learning with Mamba-based Low-light Image Enhancement

Link: https://arxiv.org/abs/2409.16604
Authors: Guanlin Li,Ke Zhang,Ting Wang,Ming Li,Bin Zhao,Xuelong Li
Keywords: impressive advancements made, low-light image enhancement, recent low-light image, image enhancement, impressive advancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the impressive advancements made in recent low-light image enhancement techniques, the scarcity of paired data has emerged as a significant obstacle to further advancements. This work proposes a mean-teacher-based semi-supervised low-light enhancement (Semi-LLIE) framework that integrates the unpaired data into model training. The mean-teacher technique is a prominent semi-supervised learning method, successfully adopted for addressing high-level and low-level vision tasks. However, two primary issues hinder the naive mean-teacher method from attaining optimal performance in low-light image enhancement. Firstly, pixel-wise consistency loss is insufficient for transferring realistic illumination distribution from the teacher to the student model, which results in color cast in the enhanced images. Secondly, cutting-edge image enhancement approaches fail to effectively cooperate with the mean-teacher framework to restore detailed information in dark areas due to their tendency to overlook modeling structured information within local regions. To mitigate the above issues, we first introduce a semantic-aware contrastive loss to faithfully transfer the illumination distribution, contributing to enhancing images with natural colors. Then, we design a Mamba-based low-light image enhancement backbone to effectively enhance Mamba’s local region pixel relationship representation ability with a multi-scale feature learning scheme, facilitating the generation of images with rich textural details. Further, we propose a novel perceptive loss based on the large-scale vision-language Recognize Anything Model (RAM) to help generate enhanced images with richer textual details. The experimental results indicate that our Semi-LLIE surpasses existing methods in both quantitative and qualitative metrics.
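
For readers unfamiliar with the mean-teacher technique mentioned in this abstract, the sketch below shows its standard core step: the teacher's weights are an exponential moving average (EMA) of the student's weights. This is the generic mechanism only. The dictionary-of-arrays weight format, the `ema_update` name, and the decay value are assumptions for illustration, and none of Semi-LLIE's losses or its backbone are reproduced here.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` are dicts mapping parameter names to arrays
    (a hypothetical format chosen to keep the example self-contained).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy weights, purely illustrative.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}

# With decay=0.9 the teacher moves 10% of the way toward the student.
teacher = ema_update(teacher, student, decay=0.9)
```

In a typical training loop, only the student is updated by gradient descent, and `ema_update` is called after each optimizer step so that the teacher changes slowly and provides more stable targets.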

[CV-72] FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation ECCV2024

Link: https://arxiv.org/abs/2409.16600
Authors: Jingyi Tang,Gu Wang,Zeyu Chen,Shengquan Li,Xiu Li,Xiangyang Ji
Keywords: achieved great success, remains challenging due, objects remains challenging, complex underwater environment, great success
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV 2024

Abstract:Although methods for estimating the pose of objects in indoor scenes have achieved great success, the pose estimation of underwater objects remains challenging due to difficulties brought by the complex underwater environment, such as degraded illumination, blurring, and the substantial cost of obtaining real annotations. In response, we introduce FAFA, a Frequency-Aware Flow-Aided self-supervised framework for 6D pose estimation of unmanned underwater vehicles (UUVs). Essentially, we first train a frequency-aware flow-based pose estimator on synthetic data, where an FFT-based augmentation approach is proposed to facilitate the network in capturing domain-invariant features and target domain styles from a frequency perspective. Further, we perform self-supervised training by enforcing flow-aided multi-level consistencies to adapt it to the real-world underwater environment. Our framework relies solely on the 3D model and RGB images, alleviating the need for any real pose annotations or other-modality data like depths. We evaluate the effectiveness of FAFA on common underwater object pose benchmarks and showcase significant performance improvements compared to state-of-the-art methods. Code is available at this http URL.

[CV-73] EventHallusion: Diagnosing Event Hallucinations in Video LLMs

Link: https://arxiv.org/abs/2409.16597
Authors: Jiacheng Zhang,Yang Jiao,Shaoxiang Chen,Jingjing Chen,Yu-Gang Jiang
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, made significant progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. Despite the remarkable content reasoning and instruction-following capabilities they have demonstrated, the hallucination problem of these VideoLLMs is less explored than its counterpart in the image domain. To bridge this gap, we first propose EventHallusion, a novel benchmark that focuses on assessing the VideoLLMs’ hallucination phenomenon on video event comprehension. Based on the observation that existing VideoLLMs are entangled with the priors stemming from their foundation models, our EventHallusion is curated by meticulously collecting videos and annotating questions to intentionally mislead the VideoLLMs into interpreting events based on these priors rather than accurately understanding the video content. On the other hand, we also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs. The proposed TCD suppresses the model’s preference toward its priors by comparing the original video with a constructed counterpart, whose temporal cues are disrupted, during the autoregressive decoding stage. Through comprehensive evaluation of eight open-source and two closed-source VideoLLMs on the proposed EventHallusion benchmark, we find that the open-source models suffer significantly from hallucination problems, whereas the closed-source models perform markedly better. By further equipping open-source VideoLLMs with the proposed TCD approach, evident performance improvements are achieved across most metrics in the EventHallusion benchmark. Our code and benchmark data are available at this https URL.
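
The abstract describes TCD as comparing the original video with a temporally disrupted counterpart during decoding, but it does not give the scoring rule. The sketch below uses the generic contrastive-decoding form, `(1 + alpha) * log p_original - alpha * log p_disrupted`, as a stand-in. The function names, the `alpha` parameter, and the toy logits are all assumptions for illustration and may differ from the paper's exact formulation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def contrastive_next_token(logits_original, logits_disrupted, alpha=1.0):
    """Choose the next token after down-weighting tokens that the
    disrupted-video branch also favors (i.e. prior-driven tokens)."""
    lp_orig = log_softmax(np.asarray(logits_original, dtype=float))
    lp_disr = log_softmax(np.asarray(logits_disrupted, dtype=float))
    scores = (1.0 + alpha) * lp_orig - alpha * lp_disr
    return int(np.argmax(scores)), scores

# Toy vocabulary of 3 tokens. Token 0 is favored with or without temporal
# cues (prior-driven); token 1 is favored only when temporal cues are intact.
logits_original = [2.0, 1.9, 0.0]
logits_disrupted = [2.0, 0.5, 0.0]

greedy_token = int(np.argmax(logits_original))
tcd_token, _ = contrastive_next_token(logits_original, logits_disrupted)
```

In this toy case, plain greedy decoding picks token 0, while the contrastive score picks token 1, the token whose support depends on the undisrupted temporal information.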

[CV-74] SelectiveKD: A semi-supervised framework for cancer detection in DBT through Knowledge Distillation and Pseudo-labeling

Link: https://arxiv.org/abs/2409.16581
Authors: Laurent Dillard,Hyeonsoo Lee,Weonsuk Lee,Tae Soo Kim,Ali Diba,Thijs Kooi
Keywords: Digital Breast Tomosynthesis, developing Computer Aided, Breast Tomosynthesis, Digital Breast, Computer Aided Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, 1 table. MSC classes: 68T45, 92C55. ACM classes: I.4.9; I.5.4

Abstract:When developing Computer Aided Detection (CAD) systems for Digital Breast Tomosynthesis (DBT), the complexity arising from the volumetric nature of the modality poses significant technical challenges for obtaining large-scale accurate annotations. Without access to large-scale annotations, the resulting model may not generalize to different domains. Given the costly nature of obtaining DBT annotations, how to effectively increase the amount of data used for training DBT CAD systems remains an open challenge. In this paper, we present SelectiveKD, a semi-supervised learning framework for building cancer detection models for DBT, which only requires a limited number of annotated slices to reach high performance. We achieve this by utilizing unlabeled slices available in a DBT stack through a knowledge distillation framework in which the teacher model provides a supervisory signal to the student model for all slices in the DBT volume. Our framework mitigates the potential noise in the supervisory signal from a sub-optimal teacher by implementing a selective dataset expansion strategy using pseudo labels. We evaluate our approach with a large-scale real-world dataset of over 10,000 DBT exams collected from multiple device manufacturers and locations. The resulting SelectiveKD process effectively utilizes unannotated slices from a DBT stack, leading to significantly improved cancer classification performance (AUC) and generalization performance.

[CV-75] FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

Link: https://arxiv.org/abs/2409.16578
Authors: Jiaheng Hu,Rose Hendrix,Ali Farhadi,Aniruddha Kembhavi,Roberto Martin-Martin,Peter Stone,Kuo-Hao Zeng,Kiana Ehsani
Keywords: multi-task Behavior Cloning, Robotics field, building generalist robot, Behavior Cloning, recent years
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployments of these policies have led to unsatisfactory performance, where the policy struggles with unseen states and tasks. How can we break through the performance plateau of these models and elevate their capabilities to new heights? In this paper, we propose FLaRe, a large-scale Reinforcement Learning fine-tuning framework that integrates robust pre-trained representations, large-scale training, and gradient stabilization techniques. Our method aligns pre-trained policies towards task completion, achieving state-of-the-art (SoTA) performance both on previously demonstrated and on entirely novel tasks and embodiments. Specifically, on a set of long-horizon mobile manipulation tasks, FLaRe achieves an average success rate of 79.5% in unseen environments, with absolute improvements of +23.6% in simulation and +30.7% on real robots over prior SoTA methods. By utilizing only sparse rewards, our approach can enable generalizing to new capabilities beyond the pretraining data with minimal human effort. Moreover, we demonstrate rapid adaptation to new embodiments and behaviors with less than a day of fine-tuning. Videos can be found on the project website at this https URL

[CV-76] Source-Free Domain Adaptation for YOLO Object Detection ECCV2024

Link: https://arxiv.org/abs/2409.16538
Authors: Simon Varailhon,Masih Aminbeidokhti,Marco Pedersoli,Eric Granger
Keywords: efficiency reasons, object detection, privacy and efficiency, Source-free domain adaptation, proposed SFDA method
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ECCV 2024: European Conference on Computer Vision - Workshop on Out-of-Distribution Generalization in Computer Vision Foundation Models, Milan Italy

Abstract:Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new target domain without using any source domain data for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector that is known to have high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors known for their fast baselines and practical applications. Our proposed SFDA method - Source-Free YOLO (SF-YOLO) - relies on a teacher-student framework in which the student receives images with a learned, target domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline of accuracy due to noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize the training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, even sometimes outperforming methods that use source data for adaptation.

[CV-77] Prompt Sliders for Fine-Grained Control, Editing and Erasing of Concepts in Diffusion Models ECCV’24

Link: https://arxiv.org/abs/2409.16535
Authors: Deepak Sridhar,Nuno Vasconcelos
Keywords: recently surpassed GANs, offering superior image, superior image quality, offering superior, quality and diversity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ECCV’24 - Unlearning and Model Editing Workshop. Code: this https URL

Abstract:Diffusion models have recently surpassed GANs in image synthesis and editing, offering superior image quality and diversity. However, achieving precise control over attributes in generated images remains a challenge. Concept Sliders introduced a method for fine-grained image control and editing by learning concepts (attributes/objects). However, this approach adds parameters and increases inference time due to the loading and unloading of Low-Rank Adapters (LoRAs) used for learning concepts. These adapters are model-specific and require retraining for different architectures, such as Stable Diffusion (SD) v1.5 and SD-XL. In this paper, we propose a straightforward textual inversion method to learn concepts through text embeddings, which are generalizable across models that share the same text encoder, including different versions of the SD model. We refer to our method as Prompt Sliders. Besides learning new concepts, we also show that Prompt Sliders can be used to erase undesirable concepts such as artistic styles or mature content. Our method is 30% faster than using LoRAs because it eliminates the need to load and unload adapters and introduces no additional parameters aside from the target concept text embedding. Each concept embedding only requires 3KB of storage compared to the 8922KB or more required for each LoRA adapter, making our approach more computationally efficient. Project Page: this https URL
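
The 3KB figure in this abstract can be sanity-checked with a short calculation. The sketch assumes the concept is stored as a single token embedding of width 768 (the hidden size of the CLIP ViT-L/14 text encoder used by SD v1.5) in 32-bit floats. Both assumptions are ours, since the abstract does not state the embedding size or precision.

```python
# Assumed: one token embedding, width 768 (CLIP ViT-L/14 text encoder, SD v1.5).
EMBED_DIM = 768
# Assumed: stored as 32-bit floats.
BYTES_PER_FLOAT32 = 4

embedding_kb = EMBED_DIM * BYTES_PER_FLOAT32 / 1024  # size of one embedding in KB
lora_kb = 8922                                       # LoRA size quoted in the abstract
ratio = lora_kb / embedding_kb                       # how many times larger the LoRA is

print(embedding_kb)  # 3.0
print(round(ratio))  # 2974
```

Under these assumptions, one concept embedding occupies exactly 3KB, which is consistent with the abstract, and a single LoRA adapter is roughly three thousand times larger.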

[CV-78] Low Latency Point Cloud Rendering with Learned Splatting CVPR2024

Link: https://arxiv.org/abs/2409.16504
Authors: Yueyu Hu,Ran Gong,Qi Sun,Yao Wang
Keywords: emerging applications, Point cloud, Point, point cloud rendering, rendering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at CVPR 2024 Workshop on AIS: Vision, Graphics and AI for Streaming (this https URL)

Abstract:Point clouds are a critical 3D representation with many emerging applications. Because of point sparsity and irregularity, high-quality rendering of point clouds is challenging and often requires complex computations to recover the continuous surface representation. On the other hand, to avoid visual discomfort, the motion-to-photon latency has to be very short, under 10 ms. Existing rendering solutions fall short in either quality or speed. To tackle these challenges, we present a framework that unlocks interactive, free-viewing and high-fidelity point cloud rendering. We train a generic neural network to estimate 3D elliptical Gaussians from arbitrary point clouds and use differentiable surface splatting to render smooth texture and surface normal for arbitrary views. Our approach does not require per-scene optimization and enables real-time rendering of dynamic point clouds. Experimental results demonstrate that the proposed solution enjoys superior visual quality and speed, as well as generalizability to different scene content and robustness to compression artifacts. The code is available at this https URL.

[CV-79] GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization

Link: https://arxiv.org/abs/2409.16502
Authors: Gennady Sidorov,Malik Mohrat,Ksenia Lebedeva,Ruslan Rakhimov,Sergey Kolyubin
Keywords: extensive optimization requirements, high memory consumption, localization approaches exist, visual localization approaches, approaches exist
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project website at this https URL

Abstract:Although various visual localization approaches exist, such as scene coordinate and pose regression, these methods often struggle with high memory consumption or extensive optimization requirements. To address these challenges, we utilize recent advancements in novel view synthesis, particularly 3D Gaussian Splatting (3DGS), to enhance localization. 3DGS allows for the compact encoding of both 3D geometry and scene appearance with its spatial features. Our method leverages the dense description maps produced by XFeat’s lightweight keypoint detection and description model. We propose distilling these dense keypoint descriptors into 3DGS to improve the model’s spatial understanding, leading to more accurate camera pose predictions through 2D-3D correspondences. After estimating an initial pose, we refine it using a photometric warping loss. Benchmarking on popular indoor and outdoor datasets shows that our approach surpasses state-of-the-art Neural Render Pose (NRP) methods, including NeRFMatch and PNeRFLoc.

[CV-80] Real-Time Detection of Electronic Components in Waste Printed Circuit Boards: A Transformer-Based Approach

Link: https://arxiv.org/abs/2409.16496
Authors: Muhammad Mohsin,Stefano Rovetta,Francesco Masulli,Alberto Cabri
Keywords: Critical Raw Materials, Critical Raw, Raw Materials, Printed Circuit Boards, Waste Printed Circuit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Applications in Electronics Pervading Industry, Environment and Society (ApplePies2024). Proceedings are published in the Springer Lecture Notes in Electrical Engineering

Abstract:Critical Raw Materials (CRMs) such as copper, manganese, gallium, and various rare earths have great importance for the electronic industry. To increase the concentration of individual CRMs and thus make their extraction from Waste Printed Circuit Boards (WPCBs) convenient, we have proposed a practical approach that involves selective disassembling of the different types of electronic components from WPCBs using mechatronic systems guided by artificial vision techniques. In this paper we evaluate the real-time accuracy of electronic component detection and localization of the Real-Time DEtection TRansformer model architecture. Transformers have recently become very popular for the extraordinary results obtained in natural language processing and machine translation. Also in this case, the transformer model achieves very good performance, often superior to that of the latest state-of-the-art object detection and localization models, YOLOv8 and YOLOv9.

[CV-81] A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Link: https://arxiv.org/abs/2409.16494
Authors: Yue Chang,Liqiang Jing,Xiaopeng Zhang,Yue Zhang
Keywords: Large Vision-Language Models, problem for Large, Large Vision-Language, difficult to eradicate, common problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by TMLR

Abstract:Hallucination is a common problem for Large Vision-Language Models (LVLMs) with long generations, and it is difficult to eradicate. A generation with hallucinations is partially inconsistent with the image content. To mitigate hallucination, current studies focus on either the process of model inference or the results of model generation, but the solutions they design sometimes do not deal appropriately with various types of queries and the hallucinations in the generations for these queries. To accurately deal with various hallucinations, we present a unified framework, Dentist, for hallucination mitigation. The core step is to first classify the queries and then perform different hallucination mitigation processes based on the classification result, just as a dentist first observes the teeth and then makes a plan. In a simple deployment, Dentist can classify queries as perception or reasoning and easily mitigate potential hallucinations in answers, as demonstrated in our experiments. On MMBench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, over the baseline InstructBLIP/LLaVA/VisualGLM.

[CV-82] Proactive Schemes: A Survey of Adversarial Attacks for Social Good

Link: https://arxiv.org/abs/2409.16491
Authors: Vishal Asnani,Xi Yin,Xiaoming Liu
Keywords: introducing subtle perturbations, machine learning models, deep learning, predictions or classifications, introducing subtle
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted for review

Abstract:Adversarial attacks in computer vision exploit the vulnerabilities of machine learning models by introducing subtle perturbations to input data, often leading to incorrect predictions or classifications. These attacks have evolved in sophistication with the advent of deep learning, presenting significant challenges in critical applications, which can be harmful for society. However, there is also a rich line of research from a transformative perspective that leverages adversarial techniques for social good. Specifically, we examine the rise of proactive schemes: methods that encrypt input data using additional signals, termed templates, to enhance the performance of deep learning models. By embedding these imperceptible templates into digital media, proactive schemes are applied across various applications, from simple image enhancements to complicated deep learning frameworks to aid performance, as compared to the passive schemes, which don’t change the input data distribution for their framework. The survey delves into the methodologies behind these proactive schemes, the encryption and learning processes, and their application to modern computer vision and natural language processing applications. Additionally, it discusses the challenges, potential vulnerabilities, and future directions for proactive schemes, ultimately highlighting their potential to foster the responsible and secure advancement of deep learning technologies.

[CV-83] Frequency-based View Selection in Gaussian Splatting Reconstruction

Link: https://arxiv.org/abs/2409.16470
Authors: Monica M.Q. Li,Pierre-Yves Lajoie,Giovanni Beltrame
Keywords: Gaussian Splatting reconstructions, Gaussian Splatting, Three-dimensional reconstruction, robotics perception, fundamental problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 4 figures

Abstract:Three-dimensional reconstruction is a fundamental problem in robotics perception. We examine the problem of active view selection to perform 3D Gaussian Splatting reconstructions with as few input images as possible. Although 3D Gaussian Splatting has made significant progress in image rendering and 3D reconstruction, the quality of the reconstruction is strongly impacted by the selection of 2D images and the estimation of camera poses through Structure-from-Motion (SfM) algorithms. Current methods to select views that rely on uncertainties from occlusions, depth ambiguities, or neural network predictions directly are insufficient to handle the issue and struggle to generalize to new scenes. By ranking the potential views in the frequency domain, we are able to effectively estimate the potential information gain of new viewpoints without ground truth data. By overcoming current constraints on model architecture and efficacy, our method achieves state-of-the-art results in view selection, demonstrating its potential for efficient image-based 3D reconstruction.

[CV-84] Initialization of Monocular Visual Navigation for Autonomous Agents Using Modified Structure from Small Motion

Link: https://arxiv.org/abs/2409.16465
Authors: Juan-Diego Florez, Mehregan Dor, Panagiotis Tsiotras
Keywords: visual Simultaneous Localization, Localization and Mapping, Simultaneous Localization, monocular visual Simultaneous, robots in space
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 1 page for references, 6 figures, 1 table, IEEEtran format. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:We propose a standalone monocular visual Simultaneous Localization and Mapping (vSLAM) initialization pipeline for autonomous robots in space. Our method, a state-of-the-art factor graph optimization pipeline, enhances classical Structure from Small Motion (SfSM) to robustly initialize a monocular agent in weak-perspective projection scenes. Furthermore, it overcomes visual estimation challenges introduced by spacecraft inspection trajectories, such as: center-pointing motion, which exacerbates the bas-relief ambiguity, and the presence of a dominant plane in the scene, which causes motion estimation degeneracies in classical Structure from Motion (SfM). We validate our method on realistic, simulated satellite inspection images exhibiting weak-perspective projection, and we demonstrate its effectiveness and improved performance compared to other monocular initialization procedures.

[CV-85] Underground Mapping and Localization Based on Ground-Penetrating Radar

Link: https://arxiv.org/abs/2409.16446
Authors: Jinchang Zhang, Guoyu Lu
Keywords: gained increasing attention, recent years, gained increasing, increasing attention, attention in recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D object reconstruction based on deep neural networks has gained increasing attention in recent years. However, 3D reconstruction of underground objects to generate point cloud maps remains a challenge. Ground Penetrating Radar (GPR) is one of the most powerful and extensively used tools for detecting and locating underground objects such as plant root systems and pipelines, with its cost-effectiveness and continuously evolving technology. This paper introduces a parabolic signal detection network based on deep convolutional neural networks, utilizing B-scan images from GPR sensors. The detected keypoints can aid in accurately fitting parabolic curves used to interpret the original GPR B-scan images as cross-sections of the object model. Additionally, a multi-task point cloud network was designed to perform both point cloud segmentation and completion simultaneously, filling in sparse point cloud maps. For unknown locations, GPR A-scan data can be used to match corresponding A-scan data in the constructed map, pinpointing the position to verify the accuracy of the map construction by the model. Experimental results demonstrate the effectiveness of our method.

[CV-86] Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Transfer Learning (PETL) in Visual Recognition

Link: https://arxiv.org/abs/2409.16434
Authors: Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Li Zhang, Wei-Lun Chao
Keywords: Parameter-efficient transfer learning, attracted significant attention, Parameter-efficient transfer, PETL, transfer learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:Parameter-efficient transfer learning (PETL) has attracted significant attention lately, due to the increasing size of pre-trained models and the need to fine-tune (FT) them for superior downstream performance. This community-wide enthusiasm has sparked a plethora of new methods. Nevertheless, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like when to apply PETL and which method to use largely unanswered. In this paper, we conduct a unifying empirical study of representative PETL methods in the context of Vision Transformers. We systematically tune their hyper-parameters to fairly compare their accuracy on downstream tasks. Our study not only offers a valuable user guide but also unveils several new insights. First, if tuned carefully, different PETL methods can obtain quite similar accuracy in the low-shot benchmark VTAB-1K. This includes simple methods like FT the bias terms that were reported inferior. Second, though with similar accuracy, we find that PETL methods make different mistakes and high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementariness) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PETL is also useful in many-shot regimes – it achieves comparable and sometimes better accuracy than full FT, using much fewer learnable parameters. Last but not least, we investigate PETL’s ability to preserve a pre-trained model’s robustness to distribution shifts (e.g., a CLIP backbone). Perhaps not surprisingly, PETL methods outperform full FT alone. However, with weight-space ensembles, the fully FT model can achieve a better balance between downstream and out-of-distribution performance, suggesting a future research direction for PETL.
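
The closing remark about weight-space ensembles can be illustrated with a small sketch. It assumes that "weight-space ensemble" means a linear interpolation between the pre-trained and the fully fine-tuned weights, which the abstract does not spell out. The function names, the toy parameter dictionaries, and the mixing coefficient `alpha` are hypothetical and are not taken from the paper's code. The second helper sketches the bias-only baseline mentioned above, assuming bias parameters have names ending in `.bias`.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Blend two models with identical parameter names, entry by entry.

    alpha = 0.0 returns the pre-trained weights, alpha = 1.0 the fine-tuned
    ones. Parameters are plain lists of floats here (an assumption made to
    keep the sketch free of framework dependencies).
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must share the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    blended = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * a + alpha * b for a, b in zip(w_pre, w_ft)]
    return blended


def bias_only_parameters(parameter_names):
    """Pick the parameters a bias-only fine-tuning run would update.

    Assumes the hypothetical naming convention '<layer>.bias'.
    """
    return [name for name in parameter_names if name.endswith(".bias")]
```

In such a setup one would typically sweep `alpha` over a few values and pick the point that balances downstream accuracy against out-of-distribution accuracy; the sweep itself is omitted here.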

[CV-87] Hand Gesture Classification Based on Forearm Ultrasound Video Snippets Using 3D Convolutional Neural Networks

Link: https://arxiv.org/abs/2409.16431
Authors: Keshav Bimbraw, Ankit Talele, Haichong K. Zhang
Keywords: human-machine interaction, hand movement estimation, crucial area, area of research, research with applications
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: Accepted to IUS 2024

Abstract:Ultrasound based hand movement estimation is a crucial area of research with applications in human-machine interaction. Forearm ultrasound offers detailed information about muscle morphology changes during hand movement which can be used to estimate hand gestures. Previous work has focused on analyzing 2-Dimensional (2D) ultrasound image frames using techniques such as convolutional neural networks (CNNs). However, such 2D techniques do not capture temporal features from segments of ultrasound data corresponding to continuous hand movements. This study uses 3D CNN based techniques to capture spatio-temporal patterns within ultrasound video segments for gesture recognition. We compared the performance of a 2D convolution-based network with (2+1)D convolution-based, 3D convolution-based, and our proposed network. Our methodology enhanced the gesture classification accuracy to 98.8 +/- 0.9%, from 96.5 +/- 2.3% compared to a network trained with 2D convolution layers. These results demonstrate the advantages of using ultrasound video snippets for improving hand gesture classification performance.

[CV-88] Leveraging Local Structure for Improving Model Explanations: An Information Propagation Approach

Link: https://arxiv.org/abs/2409.16429
Authors: Ruo Yang, Binghui Wang, Mustafa Bilgic
Keywords: deep neural network, Numerous explanation methods, Numerous explanation, neural network, recently developed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Numerous explanation methods have been recently developed to interpret the decisions made by deep neural network (DNN) models. For image classifiers, these methods typically provide an attribution score to each pixel in the image to quantify its contribution to the prediction. However, most of these explanation methods assign attribution scores to pixels independently, even though both humans and DNNs make decisions by analyzing a set of closely related pixels simultaneously. Hence, the attribution score of a pixel should be evaluated jointly by considering itself and its structurally-similar pixels. We propose a method called IProp, which models each pixel’s individual attribution score as a source of explanatory information and explains the image prediction through the dynamic propagation of information across all pixels. To formulate the information propagation, IProp adopts the Markov Reward Process, which guarantees convergence, and the final status indicates the desired pixels’ attribution scores. Furthermore, IProp is compatible with any existing attribution-based explanation method. Extensive experiments on various explanation methods and DNN models verify that IProp significantly improves them on a variety of interpretability metrics.
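
To show why a Markov Reward Process yields a convergent propagation, here is a generic sketch. It is not IProp itself: the abstract does not give the transition model or the discount, so the row-stochastic `transition` matrix (standing in for pixel similarity), the discount `gamma`, and the function name are all assumptions. The sketch iterates the standard value equation v = r + gamma * P v, which converges for any gamma below 1 because the update is a contraction.

```python
def propagate_scores(rewards, transition, gamma=0.5, tol=1e-10, max_iter=10_000):
    """Iterate v = r + gamma * P v until the values stop changing.

    rewards    -- per-pixel attribution scores from any base explainer
    transition -- row-stochastic matrix; entry [i][j] is the (assumed)
                  similarity-based weight of pixel j for pixel i
    gamma      -- discount in [0, 1); smaller values keep scores closer
                  to the original attributions
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1) for guaranteed convergence")
    n = len(rewards)
    values = list(rewards)
    for _ in range(max_iter):
        updated = [
            rewards[i] + gamma * sum(transition[i][j] * values[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(updated, values)) < tol:
            return updated
        values = updated
    return values
```

With two equally similar pixels and base scores of 1 and 0, the propagated scores move toward each other while preserving their order, which is the qualitative behaviour the abstract describes.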

[CV-89] Improving Intersession Reproducibility for Forearm Ultrasound based Hand Gesture Classification through an Incremental Learning Approach

Link: https://arxiv.org/abs/2409.16415
Authors: Keshav Bimbraw, Jack Rothenberg, Haichong K. Zhang
Keywords: human machine interfaces, fine tuning, developing human machine, machine interfaces, human machine
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to IUS 2024

Abstract:Ultrasound images of the forearm can be used to classify hand gestures towards developing human machine interfaces. In our previous work, we have demonstrated gesture classification using ultrasound on a single subject without removing the probe before evaluation. This has limitations in usage as once the probe is removed and replaced, the accuracy declines since the classifier performance is sensitive to the probe location on the arm. In this paper, we propose training a model on multiple data collection sessions to create a generalized model, utilizing incremental learning through fine tuning. Ultrasound data was acquired for 5 hand gestures within a session (without removing and putting the probe back on) and across sessions. A convolutional neural network (CNN) with 5 cascaded convolution layers was used for this study. A pre-trained CNN was fine tuned with the convolution blocks acting as a feature extractor, and the parameters of the remaining layers updated in an incremental fashion. Fine tuning was done using different session splits within a session and between multiple sessions. We found that incremental fine tuning can help enhance classification accuracy with more fine tuning sessions. After 2 fine tuning sessions for each experiment, we found an approximate 10% increase in classification accuracy. This work demonstrates that incremental learning through fine tuning on ultrasound based hand gesture classification can improve accuracy while saving storage, processing power, and time. It can be expanded to generalize between multiple subjects and towards developing personalized wearable devices.

[CV-90] Vision-based Xylem Wetness Classification in Stem Water Potential Determination

Link: https://arxiv.org/abs/2409.16412
Authors: Pamodya Peiris, Aritra Samanta, Caio Mucchiani, Cody Simons, Amit Roy-Chowdhury, Konstantinos Karydis
Keywords: making efficient management, overused in irrigation, efficient management, stem detection, Scholander Pressure Chamber
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Water is often overused in irrigation, making efficient management of it crucial. Precision Agriculture emphasizes tools like stem water potential (SWP) analysis for better plant status determination. However, such tools often require labor-intensive in-situ sampling. Automation and machine learning can streamline this process and enhance outcomes. This work focused on automating stem detection and xylem wetness classification using the Scholander Pressure Chamber, a widely used but demanding method for SWP measurement. The aim was to refine stem detection and develop computer-vision-based methods to better classify water emergence at the xylem. To this end, we collected and manually annotated video data, applying vision- and learning-based methods for detection and classification. Additionally, we explored data augmentation and fine-tuned parameters to identify the most effective models. The identified best-performing models for stem detection and xylem wetness classification were evaluated end-to-end over 20 SWP measurements. Learning-based stem detection via YOLOv8n combined with ResNet50-based classification achieved a Top-1 accuracy of 80.98%, making it the best-performing approach for xylem wetness classification.

[CV-91] Modern Hopfield Networks meet Encoded Neural Representations – Addressing Practical Considerations

Link: https://arxiv.org/abs/2409.16408
Authors: Satyananda Kashyap, Niharika S. D’Souza, Luyao Shi, Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood
Keywords: Modern Hopfield Networks, storage faces challenges, Content-addressable memories, human declarative memory, Modern Hopfield
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, 8 figures, workshop submission to NeurIPS

Abstract:Content-addressable memories such as Modern Hopfield Networks (MHN) have been studied as mathematical models of auto-association and storage/retrieval in the human declarative memory, yet their practical use for large-scale content storage faces challenges. Chief among them is the occurrence of meta-stable states, particularly when handling large amounts of high dimensional content. This paper introduces Hopfield Encoding Networks (HEN), a framework that integrates encoded neural representations into MHNs to improve pattern separability and reduce meta-stable states. We show that HEN can also be used for retrieval in the context of hetero association of images with natural language queries, thus removing the limitation of requiring access to partial content in the same domain. Experimental results demonstrate substantial reduction in meta-stable states and increased storage capacity while still enabling perfect recall of a significantly larger number of inputs advancing the practical utility of associative memory networks for real-world tasks.
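
For readers unfamiliar with Modern Hopfield Networks, the retrieval step can be sketched as follows. This shows only the widely used softmax update (new state = weighted sum of stored patterns, with weights softmax(beta * similarity)), not the HEN encoder introduced in the paper. The toy patterns, the inverse temperature `beta`, and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(patterns, query, beta=8.0, steps=3):
    """Run a few softmax-attention updates toward a stored pattern.

    patterns -- list of stored vectors (all the same length)
    query    -- possibly corrupted probe vector
    beta     -- inverse temperature; low values blend several patterns,
                which is one way meta-stable states show up
    """
    state = list(query)
    for _ in range(steps):
        scores = [
            beta * sum(p_k * s_k for p_k, s_k in zip(pattern, state))
            for pattern in patterns
        ]
        weights = softmax(scores)
        state = [
            sum(w * pattern[d] for w, pattern in zip(weights, patterns))
            for d in range(len(state))
        ]
    return state
```

Calling `hopfield_retrieve` with a stored pattern whose last entry has been flipped should return a vector close to the original pattern when `beta` is large; with a very small `beta` the result drifts toward the average of all stored patterns instead.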

[CV-92] Patch-Based Contrastive Learning and Memory Consolidation for Online Unsupervised Continual Learning

Link: https://arxiv.org/abs/2409.16391
Authors: Cameron Taylor, Vassilis Vassiliades, Constantine Dovrolis
Keywords: Online Unsupervised Continual, unexplored learning paradigm, Unsupervised Continual Learning, receives a non-stationary, number of classes
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Conference on Lifelong Learning Agents (COLLAS) 2024

Abstract:We focus on a relatively unexplored learning paradigm known as Online Unsupervised Continual Learning (O-UCL), where an agent receives a non-stationary, unlabeled data stream and progressively learns to identify an increasing number of classes. This paradigm is designed to model real-world applications where encountering novelty is the norm, such as exploring a terrain with several unknown and time-varying entities. Unlike prior work in unsupervised, continual, or online learning, O-UCL combines all three areas into a single challenging and realistic learning paradigm. In this setting, agents are frequently evaluated and must aim to maintain the best possible representation at any point of the data stream, rather than at the end of pre-specified offline tasks. The proposed approach, called Patch-based Contrastive learning and Memory Consolidation (PCMC), builds a compositional understanding of data by identifying and clustering patch-level features. Embeddings for these patch-level features are extracted with an encoder trained via patch-based contrastive learning. PCMC incorporates new data into its distribution while avoiding catastrophic forgetting, and it consolidates memory examples during "sleep" periods. We evaluate PCMC’s performance on streams created from the ImageNet and Places365 datasets. Additionally, we explore various versions of the PCMC algorithm and compare its performance against several existing methods and simple baselines.

[CV-93] Camera Calibration and Stereo via a Single Image of a Spherical Mirror

Link: https://arxiv.org/abs/2409.16386
Authors: Nissim Barzilay, Ofek Narinsky, Michael Werman
Keywords: technique for camera, view that incorporates, camera calibration, achieving precise calibration, single view
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures

Abstract:This paper presents a novel technique for camera calibration using a single view that incorporates a spherical mirror. Leveraging the distinct characteristics of the sphere’s contour visible in the image and its reflections, we showcase the effectiveness of our method in achieving precise calibration. Furthermore, the reflection from the mirrored surface provides additional information about the surrounding scene beyond the image frame. Our method paves the way for the development of simple catadioptric stereo systems. We explore the challenges and opportunities associated with employing a single mirrored sphere, highlighting the potential applications of this setup in practical scenarios. The paper delves into the intricacies of the geometry and calibration procedures involved in catadioptric stereo utilizing a spherical mirror. Experimental results, encompassing both synthetic and real-world data, are presented to illustrate the feasibility and accuracy of our approach.

[CV-94] Towards Synthetic Data Generation for Improved Pain Recognition in Videos under Patient Constraints

Link: https://arxiv.org/abs/2409.16382
Authors: Jonas Nasimzada, Jens Kleesiek, Ken Herrmann, Alina Roitberg, Constantin Seibold
Keywords: patient-computer interaction systems, improving patient-computer interaction, domain raises significant, raises significant ethical, traditional data collection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Pain Recognition Synthetic Data Video Analysis Privacy Preserving
ACM classes: J.3

Abstract:Recognizing pain in video is crucial for improving patient-computer interaction systems, yet traditional data collection in this domain raises significant ethical and logistical challenges. This study introduces a novel approach that leverages synthetic data to enhance video-based pain recognition models, providing an ethical and scalable alternative. We present a pipeline that synthesizes realistic 3D facial models by capturing nuanced facial movements from a small participant pool, and mapping these onto diverse synthetic avatars. This process generates 8,600 synthetic faces, accurately reflecting genuine pain expressions from varied angles and perspectives. Utilizing advanced facial capture techniques, and leveraging public datasets like CelebV-HQ and FFHQ-UV for demographic diversity, our new synthetic dataset significantly enhances model training while ensuring privacy by anonymizing identities through facial replacements. Experimental results demonstrate that models trained on combinations of synthetic data paired with a small amount of real participants achieve superior performance in pain recognition, effectively bridging the gap between synthetic simulations and real-world applications. Our approach addresses data scarcity and ethical concerns, offering a new solution for pain detection and opening new avenues for research in privacy-preserving dataset generation. All resources are publicly available to encourage further innovation in this field.

[CV-95] Instance Segmentation of Reinforced Concrete Bridges with Synthetic Point Clouds

Link: https://arxiv.org/abs/2409.16381
Authors: Asad Ur Rahman, Vedhus Hoskere
Keywords: Standards require detailed, Inspection Standards require, National Bridge Inspection, Bridge Inspection Standards, Standards require
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 33 pages, 12 figures, Submitted to “Automation in Construction”

Abstract:The National Bridge Inspection Standards require detailed element-level bridge inspections. Traditionally, inspectors manually assign condition ratings by rating structural components based on damage, but this process is labor-intensive and time-consuming. Automating the element-level bridge inspection process can facilitate more comprehensive condition documentation to improve overall bridge management. While semantic segmentation of bridge point clouds has been studied, research on instance segmentation of bridge elements is limited, partly due to the lack of annotated datasets, and the difficulty in generalizing trained models. To address this, we propose a novel approach for generating synthetic data using three distinct methods. Our framework leverages the Mask3D transformer model, optimized with hyperparameter tuning and a novel occlusion technique. The model achieves state-of-the-art performance on real LiDAR and photogrammetry bridge point clouds, respectively, demonstrating the potential of the framework for automating element-level bridge inspections.

[CV-96] Development and Application of a Sentinel-2 Satellite Imagery Dataset for Deep-Learning Driven Forest Wildfire Detection

Link: https://arxiv.org/abs/2409.16380
Authors: Valeria Martin, K. Brent Venable, Derek Morgan
Keywords: increasing global challenge, demands advanced analytical, advanced analytical methods, Forest loss due, natural events
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Forest loss due to natural events, such as wildfires, represents an increasing global challenge that demands advanced analytical methods for effective detection and mitigation. To this end, the integration of satellite imagery with deep learning (DL) methods has become essential. Nevertheless, this approach requires substantial amounts of labeled data to produce accurate results. In this study, we use bi-temporal Sentinel-2 satellite imagery sourced from Google Earth Engine (GEE) to build the California Wildfire GeoImaging Dataset (CWGID), a high-resolution labeled satellite imagery dataset with over 100,000 labeled before and after forest wildfire image pairs for wildfire detection through DL. Our methods include data acquisition from authoritative sources, data processing, and an initial dataset analysis using three pre-trained Convolutional Neural Network (CNN) architectures. Our results show that the EF EfficientNet-B0 model achieves the highest accuracy of over 92% in detecting forest wildfires. The CWGID and the methodology used to build it prove to be a valuable resource for training and testing DL architectures for forest wildfire detection.

[CV-97] Damage detection in an uncertain nonlinear beam based on stochastic Volterra series: an experimental application

Link: https://arxiv.org/abs/2409.16305
Authors: Luis Gustavo Gioacon Villani, Samuel da Silva, Americo Cunha Jr, Michael D. Todd
Keywords: damage detection problem, intrinsically nonlinear behavior, experimental data variation, data variation, nonlinear
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP)
Comments:

Abstract:The damage detection problem becomes a more difficult task when the intrinsically nonlinear behavior of the structures and the natural data variation are considered in the analysis, because both phenomena can be confused with damage if linear and deterministic approaches are implemented. Therefore, this work aims at the experimental application of a stochastic version of the Volterra series combined with a novelty detection approach to detect damage in an initially nonlinear system, taking into account the measured data variation caused by the presence of uncertainties. The experimental setup is composed of a cantilever beam operating in a nonlinear regime of motion, even in the healthy condition, induced by the presence of a magnet near the free extremity. The damage associated with mass changes in a bolted connection (loosened nuts) is detected based on the comparison between linear and nonlinear contributions of the stochastic Volterra kernels in the total response, estimated in the reference and damaged conditions. The experimental measurements were performed on different days to add natural variation to the measured data. The results obtained through the proposed stochastic approach are compared with those obtained by the deterministic version of the Volterra series, showing the advantage of using the stochastic model when the experimental data variation is considered, with the capability to detect the presence of the damage with statistical confidence. In addition, the nonlinear metric showed a higher sensitivity to the occurrence of the damage than the linear one, justifying the application of a nonlinear metric when the system exhibits intrinsically nonlinear behavior.

[CV-98] LiDAR-3DGS: LiDAR Reinforced 3D Gaussian Splatting for Multimodal Radiance Field Rendering

Link: https://arxiv.org/abs/2409.16296
Authors: Hansol Lim, Hanbeom Chang, Jongseong Brad Choi, Chul Min Yeum
Keywords: Gaussian Splatting, Radiance Field Rendering, based Radiance Field, explore the capabilities, capabilities of multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments:

Abstract:In this paper, we explore the capabilities of multimodal inputs to 3D Gaussian Splatting (3DGS) based Radiance Field Rendering. We present LiDAR-3DGS, a novel method of reinforcing 3DGS inputs with LiDAR generated point clouds to significantly improve the accuracy and detail of 3D models. We demonstrate a systematic approach of LiDAR reinforcement to 3DGS to enable capturing of important features such as bolts, apertures, and other details that are often missed by image-based features alone. These details are crucial for engineering applications such as remote monitoring and maintenance. Without modifying the underlying 3DGS algorithm, we demonstrate that even a modest addition of LiDAR generated point cloud significantly enhances the perceptual quality of the models. At 30k iterations, the model generated by our method resulted in an increase of 7.064% in PSNR and 0.565% in SSIM, respectively. Since the LiDAR used in this research was a commonly used commercial-grade device, the improvements observed were modest and can be further enhanced with higher-grade LiDAR systems. Additionally, these improvements can be supplementary to other derivative works of Radiance Field Rendering and also provide a new insight for future LiDAR and computer vision integrated modeling.
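
Since the reported gains are stated as percentages of PSNR and SSIM, a short sketch of how PSNR and a relative gain are computed may help when reading the numbers. It assumes 8-bit images flattened to lists of pixel values and assumes that the quoted percentages are relative increases over a baseline, which the abstract does not state explicitly. The function names and sample values are hypothetical.

```python
import math


def mean_squared_error(reference, rendered):
    """Average squared difference between two equally sized pixel lists."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    error = mean_squared_error(reference, rendered)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


def relative_gain_percent(baseline, improved):
    """Relative change of a metric, expressed in percent of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (improved - baseline) / baseline
```

Because PSNR is logarithmic, a relative gain of a few percent in dB can correspond to a noticeably larger reduction in mean squared error, which is worth keeping in mind when comparing it with the SSIM figure.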

[CV-99] GenCAD: Image-Conditioned Computer-Aided Design Generation with Transformer-Based Contrastive Representation and Diffusion Priors

Link: https://arxiv.org/abs/2409.16294
Authors: Md Ferdous Alam, Faez Ahmed
Keywords: unintuitive design tools, remains a highly, solids and unintuitive, CAD, creation of manufacturable
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 24 pages, 13 figures

Abstract:The creation of manufacturable and editable 3D shapes through Computer-Aided Design (CAD) remains a highly manual and time-consuming task, hampered by the complex topology of boundary representations of 3D solids and unintuitive design tools. This paper introduces GenCAD, a generative model that employs autoregressive transformers and latent diffusion models to transform image inputs into parametric CAD command sequences, resulting in editable 3D shape representations. GenCAD integrates an autoregressive transformer-based architecture with a contrastive learning framework, enhancing the generation of CAD programs from input images and providing a representation learning framework for multiple data modalities relevant to engineering designs. Extensive evaluations demonstrate that GenCAD significantly outperforms existing state-of-the-art methods in terms of the precision and modifiability of generated 3D shapes. Notably, GenCAD shows a marked improvement in the accuracy of 3D shape generation for long sequences, supporting its application in complex design tasks. Additionally, the contrastive embedding feature of GenCAD facilitates the retrieval of CAD models using image queries from databases which is a critical challenge within the CAD community. While most work in the 3D shape generation literature focuses on representations like meshes, voxels, or point clouds, practical engineering applications demand modifiability and the ability for multi-modal conditional generation. Our results provide a significant step forward in this direction, highlighting the potential of generative models to expedite the entire design-to-production pipeline and seamlessly integrate different design modalities.

[CV-100] Explaining Human Comparisons using Alignment-Importance Heatmaps

Link: https://arxiv.org/abs/2409.16292
Authors: Nhut Truong, Dario Pesenti, Uri Hasson
Keywords: computational explainability approach, Deep Neural Network, Alignment Importance Score, Alignment Importance, AIS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a computational explainability approach for human comparison tasks, using Alignment Importance Score (AIS) heatmaps derived from deep-vision models. The AIS reflects a feature-map’s unique contribution to the alignment between Deep Neural Network’s (DNN) representational geometry and that of humans. We first validate the AIS by showing that prediction of out-of-sample human similarity judgments is improved when constructing representations using only higher-scoring AIS feature maps identified from a training set. We then compute image-specific heatmaps that visually indicate the areas that correspond to feature-maps with higher AIS scores. These maps provide an intuitive explanation of which image areas are more important when it is compared to other images in a cohort. We observe a correspondence between these heatmaps and saliency maps produced by a gaze-prediction model. However, in some cases, meaningful differences emerge, as the dimensions relevant for comparison are not necessarily the most visually salient. To conclude, Alignment Importance improves prediction of human similarity judgments from DNN embeddings, and provides interpretable insights into the relevant information in image space.

[CV-101] Classification of Gleason Grading in Prostate Cancer Histopathology Images Using Deep Learning Techniques: YOLO Vision Transformers and Vision Mamba

Link: https://arxiv.org/abs/2409.17122
Authors: Amin Malekmohammadi, Ali Badiezadeh, Seyed Mostafa Mirhassani, Parisa Gifani, Majid Vafaeezadeh
Keywords: issues impacting men, leading health issues, health issues impacting, Gleason scoring system, scoring system serving
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Prostate cancer ranks among the leading health issues impacting men, with the Gleason scoring system serving as the primary method for diagnosis and prognosis. This system relies on expert pathologists to evaluate samples of prostate tissue and assign a Gleason grade, a task that requires significant time and manual effort. To address this challenge, artificial intelligence (AI) solutions have been explored to automate the grading process. In light of these challenges, this study evaluates and compares the effectiveness of three deep learning methodologies, YOLO, Vision Transformers, and Vision Mamba, in accurately classifying Gleason grades from histopathology images. The goal is to enhance diagnostic precision and efficiency in prostate cancer management. This study utilized two publicly available datasets, Gleason2019 and SICAPv2, to train and test the performance of YOLO, Vision Transformers, and Vision Mamba models. Each model was assessed based on its ability to classify Gleason grades accurately, considering metrics such as false positive rate, false negative rate, precision, and recall. The study also examined the computational efficiency and applicability of each method in a clinical setting. Vision Mamba demonstrated superior performance across all metrics, achieving high precision and recall rates while minimizing false positives and negatives. YOLO showed promise in terms of speed and efficiency, particularly beneficial for real-time analysis. Vision Transformers excelled in capturing long-range dependencies within images, although they presented higher computational complexity compared to the other models. Vision Mamba emerges as the most effective model for Gleason grade classification in histopathology images, offering a balance between accuracy and computational efficiency.
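
The four metrics named in the abstract can be computed per Gleason grade in a one-vs-rest fashion, as sketched below. The evaluation protocol is an assumption, since the abstract does not say how the multi-class results were aggregated. The label values and function names are hypothetical.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN when one grade is treated as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, false positive rate, and false negative rate."""

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a class never occurs.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }
```

Recall and false negative rate always sum to 1 for a class that occurs at least once, so reporting both mainly serves readability.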

[CV-102] Automated Surgical Skill Assessment in Endoscopic Pituitary Surgery using Real-time Instrument Tracking on a High-fidelity Bench-top Phantom

Link: https://arxiv.org/abs/2409.17025
Authors: Adrito Das,Bilal Sidiqi,Laurent Mennillo,Zhehua Mao,Mikael Brudfors,Miguel Xochicale,Danyal Z. Khan,Nicola Newall,John G. Hanrahan,Matthew J. Clarkson,Danail Stoyanov,Hani J. Marcus,Sophia Bano
Keywords: improved patient outcomes, domain specific expertise, requires domain specific, improved patient, Improved surgical skill
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures

Abstract:Improved surgical skill is generally associated with improved patient outcomes, although assessment is subjective; labour-intensive; and requires domain specific expertise. Automated data driven metrics can alleviate these difficulties, as demonstrated by existing machine learning instrument tracking models in minimally invasive surgery. However, these models have been tested on limited datasets of laparoscopic surgery, with a focus on isolated tasks and robotic surgery. In this paper, a new public dataset is introduced, focusing on simulated surgery, using the nasal phase of endoscopic pituitary surgery as an exemplar. Simulated surgery allows for a realistic yet repeatable environment, meaning the insights gained from automated assessment can be used by novice surgeons to hone their skills on the simulator before moving to real surgery. PRINTNet (Pituitary Real-time INstrument Tracking Network) has been created as a baseline model for this automated assessment. Consisting of DeepLabV3 for classification and segmentation; StrongSORT for tracking; and the NVIDIA Holoscan SDK for real-time performance, PRINTNet achieved 71.9% Multiple Object Tracking Precision running at 22 Frames Per Second. Using this tracking output, a Multilayer Perceptron achieved 87% accuracy in predicting surgical skill level (novice or expert), with the “ratio of total procedure time to instrument visible time” correlated with higher surgical skill. This therefore demonstrates the feasibility of automated surgical skill assessment in simulated endoscopic pituitary surgery. The new publicly available dataset can be found here: this https URL.

[CV-103] PitRSDNet: Predicting Intra-operative Remaining Surgery Duration in Endoscopic Pituitary Surgery MICCAI

Link: https://arxiv.org/abs/2409.16998
Authors: Anjana Wijekoon,Adrito Das,Roxana R. Herrera,Danyal Z. Khan,John Hanrahan,Eleanor Carter,Valpuri Luoma,Danail Stoyanov,Hani J. Marcus,Sophia Bano
Keywords: Accurate intra-operative Remaining, Remaining Surgery Duration, intra-operative Remaining Surgery, administer anaesthetic agents, notify hospital staff
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the Augmented Environments for Computer-Assisted Interventions (AE-CAI) Workshop at the Medical Image Computing and Computer-Assisted Interventions (MICCAI) Conference 2024

Abstract: Accurate intra-operative Remaining Surgery Duration (RSD) predictions allow anaesthetists to decide more accurately when to administer anaesthetic agents and drugs, and when to notify hospital staff to send in the next patient. RSD therefore plays an important role in improving patient care and minimising surgical theatre costs via efficient scheduling. In endoscopic pituitary surgery, RSD prediction is uniquely challenging due to variable workflow sequences, with a selection of optional steps contributing to high variability in surgery duration. This paper presents PitRSDNet for predicting RSD during pituitary surgery, a spatio-temporal neural network model that learns from historical data focusing on workflow sequences. PitRSDNet integrates workflow knowledge into RSD prediction in two forms: 1) multi-task learning for concurrently predicting step and RSD; and 2) incorporating prior steps as context in temporal learning and inference. PitRSDNet is trained and evaluated on a new endoscopic pituitary surgery dataset with 88 videos to show competitive performance improvements over previous statistical and machine learning methods. The findings also highlight how PitRSDNet improves RSD precision on outlier cases by utilising the knowledge of prior steps.

[CV-104] Going Beyond U-Net: Assessing Vision Transformers for Semantic Segmentation in Microscopy Image Analysis ECCV2024

Link: https://arxiv.org/abs/2409.16940
Authors: Illia Tsiporenko,Pavel Chizhov,Dmytro Fishman
Keywords: crucial step, microscopy image analysis, Segmentation, Swin Transformer model, microscopy image
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: to be published in ECCV 2024 BioImage Computing Workshop

Abstract:Segmentation is a crucial step in microscopy image analysis. Numerous approaches have been developed over the past years, ranging from classical segmentation algorithms to advanced deep learning models. While U-Net remains one of the most popular and well-established models for biomedical segmentation tasks, recently developed transformer-based models promise to enhance the segmentation process of microscopy images. In this work, we assess the efficacy of transformers, including UNETR, the Segment Anything Model, and Swin-UPerNet, and compare them with the well-established U-Net model across various image modalities such as electron microscopy, brightfield, histopathology, and phase-contrast. Our evaluation identifies several limitations in the original Swin Transformer model, which we address through architectural modifications to optimise its performance. The results demonstrate that these modifications improve segmentation performance compared to the classical U-Net model and the unmodified Swin-UPerNet. This comparative analysis highlights the promise of transformer models for advancing biomedical image segmentation. It demonstrates that their efficiency and applicability can be improved with careful modifications, facilitating their future use in microscopy image analysis tools.

[CV-105] Moner: Motion Correction in Undersampled Radial MRI with Unsupervised Neural Representation

Link: https://arxiv.org/abs/2409.16921
Authors: Qing Wu,Chenhe Du,XuanYu Tian,Jingyi Yu,Yuyao Zhang,Hongjiang Wei
Keywords: challenging problem due, Motion correction, Motion, radial MRI, MoCo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 13 pages

Abstract: Motion correction (MoCo) in radial MRI is a challenging problem due to the unpredictability of the subject’s motion. Current state-of-the-art (SOTA) MoCo algorithms often use extensive high-quality MR images to pre-train neural networks, obtaining excellent reconstructions. However, the need for large-scale datasets significantly increases costs and limits model generalization. In this work, we propose Moner, an unsupervised MoCo method that jointly solves for artifact-free MR images and accurate motion from undersampled, rigid motion-corrupted k-space data, without requiring training data. Our core idea is to leverage the continuous prior of implicit neural representation (INR) to constrain this ill-posed inverse problem, enabling ideal solutions. Specifically, we incorporate a quasi-static motion model into the INR, granting it the ability to correct the subject’s motion. To stabilize model optimization, we reformulate radial MRI as a back-projection problem using the Fourier-slice theorem. Additionally, we propose a novel coarse-to-fine hash encoding strategy, significantly enhancing MoCo accuracy. Experiments on multiple MRI datasets show that Moner achieves performance comparable to SOTA MoCo techniques on in-domain data, while demonstrating significant improvements on out-of-domain data.

[CV-106] Towards General Text-guided Image Synthesis for Customized Multimodal Brain MRI Generation

Link: https://arxiv.org/abs/2409.16818
Authors: Yulin Wang,Honglin Xiong,Kaicong Sun,Shuwei Bai,Ling Dai,Zhongxiang Ding,Jiameng Liu,Qian Wang,Qian Liu,Dinggang Shen
Keywords: Multimodal brain magnetic, brain magnetic resonance, image synthesis, magnetic resonance, neuroscience and neurology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 9 figures

Abstract:Multimodal brain magnetic resonance (MR) imaging is indispensable in neuroscience and neurology. However, due to the accessibility of MRI scanners and their lengthy acquisition time, multimodal MR images are not commonly available. Current MR image synthesis approaches are typically trained on independent datasets for specific tasks, leading to suboptimal performance when applied to novel datasets and tasks. Here, we present TUMSyn, a Text-guided Universal MR image Synthesis generalist model, which can flexibly generate brain MR images with demanded imaging metadata from routinely acquired scans guided by text prompts. To ensure TUMSyn’s image synthesis precision, versatility, and generalizability, we first construct a brain MR database comprising 31,407 3D images with 7 MRI modalities from 13 centers. We then pre-train an MRI-specific text encoder using contrastive learning to effectively control MR image synthesis based on text prompts. Extensive experiments on diverse datasets and physician assessments indicate that TUMSyn can generate clinically meaningful MR images with specified imaging metadata in supervised and zero-shot scenarios. Therefore, TUMSyn can be utilized along with acquired MR scan(s) to facilitate large-scale MRI-based screening and diagnosis of brain diseases.

[CV-107] Let There Be Light: Robust Lensless Imaging Under External Illumination With Deep Learning

Link: https://arxiv.org/abs/2409.16766
Authors: Eric Bezzam,Stefan Peters,Martin Vetterli
Keywords: Lensless cameras relax, shifting image formation, digital post-processing, constraints of traditional, formation from analog
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, dataset: this https URL

Abstract:Lensless cameras relax the design constraints of traditional cameras by shifting image formation from analog optics to digital post-processing. While new camera designs and applications can be enabled, lensless imaging is very sensitive to unwanted interference (other sources, noise, etc.). In this work, we address a prevalent noise source that has not been studied for lensless imaging: external illumination e.g. from ambient and direct lighting. Being robust to a variety of lighting conditions would increase the practicality and adoption of lensless imaging. To this end, we propose multiple recovery approaches that account for external illumination by incorporating its estimate into the image recovery process. At the core is a physics-based reconstruction that combines learnable image recovery and denoisers, all of whose parameters are trained using experimentally gathered data. Compared to standard reconstruction methods, our approach yields significant qualitative and quantitative improvements. We open-source our implementations and a 25K dataset of measurements under multiple lighting conditions.

[CV-108] The Effect of Lossy Compression on 3D Medical Images Segmentation with Deep Learning MICCAI

Link: https://arxiv.org/abs/2409.16733
Authors: Anvar Kurmukov,Bogdan Zavolovich,Aleksandra Dalechina,Vladislav Proskurov,Boris Shirokikh
Keywords: critical tool, tool in decreasing, decreasing the cost, cost of storage, storage and improving
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures, 2 tables; accepted on MICCAI Workshop on Advancing Data Solutions in Medical Imaging AI

Abstract: Image compression is a critical tool in decreasing the cost of storage and improving the speed of transmission over the internet. While deep learning applications for natural images widely adopt lossy compression techniques, such compression is not widespread for 3D medical images. Using three CT datasets (17 tasks) and one MRI dataset (3 tasks), we demonstrate that lossy compression of up to 20 times has no negative impact on segmentation quality with deep neural networks (DNN). In addition, we demonstrate the ability of DNN models trained on compressed data to predict on uncompressed data and vice versa with no quality deterioration.

[CV-109] SDCL: Students Discrepancy-Informed Correction Learning for Semi-supervised Medical Image Segmentation MICCAI2024

Link: https://arxiv.org/abs/2409.16728
Authors: Bentao Song,Qingfeng Wang
Keywords: Semi-supervised medical image, medical labeled data, limited medical labeled, based SSMIS methods, SSMIS methods due
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2024

Abstract: Semi-supervised medical image segmentation (SSMIS) has demonstrated the potential to mitigate the issue of limited medical labeled data. However, confirmation and cognitive biases may affect the prevalent teacher-student based SSMIS methods due to erroneous pseudo-labels. To tackle this challenge, we improve the mean teacher approach and propose the Students Discrepancy-Informed Correction Learning (SDCL) framework, which includes two students and one non-trainable teacher and utilizes the segmentation difference between the two students to guide the self-correcting learning. The essence of SDCL is to identify the areas of segmentation discrepancy as the potential bias areas, and then encourage the model to review the correct cognition and rectify its own biases in these areas. To facilitate the bias correction learning with continuous review and rectification, two correction loss functions are employed to minimize the correct segmentation voxel distance and maximize the erroneous segmentation voxel entropy. We conducted experiments on three public medical image datasets: two 3D datasets (CT and MRI) and one 2D dataset (MRI). The results show that our SDCL surpasses the current State-of-the-Art (SOTA) methods by 2.57%, 3.04%, and 2.34% in the Dice score on the Pancreas, LA, and ACDC datasets, respectively. In addition, the accuracy of our method is very close to the fully supervised method on the ACDC dataset, and even exceeds the fully supervised method on the Pancreas and LA datasets. (Code available at this https URL).
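The improvements above are reported in the Dice score, which measures overlap between a predicted mask and a reference mask as 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that definition on flattened binary masks; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical six-voxel masks.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)
print(round(score, 4))
```

Here two voxels overlap and each mask has three foreground voxels, so the score is 4/6.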

[CV-110] 3DDX: Bone Surface Reconstruction from a Single Standard-Geometry Radiograph via Dual-Face Depth Estimation MICCAI2024

Link: https://arxiv.org/abs/2409.16702
Authors: Yi Gu,Yoshito Otake,Keisuke Uemura,Masaki Takao,Mazen Soufi,Seiji Okada,Nobuhiko Sugano,Hugues Talbot,Yoshinobu Sato
Keywords: low radiation exposure, Radiography is widely, radiation exposure, affordability and low, low radiation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024. 12 pages, 4 figures

Abstract:Radiography is widely used in orthopedics for its affordability and low radiation exposure. 3D reconstruction from a single radiograph, so-called 2D-3D reconstruction, offers the possibility of various clinical applications, but achieving clinically viable accuracy and computational efficiency is still an unsolved challenge. Unlike other areas in computer vision, X-ray imaging’s unique properties, such as ray penetration and fixed geometry, have not been fully exploited. We propose a novel approach that simultaneously learns multiple depth maps (front- and back-surface of multiple bones) derived from the X-ray image to computed tomography registration. The proposed method not only leverages the fixed geometry characteristic of X-ray imaging but also enhances the precision of the reconstruction of the whole surface. Our study involved 600 CT and 2651 X-ray images (4 to 5 posed X-ray images per patient), demonstrating our method’s superiority over traditional approaches with a surface reconstruction error reduction from 4.78 mm to 1.96 mm. This significant accuracy improvement and enhanced computational efficiency suggest our approach’s potential for clinical application.

[CV-111] TSBP: Improving Object Detection in Histology Images via Test-time Self-guided Bounding-box Propagation MICCAI2024

Link: https://arxiv.org/abs/2409.16678
Authors: Tingting Yang,Liang Xiao,Yizhe Zhang
Keywords: Earth Mover Distance, detection, TSBP, Test-time Self-guided Bounding-box, Self-guided Bounding-box Propagation
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI 2024

Abstract:A global threshold (e.g., 0.5) is often applied to determine which bounding boxes should be included in the final results for an object detection task. A higher threshold reduces false positives but may result in missing a significant portion of true positives. A lower threshold can increase detection recall but may also result in more false positives. Because of this, using a preset global threshold (e.g., 0.5) applied to all the bounding box candidates may lead to suboptimal solutions. In this paper, we propose a Test-time Self-guided Bounding-box Propagation (TSBP) method, leveraging Earth Mover’s Distance (EMD) to enhance object detection in histology images. TSBP utilizes bounding boxes with high confidence to influence those with low confidence, leveraging visual similarities between them. This propagation mechanism enables bounding boxes to be selected in a controllable, explainable, and robust manner, which surpasses the effectiveness of using simple thresholds and uncertainty calibration methods. Importantly, TSBP does not necessitate additional labeled samples for model training or parameter estimation, unlike calibration methods. We conduct experiments on gland detection and cell detection tasks in histology images. The results show that our proposed TSBP significantly improves detection outcomes when working in conjunction with state-of-the-art deep learning-based detection networks. Compared to other methods such as uncertainty calibration, TSBP yields more robust and accurate object detection predictions while using no additional labeled samples. The code is available at this https URL.
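The trade-off that motivates this paper, where a higher global threshold removes false positives but also drops true positives, can be seen on a toy example. The sketch below only illustrates plain global thresholding, the baseline the abstract criticises; it is not an implementation of TSBP, and the detections, scores, and helper names are all hypothetical.

```python
def filter_by_threshold(detections, threshold):
    """Keep only the boxes whose confidence score reaches the global threshold."""
    return [d for d in detections if d["score"] >= threshold]


def precision_recall(kept, num_ground_truth):
    """Precision and recall of the kept boxes against the ground-truth count."""
    true_positives = sum(1 for d in kept if d["is_true_object"])
    # With no kept boxes, precision is reported as 0.0 by convention here.
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / num_ground_truth
    return precision, recall


# Hypothetical detector output: four real objects, two spurious boxes.
detections = [
    {"score": 0.95, "is_true_object": True},
    {"score": 0.80, "is_true_object": True},
    {"score": 0.62, "is_true_object": False},
    {"score": 0.45, "is_true_object": True},
    {"score": 0.30, "is_true_object": False},
    {"score": 0.20, "is_true_object": True},
]

high = precision_recall(filter_by_threshold(detections, 0.70), num_ground_truth=4)
low = precision_recall(filter_by_threshold(detections, 0.25), num_ground_truth=4)
print("threshold 0.70:", high)
print("threshold 0.25:", low)
```

In this toy case, lowering the threshold from 0.70 to 0.25 raises recall while precision falls, which is the behaviour the abstract describes for any single preset threshold.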

[CV-112] Deep-Learning Recognition of Scanning Transmission Electron Microscopy: Quantifying and Mitigating the Influence of Gaussian Noises

Link: https://arxiv.org/abs/2409.16637
Authors: Hanlei Zhang,Jincheng Bai,Xiabo Chen,Can Li,Chuanjian Zhong,Jiye Fang,Guangwen Zhou
Keywords: Scanning transmission electron, transmission electron microscopy, attracting intensive interests, Scanning transmission, electron microscopy
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Scanning transmission electron microscopy (STEM) is a powerful tool for revealing the morphologies and structures of materials, and has therefore attracted intense interest from the scientific and industrial communities. The outstanding spatial (atomic level) and temporal (ms level) resolutions of STEM techniques generate large amounts of high-definition data, enabling high-volume and high-speed analysis of materials. On the other hand, processing the large datasets generated by STEM is time-consuming and beyond the capability of manual work, which urgently calls for computer-based automation. In this work, we present a deep-learning mask region-based neural network (Mask R-CNN) for the recognition of nanoparticles imaged by STEM, as well as for generating the associated dimensional analysis. The Mask R-CNN model was tested on simulated STEM-HAADF results with different Gaussian noise levels, particle shapes, and particle sizes, and the results indicated that Gaussian noise has a decisive influence on recognition accuracy. By applying Gaussian and Non-Local Means filters to the noise-containing STEM-HAADF results, the influence of noise is largely mitigated and recognition accuracy is significantly improved. This filtering-recognition approach was further applied to experimental STEM-HAADF results, yielding satisfactory accuracy compared with traditional threshold methods. The deep-learning-based method developed in this work has great potential for the analysis of the complicated structures and large data generated by STEM-HAADF.

[CV-113] Diffusion Models to Enhance the Resolution of Microscopy Images: A Tutorial

Link: https://arxiv.org/abs/2409.16488
Authors: Harshith Bachimanchi,Giovanni Volpe
Keywords: translation and super-resolution, neural networks, making their mark, generative modeling, modeling with neural
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
Comments: 45 pages, 8 figures

Abstract: Diffusion models have emerged as a prominent technique in generative modeling with neural networks, making their mark in tasks like text-to-image translation and super-resolution. In this tutorial, we provide a comprehensive guide to building denoising diffusion probabilistic models (DDPMs) from scratch, with a specific focus on transforming low-resolution microscopy images into their corresponding high-resolution versions. We provide the theoretical background, mathematical derivations, and a detailed Python code implementation using PyTorch, along with techniques to enhance model performance.
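For readers unfamiliar with DDPMs, the forward (noising) process has a closed form: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β_s) and ε is standard Gaussian noise. The sketch below shows that step in plain Python rather than PyTorch. It is not the tutorial's code: the function names are hypothetical, and the linear schedule from 1e-4 to 0.02 over 1000 steps is assumed here as a common default.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s <= t."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot using the closed-form forward process."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
pixels = [0.2, -0.5, 0.9, 0.0]  # hypothetical normalised pixel values
slightly_noisy = add_noise(pixels, t=10, alpha_bars=alpha_bars, rng=rng)
almost_pure_noise = add_noise(pixels, t=999, alpha_bars=alpha_bars, rng=rng)
```

Because ᾱ_t shrinks towards zero as t grows, late timesteps retain almost none of the original signal, which is what lets the reverse process start from pure noise.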

[CV-114] A novel open-source ultrasound dataset with deep learning benchmarks for spinal cord injury localization and anatomical segmentation

Link: https://arxiv.org/abs/2409.16441
Authors: Avisha Kumar,Kunal Kotkar,Kelly Jiang,Meghana Bhimreddy,Daniel Davidar,Carly Weber-Levine,Siddharth Krishnan,Max J. Kerensky,Ruixing Liang,Kelley Kempski Leadingham,Denis Routkevitch,Andrew M. Hersh,Kimberly Ashayeri,Betty Tyler,Ian Suk,Jennifer Son,Nicholas Theodore,Nitish Thakor,Amir Manbachi
Keywords: numerous domains, acquisition and annotation, catalyzed breakthroughs, breakthroughs across numerous, broader adoption
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:While deep learning has catalyzed breakthroughs across numerous domains, its broader adoption in clinical settings is inhibited by the costly and time-intensive nature of data acquisition and annotation. To further facilitate medical machine learning, we present an ultrasound dataset of 10,223 Brightness-mode (B-mode) images consisting of sagittal slices of porcine spinal cords (N=25) before and after a contusion injury. We additionally benchmark the performance metrics of several state-of-the-art object detection algorithms to localize the site of injury and semantic segmentation models to label the anatomy for comparison and creation of task-specific architectures. Finally, we evaluate the zero-shot generalization capabilities of the segmentation models on human ultrasound spinal cord images to determine whether training on our porcine dataset is sufficient for accurately interpreting human data. Our results show that the YOLOv8 detection model outperforms all evaluated models for injury localization, achieving a mean Average Precision (mAP50-95) score of 0.606. Segmentation metrics indicate that the DeepLabv3 segmentation model achieves the highest accuracy on unseen porcine anatomy, with a Mean Dice score of 0.587, while SAMed achieves the highest Mean Dice score generalizing to human anatomy (0.445). To the best of our knowledge, this is the largest annotated dataset of spinal cord ultrasound images made publicly available to researchers and medical professionals, as well as the first public report of object detection and segmentation architectures to assess anatomical markers in the spinal cord for methodology development and clinical applications.

[CV-115] Future-Proofing Medical Imaging with Privacy-Preserving Federated Learning and Uncertainty Quantification: A Review

Link: https://arxiv.org/abs/2409.16340
Authors: Nikolas Koutsoubis,Asim Waqas,Yasin Yilmaz,Ravi P. Ramachandran,Matthew Schabath,Ghulam Rasool
Keywords: Artificial Intelligence, robust Artificial intelligence, Artificial intelligence models, medical imaging tasks, demonstrated significant potential
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 5 figures, 4 tables, Review paper, preprint to Radiology AI. arXiv admin note: text overlap with arXiv:2406.12815

Abstract: Artificial Intelligence (AI) has demonstrated significant potential in automating various medical imaging tasks, which could soon become routine in clinical practice for disease diagnosis, prognosis, treatment planning, and post-treatment surveillance. However, the privacy concerns surrounding patient data present a major barrier to the widespread adoption of AI in medical imaging, as large, diverse training datasets are essential for developing accurate, generalizable, and robust AI models. Federated Learning (FL) offers a solution that enables organizations to train AI models collaboratively without sharing sensitive data. FL exchanges model training information, such as gradients, between the participating sites. Despite its promise, FL is still in its developmental stages and faces several challenges. Notably, sensitive information can still be inferred from the gradients shared during model training. Quantifying AI models’ uncertainty is vital due to potential data distribution shifts post-deployment, which can affect model performance. Uncertainty quantification (UQ) in FL is particularly challenging due to data heterogeneity across participating sites. This review provides a comprehensive examination of FL, privacy-preserving FL (PPFL), and UQ in FL. We identify key gaps in current FL methodologies and propose future research directions to enhance data privacy and trustworthiness in medical imaging applications.

[CV-116] Predicting Distance matrix with large language models

Link: https://arxiv.org/abs/2409.16333
Authors: Jiaxing Yang
Keywords: drawn significant attention, RNA, protein related research, long been considered, considered critical
Categories: Biomolecules (q-bio.BM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract: Structural prediction has long been considered critical in RNA research, especially following the success of AlphaFold2 in protein studies, which has drawn significant attention to the field. While recent advances in machine learning and data accumulation have effectively addressed many biological tasks, particularly in protein-related research, RNA structure prediction remains a significant challenge due to data limitations. Obtaining RNA structural data is difficult because traditional methods such as nuclear magnetic resonance spectroscopy, X-ray crystallography, and electron microscopy are expensive and time-consuming. Although several RNA 3D structure prediction methods have been proposed, their accuracy is still limited. Predicting RNA structural information at another level, such as distance maps, remains highly valuable. Distance maps provide a simplified representation of spatial constraints between nucleotides, capturing essential relationships without requiring a full 3D model. This intermediate level of structural information can guide more accurate 3D modeling and is computationally less intensive, making it a useful tool for improving structural predictions. In this work, we demonstrate that using only primary sequence information, we can accurately infer the distances between RNA bases by utilizing a large pretrained RNA language model coupled with a well-trained downstream transformer.
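A distance map, the prediction target described in this abstract, is simply the symmetric matrix of pairwise distances between nucleotides. The sketch below builds one from known 3D coordinates to show what the representation looks like; it does not reproduce the paper's language-model pipeline, and the coordinates and the function name are hypothetical.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between 3D points, as a nested list."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


# Hypothetical coordinates (in angstroms) of one reference atom per nucleotide.
nucleotide_coords = [
    (0.0, 0.0, 0.0),
    (3.0, 4.0, 0.0),
    (3.0, 4.0, 12.0),
]

dmap = distance_map(nucleotide_coords)
for row in dmap:
    print([round(d, 1) for d in row])
```

The matrix has a zero diagonal and is symmetric, so only the upper triangle carries information, which is part of why it is a lighter target than a full 3D model.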

[CV-117] MRI Radiomics for IDH Genotype Prediction in Glioblastoma Diagnosis

Link: https://arxiv.org/abs/2409.16329
Authors: Stanislav Kozák
Keywords: utilises automatically identified, automatically identified features, radiological scans, field which utilises, utilises automatically
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure

Abstract: Radiomics is a relatively new field which utilises automatically identified features from radiological scans. It has found widespread application, particularly in oncology, because many important oncological biomarkers are not visible to the naked eye. The recent advent of big data, including in medical imaging, and the development of new ML techniques have brought the possibility of faster and more accurate oncological diagnosis. Furthermore, standardised mathematical feature extraction based on radiomics helps to eliminate possible radiologist bias. This paper reviews recent developments in the oncological use of MRI radiomic features. It focuses on the identification of the isocitrate dehydrogenase (IDH) mutation status, which is an important biomarker for the diagnosis of glioblastoma and grade IV astrocytoma.

[CV-118] Developing a Thailand solar irradiance map using Himawari-8 satellite imageries and deep learning models

Link: https://arxiv.org/abs/2409.16320
Authors: Suwichaya Suwanwimolkul,Natanon Tongamrak,Nuttamon Thungka,Naebboon Hoonchareon,Jitkomut Songsiri
Keywords: Thailand solar irradiance, shows Thailand solar, GHI, presents an online, online platform
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 14 figures

Abstract:This paper presents an online platform that shows Thailand’s solar irradiance map every 30 minutes. It is available at this https URL. The methodology for estimating global horizontal irradiance (GHI) across Thailand relies on cloud index extracted from Himawari-8 satellite imagery, Ineichen clear-sky model with locally-tuned Linke turbidity, and machine learning models. The methods take clear-sky irradiance, cloud index, re-analyzed GHI and temperature data from the MERRA-2 database, and date-time as inputs for GHI estimation models, including LightGBM, LSTM, Informer, and Transformer. These are benchmarked with the estimate from the SolCast service by evaluation of 15-minute ground GHI data from 53 ground stations over 1.5 years during 2022-2023. The results show that the four models have competitive performances and outperform the SolCast service. The best model is LightGBM, with an MAE of 78.58 W/sqm and RMSE of 118.97 W/sqm. Obtaining re-analyzed MERRA-2 data for Thailand is not economically feasible for deployment. When removing these features, the Informer model has a winning performance of 78.67 W/sqm in MAE. The obtained performance aligns with existing literature by taking the climate zone and time granularity of data into consideration. As the map shows an estimate of GHI over 93,000 grids with a frequent update, the paper also describes a computational framework for displaying the entire map. It tests the runtime performance of deep learning models in the GHI estimation process.
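The two error measures quoted above, MAE and RMSE in W/sqm, compare estimated GHI against ground-station measurements. A minimal sketch of both definitions follows; the sample values and function names are hypothetical and unrelated to the paper's data.

```python
import math


def mean_absolute_error(estimated, measured):
    """Average magnitude of the estimation error."""
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / len(measured)


def root_mean_squared_error(estimated, measured):
    """Square root of the mean squared error; penalises large misses more."""
    squared = sum((e - m) ** 2 for e, m in zip(estimated, measured))
    return math.sqrt(squared / len(measured))


# Hypothetical GHI values in W/sqm at four timestamps.
estimated_ghi = [500.0, 620.0, 300.0, 810.0]
measured_ghi = [480.0, 650.0, 310.0, 760.0]

mae = mean_absolute_error(estimated_ghi, measured_ghi)
rmse = root_mean_squared_error(estimated_ghi, measured_ghi)
print(mae, round(rmse, 2))
```

RMSE is never smaller than MAE on the same data, so a wide gap between the two, as in the figures reported in the abstract, suggests that a minority of large errors dominates.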

[CV-119] Computer Aided Detection and Classification of mammograms using Convolutional Neural Network

Link: https://arxiv.org/abs/2409.16290
Authors: Kashif Ishaq,Muhammad Mustagis
Keywords: Breast cancer, death among women, cancer, Breast, lung cancer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract: Breast cancer is one of the leading causes of death among women, after lung cancer. Advances in breast cancer detection can increase patient survival rates through earlier detection. Detecting breast cancer with mammographic imaging is now considered a crucial step for computer-aided systems. Researchers have described many techniques for the automatic detection of early tumors. The early symptoms of breast cancer include masses and micro-calcifications. Because tumors vary in shape, size, and position, it is difficult to separate abnormal regions from normal tissue. Machine learning can help medical professionals diagnose the disease more accurately, and deep learning with neural networks is one method that can be used to distinguish normal from abnormal breasts. In this study, we use a convolutional neural network (CNN) on mammograms to classify breast masses as normal or abnormal. The DDSM dataset is used, in which nearly 460 images are of normal breasts and 920 are of abnormal breasts.

Machine Learning

[LG-0] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Link: https://arxiv.org/abs/2409.17146
Authors: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
Keywords: Today most advanced, advanced multimodal models, multimodal models remain, advanced multimodal, models remain proprietary
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: (none)

Abstract: Today’s most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild QA and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at this https URL.

[LG-1] DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion ALT

Link: https://arxiv.org/abs/2409.17145
Authors: Yukun Huang, Jianan Wang, Ailing Zeng, Zheng-Jun Zha, Lei Zhang, Xihui Liu
Keywords: shown promising results, Leveraging pretrained, score distillation sampling, Skeleton-guided Score Distillation, Gaussian Avatar representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging. In this work, we present DreamWaltz-G, a novel learning framework for animatable 3D avatar generation from text. The core of this framework lies in Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision in terms of view and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on the efficient 3D Gaussians, combining neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation. Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.

[LG-2] Differential Privacy Regularization: Protecting Training Data Through Loss Function Regularization

Link: https://arxiv.org/abs/2409.17144
Authors: Francisco Aguilera-Martínez, Fernando Berzal
Keywords: Training machine learning, networks requires large, neural networks requires, machine learning models, learning models based
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
Comments: (none)

Abstract:Training machine learning models based on neural networks requires large datasets, which may contain sensitive information. The models, however, should not expose private information from these datasets. Differentially private SGD [DP-SGD] requires the modification of the standard stochastic gradient descent [SGD] algorithm for training new models. In this short paper, a novel regularization strategy is proposed to achieve the same goal in a more efficient manner.

[LG-3] FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression

Link: https://arxiv.org/abs/2409.17141
Authors: Fazal Mittu, Yihuan Bu, Akshat Gupta, Ashok Devireddy, Alp Eren Ozdarendeli, Anant Singh, Gopala Anumanchipalli
Keywords: language modeling objective, text compression, compression, text compression systems, practical text compression
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: (none)

Abstract:While the language modeling objective has been shown to be deeply connected with compression, it is surprising that modern LLMs are not employed in practical text compression systems. In this paper, we provide an in-depth analysis of neural network and transformer-based compression techniques to answer this question. We compare traditional text compression systems with neural network and LLM-based text compression methods. Although LLM-based systems significantly outperform conventional compression methods, they are highly impractical. Specifically, LLMZip, a recent text compression system using Llama3-8B requires 9.5 days to compress just 10 MB of text, although with huge improvements in compression ratios. To overcome this, we present FineZip - a novel LLM-based text compression system that combines ideas of online memorization and dynamic context to reduce the compression time immensely. FineZip can compress the above corpus in approximately 4 hours compared to 9.5 days, a 54 times improvement over LLMZip and comparable performance. FineZip outperforms traditional algorithmic compression methods with a large margin, improving compression ratios by approximately 50%. With this work, we take the first step towards making lossless text compression with LLMs a reality. While FineZip presents a significant step in that direction, LLMs are still not a viable solution for large-scale text compression. We hope our work paves the way for future research and innovation to solve this problem.
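The link between prediction and compression mentioned in this abstract can be illustrated with a toy rank-coding scheme. The sketch below is not FineZip's pipeline: `ToyPredictor` is a hypothetical stand-in for an LLM (adaptive order-1 byte counts), and the ranks are simply handed to `zlib`. It only shows that when encoder and decoder share the same predictor, storing "how highly the model ranked the true symbol" is enough to recover the text losslessly.

```python
import zlib
from collections import defaultdict


class ToyPredictor:
    """Hypothetical stand-in for an LLM: adaptive order-1 byte counts."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0] * 256)

    def ranking(self, prev):
        row = self.counts[prev]
        # Most frequent byte first; ties broken by byte value so that
        # the encoder and decoder always agree on the ordering.
        return sorted(range(256), key=lambda b: (-row[b], b))

    def update(self, prev, cur):
        self.counts[prev][cur] += 1


def compress(data: bytes) -> bytes:
    model, prev, ranks = ToyPredictor(), 0, bytearray()
    for cur in data:
        # Store the rank of the true byte under the current prediction.
        ranks.append(model.ranking(prev).index(cur))
        model.update(prev, cur)
        prev = cur
    # Well-predicted text yields many small ranks, which zlib then packs.
    return zlib.compress(bytes(ranks), 9)


def decompress(blob: bytes) -> bytes:
    model, prev, out = ToyPredictor(), 0, bytearray()
    for rank in zlib.decompress(blob):
        # Replay the same predictor to turn each rank back into a byte.
        cur = model.ranking(prev)[rank]
        model.update(prev, cur)
        out.append(cur)
        prev = cur
    return bytes(out)


text = b"the cat sat on the mat; the cat sat on the hat"
restored = decompress(compress(text))
```

The better the predictor, the more the rank stream concentrates on small values. Running a large model once per token is also where the cost described in the abstract comes from.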

[LG-4] Learning with Dynamics: Autonomous Regulation of UAV Based Communication Networks with Dynamic UAV Crew

Link: https://arxiv.org/abs/2409.17139
Authors: Ran Zhang, Bowei Li, Liyuan Zhang, Jiang (Linda) Xie, Miao Wang
Keywords: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, based communication networks, future mobile networking
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 7 pages, 6 figures, magazine paper

Abstract: Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) are a key component in future mobile networking. To handle the dynamic environments in UCNs, reinforcement learning (RL) has been a promising solution attributed to its strong capability of adaptive decision-making free of the environment models. However, most existing RL-based research focuses on control strategy design assuming a fixed set of UAVs. Few works have investigated how UCNs should be adaptively regulated when the serving UAVs change dynamically. This article discusses RL-based strategy design for adaptive UCN regulation given a dynamic UAV set, addressing both reactive strategies in general UCNs and proactive strategies in solar-powered UCNs. An overview of the UCN and the RL framework is first provided. Potential research directions with key challenges and possible solutions are then elaborated. Some of our recent works are presented as case studies to inspire innovative ways to handle dynamic UAV crew with different RL algorithms.

[LG-5] PACE: marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization NEURIPS2024

Link: https://arxiv.org/abs/2409.17137
Authors: Yao Ni, Shan Zhang, Piotr Koniusz
Keywords: effectively adapts pre-trained, adapts pre-trained vision, pre-trained vision transformers, effectively adapts, vision transformers
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2024 as a spotlight. This preliminary version will soon be extended with the experiments and analyses from the rebuttal

Abstract: Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained vision transformers to downstream tasks. However, the optimization for task performance often comes at the cost of generalizability in fine-tuned models. To address this issue, we theoretically connect smaller weight gradient norms during training and larger datasets to improved model generalization. Motivated by this connection, we propose reducing gradient norms for enhanced generalization and aligning the fine-tuned model with the pre-trained counterpart to retain knowledge from large-scale pre-training data. Yet, naive alignment does not guarantee gradient reduction and can potentially cause gradient explosion, complicating efforts to manage gradients. To address such issues, we propose PACE, marrying generalization of PArameter-efficient fine-tuning with Consistency rEgularization. We perturb features learned from the adapter with multiplicative noise and ensure the fine-tuned model remains consistent for the same sample under different perturbations. Theoretical analysis shows that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. Experimental evidence supports our theories. PACE outperforms existing PEFT methods in four visual adaptation tasks: VTAB-1k, FGVC, few-shot learning and domain adaptation. Code will be available at this https URL

[LG-6] Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset

Link: https://arxiv.org/abs/2409.17126
Authors: Andrew Goldberg, Kavish Kondap, Tianshuang Qiu, Zehan Ma, Letian Fu, Justin Kerr, Huang Huang, Kaiyuan Chen, Kuan Fang, Ken Goldberg
Keywords: shown impressive capabilities, creating text, shown impressive, impressive capabilities, capabilities in creating
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 7 Figures

Abstract: Generative AI systems have shown impressive capabilities in creating text, code, and images. Inspired by the rich history of research in industrial “Design for Assembly”, we introduce a novel problem: Generative Design-for-Robot-Assembly (GDfRA). The task is to generate an assembly based on a natural language prompt (e.g., “giraffe”) and an image of available physical components, such as 3D-printed blocks. The output is an assembly, a spatial arrangement of these components, and instructions for a robot to build this assembly. The output must 1) resemble the requested object and 2) be reliably assembled by a 6 DoF robot arm with a suction gripper. We then present Blox-Net, a GDfRA system that combines generative vision language models with well-established methods in computer vision, simulation, perturbation analysis, motion planning, and physical robot experimentation to solve a class of GDfRA problems with minimal human supervision. Blox-Net achieved a Top-1 accuracy of 63.5% in the “recognizability” of its designed assemblies (e.g., resembling a giraffe as judged by a VLM). These designs, after automated perturbation redesign, were reliably assembled by a robot, achieving near-perfect success across 10 consecutive assembly iterations with human intervention only during reset prior to assembly. Surprisingly, this entire design process from textual word (“giraffe”) to reliable physical assembly is performed with zero human intervention.

[LG-7] Deep Learning and Machine Learning Advancing Big Data Analytics and Management: Handy Appetizer

Link: https://arxiv.org/abs/2409.17120
Authors: Benji Peng, Xuanhe Pan, Yizhu Wen, Ziqian Bi, Keyu Chen, Ming Li, Ming Liu, Qian Niu, Junyu Liu, Jinlang Wang, Sen Zhang, Jiawei Xu, Pohsun Feng
Keywords: Artificial Intelligence, Convolutional Neural Networks, Deep Learning, Machine Learning, role of Artificial
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This book contains 93 pages and 60 figures

Abstract:This book explores the role of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) in driving the progress of big data analytics and management. The book focuses on simplifying the complex mathematical concepts behind deep learning, offering intuitive visualizations and practical case studies to help readers understand how neural networks and technologies like Convolutional Neural Networks (CNNs) work. It introduces several classic models and technologies such as Transformers, GPT, ResNet, BERT, and YOLO, highlighting their applications in fields like natural language processing, image recognition, and autonomous driving. The book also emphasizes the importance of pre-trained models and how they can enhance model performance and accuracy, with instructions on how to apply these models in various real-world scenarios. Additionally, it provides an overview of key big data management technologies like SQL and NoSQL databases, as well as distributed computing frameworks such as Apache Hadoop and Spark, explaining their importance in managing and processing vast amounts of data. Ultimately, the book underscores the value of mastering deep learning and big data management skills as critical tools for the future workforce, making it an essential resource for both beginners and experienced professionals.

[LG-8] Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Link: https://arxiv.org/abs/2409.17115
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
Keywords: numerous rules developed, Large language model, Large language, resulting in numerous, developed to date
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 13 figures, 34 tables

Abstract: Large language model pre-training has traditionally relied on human experts to craft heuristics for improving the corpora quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of individual examples effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform either original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, with 14.6% for Llama-2-7B and 20.3% for CodeLlama-7B, all within 10B tokens to be comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX with 100B corpus, models, and sharing all training and implementation details for reproducible research and future innovation. Code: this https URL
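The idea of treating data refinement as a program that is generated per example and then executed can be sketched as a small interpreter. Everything here is an illustrative assumption rather than the ProX implementation: the operation names (`remove_lines`, `nfkc`, `normalize`), their parameters, and the hand-written `program` (which in the paper's setting would be emitted by a small language model) are all hypothetical.

```python
import unicodedata


def normalize(text, source, target):
    """Replace every occurrence of `source` with `target`."""
    return text.replace(source, target)


def remove_lines(text, start, end):
    """Drop lines start..end (0-indexed, inclusive)."""
    lines = text.split("\n")
    return "\n".join(lines[:start] + lines[end + 1:])


def nfkc(text):
    """Apply Unicode NFKC normalization (e.g. fullwidth to ASCII letters)."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical operation table; a real system would define its own set.
OPS = {"normalize": normalize, "remove_lines": remove_lines, "nfkc": nfkc}


def run_program(text, program):
    """Apply (op_name, kwargs) steps in order; unknown ops raise KeyError."""
    for name, kwargs in program:
        text = OPS[name](text, **kwargs)
    return text


doc = "Home | Login | Share\nＰｒｏＸ  refines data.\nPage 3 of 9"

# Hand-written here; assumed to be generated per document by a model.
program = [
    ("remove_lines", {"start": 2, "end": 2}),  # trailing page footer
    ("remove_lines", {"start": 0, "end": 0}),  # navigation bar
    ("nfkc", {}),
    ("normalize", {"source": "  ", "target": " "}),
]

cleaned = run_program(doc, program)
```

Restricting the model to a fixed table of operations keeps execution cheap and auditable, because each document's program can be logged and replayed.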

[LG-9] Characterizing stable regions in the residual stream of LLMs

Link: https://arxiv.org/abs/2409.17113
Authors: Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim
Keywords: output remains insensitive, exhibits high sensitivity, model output remains, stream of Transformers, residual stream
Categories: Machine Learning (cs.LG)
Comments: (none)

Abstract:We identify “stable regions” in the residual stream of Transformers, where the model’s output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions, where similar prompts cluster within regions, and activations from the same region lead to similar next token predictions.

[LG-10] Accumulator-Aware Post-Training Quantization

Link: https://arxiv.org/abs/2409.17092
Authors: Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab
Keywords: investigated low-precision accumulation, studies have investigated, investigated low-precision, PTQ algorithms, recent studies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
Comments: (none)

Abstract:Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.
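To make "overflow avoidance guarantees" concrete, here is a minimal sketch of a standard worst-case argument, not the AXE algorithm itself. It assumes unsigned `input_bits`-bit activations and a signed `acc_bits`-bit accumulator. Under those assumptions, a dot product cannot overflow if the L1 norm of the integer weights stays within a fixed budget. The function names are hypothetical.

```python
def l1_budget(acc_bits, input_bits):
    """Largest weight L1 norm that can never overflow the accumulator.

    Assumes unsigned inputs in [0, 2**input_bits - 1] and a signed
    accumulator whose largest magnitude is 2**(acc_bits - 1) - 1.
    """
    max_input = 2 ** input_bits - 1
    max_acc = 2 ** (acc_bits - 1) - 1
    return max_acc // max_input


def is_overflow_safe(weights, acc_bits, input_bits):
    """Sufficient (not necessary) check: ||w||_1 within the budget."""
    return sum(abs(w) for w in weights) <= l1_budget(acc_bits, input_bits)


def worst_case_sum(weights, input_bits):
    """Most negative and most positive dot product over all valid inputs."""
    max_input = 2 ** input_bits - 1
    pos = sum(w for w in weights if w > 0) * max_input
    neg = sum(w for w in weights if w < 0) * max_input
    return neg, pos


budget = l1_budget(acc_bits=16, input_bits=8)
safe = is_overflow_safe([3, -4, 5, -2], acc_bits=16, input_bits=8)
lo, hi = worst_case_sum([3, -4, 5, -2], input_bits=8)
```

The L1 check is conservative: a weight vector can exceed the budget and still never overflow, because positive and negative weights cannot both reach their extremes in the same direction. An accumulator-aware quantizer can enforce a constraint of this kind while choosing the quantized weights, so narrower accumulators trade off against model accuracy.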

[LG-11] Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification

Link: https://arxiv.org/abs/2409.17091
Authors: Xinrui Zhou, Yuhao Huang, Haoran Dou, Shijing Chen, Ao Chang, Jia Liu, Weiran Long, Jian Zheng, Erjiao Xu, Jie Ren, Ruobing Huang, Jun Cheng, Wufeng Xue, Dong Ni
Keywords: labor-intensive annotation processes, annotation processes hinder, deep models, limited availability, availability of large-scale
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 7 figures, 7 tables

Abstract:In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue, having been proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation, and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases and severely limiting the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantic- and sequential-customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained on 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-domain conditions.

[LG-12] Locally Regularized Sparse Graph by Fast Proximal Gradient Descent UAI2023

Link: https://arxiv.org/abs/2409.17090
Authors: Dongfang Sun, Yingzhen Yang
Keywords: Sparse graphs built, Regularized Sparse Graph, sparse graph, sparse graph ignores, vanilla sparse graph
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted by UAI2023

Abstract: Sparse graphs built by sparse representation have been demonstrated to be effective in clustering high-dimensional data. Despite the compelling empirical performance, the vanilla sparse graph ignores the geometric information of the data by performing sparse representation for each datum separately. In order to obtain a sparse graph aligned with the local geometric structure of data, we propose a novel Support Regularized Sparse Graph, abbreviated as SRSG, for data clustering. SRSG encourages local smoothness on the neighborhoods of nearby data points by a well-defined support regularization term. We propose a fast proximal gradient descent method to solve the non-convex optimization problem of SRSG with the convergence matching Nesterov’s optimal convergence rate of first-order methods on smooth and convex objective functions with Lipschitz continuous gradient. Extensive experimental results on various real data sets demonstrate the superiority of SRSG over other competing clustering methods.

[LG-13] Efficient Feature Interactions with Transformers: Improving User Spending Propensity Predictions in Gaming

Link: https://arxiv.org/abs/2409.17077
Authors: Ved Prakash, Kartavya Kothari
Keywords: real-life sports events, fantasy sports platform, virtual teams, teams for real-life, sports events
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Abstract: Dream11 is a fantasy sports platform that allows users to create their own virtual teams for real-life sports events. We host multiple sports and matches for our 200M+ user base. In this RMG (real money gaming) setting, users pay an entry amount to participate in various contest products that we provide to users. In our current work, we discuss the problem of predicting the user’s propensity to spend in a gaming round, so it can be utilized for various downstream applications, e.g., upselling users by incentivizing them marginally as per their spending propensity, or personalizing the product listing based on the user’s propensity to spend. We aim to model the spending propensity of each user based on past transaction data. In this paper, we benchmark tree-based and deep-learning models that show good results on structured data, and we propose a new architecture change that is specifically designed to capture the rich interactions among the input features. We show that our proposed architecture outperforms the existing models on the task of predicting the user’s propensity to spend in a gaming round. Our new transformer model surpasses the state-of-the-art FT-Transformer, improving MAE by 2.5% and MSE by 21.8%.

[LG-14] The Effect of Perceptual Metrics on Music Representation Learning for Genre Classification

Link: https://arxiv.org/abs/2409.17069
Authors: Tashi Namgyal, Alexander Hepburn, Raul Santos-Rodriguez, Valero Laparra, Jesus Malo
Keywords: objective perceptual metrics, perceptual metrics, subjective quality, approximated with objective, natural signals
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: arXiv admin note: text overlap with arXiv:2312.03455

Abstract:The subjective quality of natural signals can be approximated with objective perceptual metrics. Designed to approximate the perceptual behaviour of human observers, perceptual metrics often reflect structures found in natural signals and neurological pathways. Models trained with perceptual metrics as loss functions can capture perceptually meaningful features from the structures held within these metrics. We demonstrate that using features extracted from autoencoders trained with perceptual losses can improve performance on music understanding tasks, i.e. genre classification, over using these metrics directly as distances when learning a classifier. This result suggests improved generalisation to novel signals when using perceptual metrics as loss functions for representation learning.

[LG-15] Benchmarking Domain Generalization Algorithms in Computational Pathology

Link: https://arxiv.org/abs/2409.17063
Authors: Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot
Keywords: shown immense promise, unseen data due, computational pathology, Deep learning, Deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: (none)

Abstract:Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.

[LG-16] DRIM: Learning Disentangled Representations from Incomplete Multimodal Healthcare Data

Link: https://arxiv.org/abs/2409.17055
Authors: Lucas Robinet, Ahmad Berjaoui, Ziad Kheil, Elizabeth Cohen-Jonathan Moyal
Keywords: Real-life medical data, advanced deep learning, deep learning models, learning models capable, Real-life medical
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: (none)

Abstract:Real-life medical data is often multimodal and incomplete, fueling the growing need for advanced deep learning models capable of integrating them efficiently. The use of diverse modalities, including histopathology slides, MRI, and genetic data, offers unprecedented opportunities to improve prognosis prediction and to unveil new treatment pathways. Contrastive learning, widely used for deriving representations from paired data in multimodal tasks, assumes that different views contain the same task-relevant information and leverages only shared information. This assumption becomes restrictive when handling medical data since each modality also harbors specific knowledge relevant to downstream tasks. We introduce DRIM, a new multimodal method for capturing these shared and unique representations, despite data sparsity. More specifically, given a set of modalities, we aim to encode a representation for each one that can be divided into two components: one encapsulating patient-related information common across modalities and the other, encapsulating modality-specific details. This is achieved by increasing the shared information among different patient modalities while minimizing the overlap between shared and unique components within each modality. Our method outperforms state-of-the-art algorithms on glioma patients survival prediction tasks, while being robust to missing modalities. To promote reproducibility, the code is made publicly available at this https URL

[LG-17] Predictive Covert Communication Against Multi-UAV Surveillance Using Graph Koopman Autoencoder

Link: https://arxiv.org/abs/2409.17048
Authors: Sivaram Krishnan, Jihong Park, Gregory Sherman, Benjamin Campbell, Jinho Choi
Keywords: Low Probability, radio frequency, signals to evade, achieving LPD communication, aims to obscure
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: (none)

Abstract: Low Probability of Detection (LPD) communication aims to obscure the presence of radio frequency (RF) signals to evade surveillance. In the context of mobile surveillance utilizing unmanned aerial vehicles (UAVs), achieving LPD communication presents significant challenges due to the UAVs’ rapid and continuous movements, which are characterized by unknown nonlinear dynamics. Therefore, accurately predicting future locations of UAVs is essential for enabling real-time LPD communication. In this paper, we introduce a novel framework termed predictive covert communication, aimed at minimizing detectability in terrestrial ad-hoc networks under multi-UAV surveillance. Our data-driven method synergistically integrates graph neural networks (GNN) with Koopman theory to model the complex interactions within a multi-UAV network and to facilitate long-term predictions by linearizing the dynamics, even with limited historical data. Extensive simulation results substantiate that the predicted trajectories using our method result in at least 63%-75% lower probability of detection when compared to well-known state-of-the-art baseline approaches, showing promise in enabling low-latency covert operations in practical scenarios.

[LG-18] How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not

Link: https://arxiv.org/abs/2409.17044
Authors: Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane
Keywords: Large Language Models, Large Language, Speech Foundational Model, driven research efforts, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The remarkable performance achieved by Large Language Models (LLM) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundational Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has moderate impact and depends on the SFM and LLM.

[LG-19] Counterfactual Token Generation in Large Language Models

Link: https://arxiv.org/abs/2409.17027
Authors: Ivi Chatzi, Nina Corvelo Benz, Eleni Straitouri, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Keywords: Maelstrom Fury, Captain Lyra stood, trusty ship, endless sea, Captain Lyra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:“Sure, I am happy to generate a story for you: Captain Lyra stood at the helm of her trusty ship, the Maelstrom’s Fury, gazing out at the endless sea. […] Lyra’s eyes welled up with tears as she realized the bitter truth - she had sacrificed everything for fleeting riches, and lost the love of her crew, her family, and herself.” Although this story, generated by a large language model, is captivating, one may wonder – how would the story have unfolded if the model had chosen “Captain Maeve” as the protagonist instead? We cannot know. State-of-the-art large language models are stateless – they maintain no internal memory or state. Given a prompt, they generate a sequence of tokens as an output using an autoregressive process. As a consequence, they cannot reason about counterfactual alternatives to tokens they have generated in the past. In this work, our goal is to enhance them with this functionality. To this end, we develop a causal model of token generation that builds upon the Gumbel-Max structural causal model. Our model allows any large language model to perform counterfactual token generation at almost no cost in comparison with vanilla token generation, it is embarrassingly simple to implement, and it does not require any fine-tuning nor prompt engineering. We implement our model on Llama 3 8B-instruct and conduct both qualitative and quantitative analyses of counterfactually generated text. We conclude with a demonstrative application of counterfactual token generation for bias detection, unveiling interesting insights about the model of the world constructed by large language models.
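
The Gumbel-Max idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the vocabulary, the logits, and the helper names `gumbel_noise` and `gumbel_max_sample` are hypothetical, and the sketch only shows the general trick of sampling a token as the argmax of logits plus Gumbel noise, then reusing the same noise under changed logits to obtain a counterfactual token.

```python
import math
import random


def gumbel_noise(rng, size):
    """Draw i.i.d. standard Gumbel noise, one value per vocabulary entry."""
    noise = []
    for _ in range(size):
        # Clamp away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))
    return noise


def gumbel_max_sample(logits, noise):
    """Return argmax_i (logits[i] + noise[i]), the Gumbel-Max sampling rule."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
vocab = ["Lyra", "Maeve", "Finn"]          # hypothetical toy vocabulary
factual_logits = [2.0, 1.0, 0.5]           # made-up next-token logits
noise = gumbel_noise(rng, len(vocab))      # exogenous noise, stored for reuse
factual = gumbel_max_sample(factual_logits, noise)

# Counterfactual: the logits change (for example because an earlier token
# was altered), but the stored noise is reused instead of being resampled.
counterfactual_logits = [0.5, 1.0, 2.0]
counterfactual = gumbel_max_sample(counterfactual_logits, noise)
print(vocab[factual], vocab[counterfactual])
```

Because the noise is held fixed, replaying the factual logits always reproduces the factual token; only a change in the logits can change the outcome.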

[LG-20] CombU: A Combined Unit Activation for Fitting Mathematical Expressions with Neural Networks

Link: https://arxiv.org/abs/2409.17021
Authors: Jiayu Li, Zilong Zhao, Kevin Yee, Uzair Javaid, Biplab Sikdar
Keywords: complex data relations, approximate complex data, enabling deep networks, data relationships, data relations
Categories: Machine Learning (cs.LG)

Abstract:The activation functions are fundamental to neural networks as they introduce non-linearity into data relationships, thereby enabling deep networks to approximate complex data relations. Existing efforts to enhance neural network performance have predominantly focused on developing new mathematical functions. However, we find that a well-designed combination of existing activation functions within a neural network can also achieve this objective. In this paper, we introduce the Combined Units activation (CombU), which employs different activation functions at various dimensions across different layers. This approach can be theoretically proven to fit most mathematical expressions accurately. The experiments conducted on four mathematical expression datasets, compared against six State-Of-The-Art (SOTA) activation function algorithms, demonstrate that CombU outperforms all SOTA algorithms in 10 out of 16 metrics and ranks in the top three for the remaining six metrics.

[LG-21] CNN Mixture-of-Depths ACCV

Link: https://arxiv.org/abs/2409.17016
Authors: Rinor Cakaj, Jens Mehnert, Bin Yang
Keywords: Convolutional Neural Networks, Neural Networks, Convolutional Neural, selectively processing channels, processing channels based
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Conference Paper of the Asian Conference on Computer Vision (ACCV) 2024

Abstract:We introduce Mixture-of-Depths (MoD) for Convolutional Neural Networks (CNNs), a novel approach that enhances the computational efficiency of CNNs by selectively processing channels based on their relevance to the current prediction. This method optimizes computational resources by dynamically selecting key channels in feature maps for focused processing within the convolutional blocks (Conv-Blocks), while skipping less relevant channels. Unlike conditional computation methods that require dynamic computation graphs, CNN MoD uses a static computation graph with fixed tensor sizes, which improves hardware efficiency. It speeds up the training and inference processes without the need for customized CUDA kernels, unique loss functions, or finetuning. CNN MoD either matches the performance of traditional CNNs with reduced inference times, GMACs, and parameters, or exceeds their performance while maintaining similar inference times, GMACs, and parameters. For example, on ImageNet, ResNet86-MoD exceeds the performance of the standard ResNet50 by 0.45% with a 6% speedup on CPU and 5% on GPU. Moreover, ResNet75-MoD achieves the same performance as ResNet50 with a 25% speedup on CPU and 15% on GPU.

[LG-22] INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

Link: https://arxiv.org/abs/2409.16997
Authors: Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Lei Su, Tong Yang
Keywords: self-attention module faces, large language models, language models, self-attention module, sequence length
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats like INT4, etc. Experimental results show INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared to standard FlashAttention with FP16 and FP8 data formats.
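
To make "token-level" INT8 quantization concrete, here is a minimal sketch of generic symmetric quantization with one scale per token (row). It is an illustration under stated assumptions, not the INT-FlashAttention kernel: the helper names `quantize_per_token` and `dequantize`, the use of the range [-127, 127], and the toy activations are all assumptions made for this example.

```python
def quantize_per_token(matrix):
    """Symmetric INT8 quantization with one scale per row (token)."""
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # An all-zero row gets scale 1.0 to avoid division by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map INT8 codes back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two toy "tokens" with very different magnitudes (made-up values)
tokens = [[0.1, -0.5, 0.25], [8.0, -2.0, 4.0]]
q, s = quantize_per_token(tokens)
restored = dequantize(q, s)
max_err = max(abs(a - b)
              for row_a, row_b in zip(tokens, restored)
              for a, b in zip(row_a, row_b))
print(q, max_err)
```

Giving each token its own scale keeps the small-magnitude row from being crushed by the large-magnitude one, which is the usual motivation for per-token rather than per-tensor scales.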

[LG-23] What is the relationship between Slow Feature Analysis and the Successor Representation?

Link: https://arxiv.org/abs/2409.16991
Authors: Eddie Seabrook, Laurenz Wiskott
Keywords: slow feature analysis, progress. Feedback, feature analysis, successor representation, analytical comparison
Categories: Machine Learning (cs.LG)
Comments: 52 pages, 5 figures

Abstract:(This is a work in progress. Feedback is welcome) An analytical comparison is made between slow feature analysis (SFA) and the successor representation (SR). While SFA and the SR stem from distinct areas of machine learning, they share important properties, both in terms of their mathematics and the types of information they are sensitive to. This work studies their connection along these two axes. In particular, multiple variants of the SFA algorithm are explored analytically and then applied to the setting of an MDP, leading to a family of eigenvalue problems involving the SR and other related quantities. These resulting eigenvalue problems are then illustrated in the toy setting of a gridworld, where it is demonstrated that the place- and grid-like fields often associated to the SR can equally be generated using SFA.
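
For readers unfamiliar with the successor representation, the following sketch computes it for a tiny Markov chain. It uses the standard definition M = sum over t of gamma^t P^t, which equals (I - gamma P)^(-1), where P is the state-transition matrix under a fixed policy. The two-state chain and the function name `successor_representation` are hypothetical, and the sketch does not reproduce the paper's SFA analysis.

```python
def successor_representation(P, gamma, iterations=200):
    """Approximate M = sum_t gamma^t P^t by iterating M <- I + gamma * P @ M."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iterations):
        M = [[identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)]
             for i in range(n)]
    return M


# Two states that deterministically swap under the (hypothetical) policy
P = [[0.0, 1.0], [1.0, 0.0]]
M = successor_representation(P, gamma=0.5)
print(M)
```

Each row of M sums to 1 / (1 - gamma), since it counts the total discounted time spent across all states.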

[LG-24] Towards User-Focused Research in Training Data Attribution for Human-Centered Explainable AI

Link: https://arxiv.org/abs/2409.16978
Authors: Elisa Nguyen, Johannes Bertram, Evgenii Kortukov, Jean Y. Song, Seong Joon Oh
Keywords: Training Data Attribution, aims to make, make AI understandable, criticised for relying, mathematical soundness
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:While Explainable AI (XAI) aims to make AI understandable and useful to humans, it has been criticised for relying too much on formalism and solutionism, focusing more on mathematical soundness than user needs. We propose an alternative to this bottom-up approach inspired by design thinking: the XAI research community should adopt a top-down, user-focused perspective to ensure user relevance. We illustrate this with a relatively young subfield of XAI, Training Data Attribution (TDA). With the surge in TDA research and growing competition, the field risks repeating the same patterns of solutionism. We conducted a needfinding study with a diverse group of AI practitioners to identify potential user needs related to TDA. Through interviews (N=10) and a systematic survey (N=31), we uncovered new TDA tasks that are currently largely overlooked. We invite the TDA and XAI communities to consider these novel tasks and improve the user relevance of their research outcomes.

[LG-25] Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

Link: https://arxiv.org/abs/2409.16973
Authors: Rafael Mendoza, Isabella Cruz, Richard Liu, Aarav Deshmukh, David Williams, Jesscia Peng, Rohan Iyer
Keywords: Large language models, Large language, interact with technology, significant challenge, individual user preferences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: First ASLS

Abstract:Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.

[LG-26] Bridge to Real Environment with Hardware-in-the-loop for Wireless Artificial Intelligence Paradigms

Link: https://arxiv.org/abs/2409.16968
Authors: Jeffrey Redondo, Nauman Aslam, Juan Zhang, Zhenhui Yuan
Keywords: Vehicular Adhoc Network, Adhoc Network, Vehicular Adhoc, machine learning, wireless standard
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)

Abstract:Many machine learning (ML) solutions for improving the wireless standard IEEE802.11p for Vehicular Adhoc Network (VANET) are commonly evaluated in the simulated world. Although this approach is cost-effective compared to real-world testing, given the high cost of vehicles, there is a risk of unexpected outcomes when these solutions are implemented in the real world, potentially leading to wasted resources. To mitigate this risk, hardware-in-the-loop testing is the way forward, as it makes it possible to test in the real and simulated worlds together. We have therefore developed what we believe is the first hardware-in-the-loop setup for testing artificial intelligence, multiple services, and HD map data (LiDAR) in both simulated and real-world settings.

[LG-27] ABCFair: an Adaptable Benchmark approach for Comparing Fairness Methods

Link: https://arxiv.org/abs/2409.16965
Authors: MaryBeth Defrance, Maarten Buyl, Tijl De Bie
Keywords: Numerous methods, sensitive features, implemented that pursue, mitigating biases, machine learning
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Numerous methods have been implemented that pursue fairness with respect to sensitive features by mitigating biases in machine learning. Yet, the problem settings that each method tackles vary significantly, including the stage of intervention, the composition of sensitive features, the fairness notion, and the distribution of the output. Even in binary classification, these subtle differences make it highly complicated to benchmark fairness methods, as their performance can strongly depend on exactly how the bias mitigation problem was originally framed. Hence, we introduce ABCFair, a benchmark approach which allows adapting to the desiderata of the real-world problem setting, enabling proper comparability between methods for any use case. We apply ABCFair to a range of pre-, in-, and postprocessing methods on both large-scale, traditional datasets and on a dual label (biased and unbiased) dataset to sidestep the fairness-accuracy trade-off.

[LG-28] Informed deep hierarchical classification: a non-standard analysis inspired approach

Link: https://arxiv.org/abs/2409.16956
Authors: Lorenzo Fiaschi, Marco Cococcioni
Keywords: rigid parent-child structure, multiple labels organized, deep neural network, parent-child structure, work proposes
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic (math.LO)

Abstract:This work proposes a novel approach to the deep hierarchical classification task, i.e., the problem of classifying data according to multiple labels organized in a rigid parent-child structure. It consists in a multi-output deep neural network equipped with specific projection operators placed before each output layer. The design of such an architecture, called lexicographic hybrid deep neural network (LH-DNN), has been possible by combining tools from different and quite distant research fields: lexicographic multi-objective optimization, non-standard analysis, and deep learning. To assess the efficacy of the approach, the resulting network is compared against the B-CNN, a convolutional neural network tailored for hierarchical classification tasks, on the CIFAR10, CIFAR100 (where it has been originally and recently proposed before being adopted and tuned for multiple real-world applications) and Fashion-MNIST benchmarks. Evidence states that an LH-DNN can achieve comparable if not superior performance, especially in the learning of the hierarchical relations, in the face of a drastic reduction of the learning parameters, training epochs, and computational time, without the need for ad-hoc loss functions weighting values.

[LG-29] Dynamic Obstacle Avoidance through Uncertainty-Based Adaptive Planning with Diffusion

Link: https://arxiv.org/abs/2409.16950
Authors: Vineet Punyamoorty, Pascal Jutras-Dubé, Ruqi Zhang, Vaneet Aggarwal, Damon Conover, Aniket Bera
Keywords: framing reinforcement learning, sequence modeling problem, modeling problem, recent work, framing reinforcement
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While replanning at every timestep could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences – a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts replanning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant replanning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and a 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates and an improved ability to navigate the environment safely.

[LG-30] Decomposition of Equivariant Maps via Invariant Maps: Application to Universal Approximation under Symmetry

Link: https://arxiv.org/abs/2409.16922
Authors: Akiyoshi Sannai, Yuuki Takai, Matthieu Cordonnier
Keywords: equivariant, maps, equivariant maps, invariant, invariant maps
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we develop a theory about the relationship between invariant and equivariant maps with regard to a group G. We then leverage this theory in the context of deep neural networks with group symmetries in order to obtain novel insight into their mechanisms. More precisely, we establish a one-to-one relationship between equivariant maps and certain invariant maps. This allows us to reduce arguments for equivariant maps to those for invariant maps and vice versa. As an application, we propose a construction of universal equivariant architectures built from universal invariant networks. We, in turn, explain how the universal architectures arising from our construction differ from standard equivariant architectures known to be universal. Furthermore, we explore the complexity, in terms of the number of free parameters, of our models, and discuss the relation between invariant and equivariant networks’ complexity. Finally, we also give an approximation rate for G-equivariant deep neural networks with ReLU activation functions for a finite group G.

[LG-31] Discriminative Anchor Learning for Efficient Multi-view Clustering

Link: https://arxiv.org/abs/2409.16904
Authors: Yalan Qin, Nan Pu, Hanzhou Wu, Nicu Sebe
Keywords: Multi-view clustering aims, anchor graph, shared anchor graph, underlying structure, aims to study
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been accepted by TMM

Abstract:Multi-view clustering aims to study the complementary information across views and discover the underlying structure. To reduce the relatively high computational cost of existing approaches, anchor-based methods have recently been proposed. Even with acceptable clustering performance, these methods tend to map the original representation from multiple views into a fixed shared graph based on the original dataset. However, most studies ignore the discriminative property of the learned anchors, which harms the representation capability of the resulting model. Moreover, simply learning the shared anchor graph without considering the quality of view-specific anchors fails to ensure the complementary information among anchors across views. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) to address these issues. We learn discriminative view-specific feature representations according to the original dataset and build anchors from different views based on these representations, which increases the quality of the shared anchor graph. The discriminative feature learning and consensus anchor graph construction are integrated into a unified framework so that they refine each other. The optimal anchors from multiple views and the consensus anchor graph are learned with orthogonal constraints. We give an iterative algorithm to solve the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.

[LG-32] Revisiting Space Mission Planning: A Reinforcement Learning-Guided Approach for Multi-Debris Rendezvous

Link: https://arxiv.org/abs/2409.16882
Authors: Agni Bandyopadhyay, Guenther Waxenegger-Wilfing
Keywords: Proximal Policy Optimization, masked Proximal Policy, masked Proximal, Policy Optimization, Proximal Policy
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted for publication at the 2024 International Conference on Space Robotics (iSpaRo)

Abstract:This research introduces a novel application of a masked Proximal Policy Optimization (PPO) algorithm from the field of deep reinforcement learning (RL), for determining the most efficient sequence of space debris visitation, utilizing the Lambert solver as per Izzo’s adaptation for individual rendezvous. The aim is to optimize the sequence in which all the given debris should be visited to get the least total time for rendezvous for the entire mission. A neural network (NN) policy is developed, trained on simulated space missions with varying debris fields. After training, the neural network calculates approximately optimal paths using Izzo’s adaptation of Lambert maneuvers. Performance is evaluated against standard heuristics in mission planning. The reinforcement learning approach demonstrates a significant improvement in planning efficiency by optimizing the sequence for debris rendezvous, reducing the total mission time by an average of approximately 10.96% and 13.66% compared to the Genetic and Greedy algorithms, respectively. The model on average identifies the most time-efficient sequence for debris visitation across various simulated scenarios with the fastest computational speed. This approach signifies a step forward in enhancing mission planning strategies for space debris clearance.

[LG-33] Feedforward Controllers from Learned Dynamic Local Model Networks with Application to Excavator Assistance Functions

Link: https://arxiv.org/abs/2409.16875
Authors: Leon Greiser, Ozan Demir, Benjamin Hartmann, Henrik Hose, Sebastian Trimpe
Keywords: Complicated first principles, expensive for high-mix, low-volume products, prohibitively slow, slow and expensive
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Complicated first principles modelling and controller synthesis can be prohibitively slow and expensive for high-mix, low-volume products such as hydraulic excavators. Instead, in a data-driven approach, recorded trajectories from the real system can be used to train local model networks (LMNs), for which feedforward controllers are derived via feedback linearization. However, previous works required LMNs without zero dynamics for feedback linearization, which restricts the model structure and thus modelling capacity of LMNs. In this paper, we overcome this restriction by providing a criterion for when feedback linearization of LMNs with zero dynamics yields a valid controller. As a criterion we propose the bounded-input bounded-output stability of the resulting controller. In two additional contributions, we extend this approach to consider measured disturbance signals and multiple inputs and outputs. We illustrate the effectiveness of our contributions in a hydraulic excavator control application with hardware experiments. To this end, we train LMNs from recorded, noisy data and derive feedforward controllers used as part of a leveling assistance system on the excavator. In our experiments, incorporating disturbance signals and multiple inputs and outputs enhances tracking performance of the learned controller. A video of our experiments is available at this https URL.

[LG-34] Ethical and Scalable Automation: A Governance and Compliance Framework for Business Applications

Link: https://arxiv.org/abs/2409.16872
Authors: Haocheng Lin
Keywords: poses significant challenges, significant challenges relating, businesses poses significant, legal compliance, popularisation of applying
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The popularisation of applying AI in businesses poses significant challenges relating to ethical principles, governance, and legal compliance. Although businesses have embedded AI into their day-to-day processes, they lack a unified approach for mitigating its potential risks. This paper introduces a framework for ensuring that AI is ethical, controllable, viable, and desirable. Balancing these factors ensures the design of a framework that addresses its trade-offs, such as balancing performance against explainability. A successful framework provides practical advice for businesses to meet regulatory requirements in sectors such as finance and healthcare, where it is critical to comply with standards like GDPR and the EU AI Act. Different case studies validate this framework by integrating AI in both academic and practical environments. For instance, large language models are cost-effective alternatives for generating synthetic opinions that emulate attitudes to environmental issues. These case studies demonstrate how having a structured framework could enhance transparency and maintain performance levels, as shown by the alignment between synthetic and expected distributions. This alignment is quantified using metrics like Chi-test scores, normalized mutual information, and Jaccard indexes. Future research should further explore the framework’s empirical validation in diverse industrial settings, ensuring the model’s scalability and adaptability.

[LG-35] Quantifying Visual Properties of GAM Shape Plots: Impact on Perceived Cognitive Load and Interpretability

Link: https://arxiv.org/abs/2409.16870
Authors: Sven Kruschel, Lasse Bohlen, Julian Rosenberger, Patrick Zschech, Mathias Kraus
Keywords: Generalized Additive Models, Generalized Additive, Additive Models, offer a balance, machine learning
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: to be published in proceedings of the 58th Hawaii International Conference on System Sciences (HICSS)

Abstract:Generalized Additive Models (GAMs) offer a balance between performance and interpretability in machine learning. The interpretability aspect of GAMs is expressed through shape plots, representing the model’s decision-making process. However, the visual properties of these plots, e.g. number of kinks (number of local maxima and minima), can impact their complexity and the cognitive load imposed on the viewer, compromising interpretability. Our study, including 57 participants, investigates the relationship between the visual properties of GAM shape plots and cognitive load they induce. We quantify various visual properties of shape plots and evaluate their alignment with participants’ perceived cognitive load, based on 144 plots. Our results indicate that the number of kinks metric is the most effective, explaining 86.4% of the variance in users’ ratings. We develop a simple model based on number of kinks that provides a practical tool for predicting cognitive load, enabling the assessment of one aspect of GAM interpretability without direct user involvement.
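
The "number of kinks" property, which the abstract describes as the number of local maxima and minima, can be sketched for a sampled shape function as follows. The function name `count_kinks`, the treatment of plateaus, and the sample values are assumptions made for illustration; the paper's exact metric (for example any smoothing or thresholding) may differ.

```python
def count_kinks(values):
    """Count interior local maxima and minima of a sampled shape function."""
    # Keep only nonzero slopes so that a plateau neither hides nor
    # double-counts a turning point.
    slopes = [b - a for a, b in zip(values, values[1:]) if b != a]
    # A kink is a change of slope sign between consecutive segments.
    return sum(1 for s0, s1 in zip(slopes, slopes[1:]) if (s0 > 0) != (s1 > 0))


# Made-up shape function: rises, falls through a plateau, then rises again
shape = [0.0, 1.0, 2.0, 1.0, 1.0, 0.5, 1.5, 3.0]
print(count_kinks(shape))
```

A monotone shape function has zero kinks under this definition, which matches the intuition that it is the easiest kind of plot to read.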

[LG-36] Risk-averse learning with delayed feedback

Link: https://arxiv.org/abs/2409.16866
Authors: Siyi Wang, Zifan Wang, Karl Henrik Johansson, Sandra Hirche
Keywords: risk-averse learning, risk-averse learning algorithm, two-point risk-averse learning, manifest immediately, impacts of decisions
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:In real-world scenarios, the impacts of decisions may not manifest immediately. Taking these delays into account facilitates accurate assessment and management of risk in real-world environments, thereby ensuring the efficacy of strategies. In this paper, we investigate risk-averse learning using Conditional Value at Risk (CVaR) as risk measure, while incorporating delayed feedback with unknown but bounded delays. We develop two risk-averse learning algorithms that rely on one-point and two-point zeroth-order optimization approaches, respectively. The regret achieved by the algorithms is analyzed in terms of the cumulative delay and the number of total samplings. The results suggest that the two-point risk-averse learning achieves a smaller regret bound than the one-point algorithm. Furthermore, the one-point risk-averse learning algorithm attains sublinear regret under certain delay conditions, and the two-point risk-averse learning algorithm achieves sublinear regret with minimal restrictions on the delay. We provide numerical experiments on a dynamic pricing problem to demonstrate the performance of the proposed algorithms.
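
As background for the risk measure named in the abstract, here is a minimal sketch of an empirical CVaR estimate. It assumes the common convention that CVaR at level alpha is the average of the worst (1 - alpha) fraction of sampled losses; the function name `empirical_cvar` and the sample losses are hypothetical, and the sketch covers neither the delayed feedback nor the zeroth-order optimization studied in the paper.

```python
import math


def empirical_cvar(losses, alpha):
    """Average of the worst (1 - alpha) fraction of sampled losses."""
    if not losses or not 0.0 <= alpha < 1.0:
        raise ValueError("need at least one loss and 0 <= alpha < 1")
    # Always keep at least one sample in the tail.
    tail_size = max(1, math.ceil((1.0 - alpha) * len(losses)))
    worst = sorted(losses, reverse=True)[:tail_size]
    return sum(worst) / tail_size


losses = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up loss samples
print(empirical_cvar(losses, 0.0), empirical_cvar(losses, 0.5))
```

At alpha = 0 the estimate reduces to the plain mean, and it grows as alpha increases because only the largest losses remain in the tail. A risk-averse learner minimizes this tail average instead of the mean.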

[LG-37] Demo2Vec: Learning Region Embedding with Demographic Information

Link: https://arxiv.org/abs/2409.16837
Authors: Ya Wen, Yulun Zhou
Keywords: integrated demographic information, generate region embedding, education level, region embedding, studies have integrated
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Demographic data, such as income, education level, and employment rate, contain valuable information about urban regions, yet few studies have integrated demographic information to generate region embedding. In this study, we show how the simple and easy-to-access demographic data can improve the quality of state-of-the-art region embedding and provide better predictive performances in urban areas across three common urban tasks, namely check-in prediction, crime rate prediction, and house price prediction. We find that existing pre-train methods based on KL divergence are potentially biased towards mobility information and propose to use Jensen-Shannon divergence as a more appropriate loss function for multi-view representation learning. Experimental results from both New York and Chicago show that mobility + income is the best pre-train data combination, providing up to 10.22% better predictive performances than existing models. Considering that mobility big data is hard to access in many developing cities, we suggest geographic proximity + income to be a simple but effective data combination for region embedding pre-training.
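
The difference between the two divergences mentioned in the abstract can be seen in a small sketch using their standard definitions. The distributions `p` and `q` and the helper names are made up for illustration, and this is not the paper's loss function: it only shows that KL divergence is asymmetric, whereas Jensen-Shannon divergence is symmetric and bounded by log 2.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero.

    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    # The midpoint is positive wherever p or q is, so both KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.7, 0.2, 0.1]  # made-up distribution for one view
q = [0.1, 0.3, 0.6]  # made-up distribution for another view
print(kl_divergence(p, q), kl_divergence(q, p), js_divergence(p, q))
```

Because the JS value does not depend on which view is treated as the reference, neither view is privileged, which is one plausible reading of why a symmetric loss suits multi-view learning.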

[LG-38] Asynchronous Fractional Multi-Agent Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing

Link: https://arxiv.org/abs/2409.16832
Authors: Lyudong Jin, Ming Tang, Jiayu Pan, Meng Zhang, Hao Wang
Keywords: Age of Information, emerging real-time networked, real-time networked applications, realm of emerging, emerging real-time
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:In the realm of emerging real-time networked applications like cyber-physical systems (CPS), the Age of Information (AoI) has emerged as a pivotal metric for evaluating timeliness. To meet high computational demands, such as those in intelligent manufacturing within CPS, mobile edge computing (MEC) presents a promising solution for optimizing computing and reducing AoI. In this work, we study the timeliness of computation-intensive updates and explore jointly optimizing the task updating and offloading policies to minimize AoI. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The fractional objective introduced by AoI and the semi-Markov game nature of the problem render this challenge particularly difficult, with existing approaches not directly applicable. To this end, we present a comprehensive framework for fractional reinforcement learning (RL). We first introduce a fractional single-agent RL framework and prove its linear convergence. We then extend this to a fractional multi-agent RL framework with a convergence analysis. To tackle the challenge of asynchronous control in a semi-Markov game, we further design an asynchronous model-free fractional multi-agent RL algorithm, where each device makes scheduling decisions with a hybrid action space without knowing the system dynamics and the decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 52.6% compared with the best baseline algorithm in our experiments.
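
To see why a time-average AoI objective is "fractional", it helps to compute one by hand. The sketch below uses a common sawtooth model of age (age grows linearly and drops to the update's delay on delivery); this model, the function name `time_average_aoi`, and the toy update list are assumptions for illustration, not the paper's system model.

```python
def time_average_aoi(updates):
    """Time-average age over the span between first and last delivery.

    updates: list of (generation_time, delivery_time), sorted by delivery time.
    """
    if len(updates) < 2:
        raise ValueError("need at least two delivered updates")
    area = 0.0
    for (gen, delivered), (_, next_delivered) in zip(updates, updates[1:]):
        age_start = delivered - gen      # age right after this delivery
        age_end = next_delivered - gen   # age just before the next delivery
        # Trapezoid under the linearly growing age curve.
        area += 0.5 * (age_start + age_end) * (next_delivered - delivered)
    horizon = updates[-1][1] - updates[0][1]
    # Ratio of accumulated age to elapsed time: numerator and denominator
    # both depend on the scheduling decisions.
    return area / horizon


print(time_average_aoi([(0, 1), (2, 3), (4, 5)]))
```

Since both the accumulated age and the elapsed time change with the policy, the objective is a ratio rather than a sum of per-step rewards, which is the difficulty the abstract refers to.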

[LG-39] Learning phase-space flows using time-discrete implicit Runge-Kutta PINNs

Link: https://arxiv.org/abs/2409.16826
Authors: Álvaro Fernández Corral, Nicolás Mendoza, Armin Iske, Andrey Yachmenev, Jochen Küpper
Keywords: Informed Neural Networks, implicit Runge-Kutta Physics, Informed Neural, Neural Networks, obtaining multidimensional phase-space
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments: 10 pages, 4 figures, published in the International Conference on Scientific Computing and Machine Learning, see this http URL

Abstract:We present a computational framework for obtaining multidimensional phase-space solutions of systems of non-linear coupled differential equations, using high-order implicit Runge-Kutta Physics-Informed Neural Network (IRK-PINN) schemes. Building upon foundational work originally solving differential equations for fields depending on coordinates [J. Comput. Phys. 378, 686 (2019)], we adapt the scheme to a context where the coordinates are treated as functions. This modification enables us to efficiently solve equations of motion for a particle in an external field. Our scheme is particularly useful for explicitly time-independent and periodic fields. We apply this approach to successfully solve the equations of motion for a mass particle placed in a central force field and a charged particle in a periodic electric field.
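
As background on what an implicit Runge-Kutta step looks like, the sketch below applies the implicit midpoint rule (the one-stage Gauss-Legendre scheme) to a harmonic oscillator, where the implicit equations can be solved in closed form. This is a classical integrator, not the PINN-based scheme of the paper; the test system, step size, and function names are assumptions chosen for illustration.

```python
def implicit_midpoint_step(q, p, h):
    """One implicit-midpoint step for q' = p, p' = -q.

    The implicit equations
        q1 = q0 + h * (p0 + p1) / 2
        p1 = p0 - h * (q0 + q1) / 2
    are linear here, so they are solved exactly instead of iteratively.
    """
    a = h / 2.0
    denom = 1.0 + a * a
    q_next = ((1.0 - a * a) * q + 2.0 * a * p) / denom
    p_next = ((1.0 - a * a) * p - 2.0 * a * q) / denom
    return q_next, p_next


def energy(q, p):
    return 0.5 * (q * q + p * p)


q, p = 1.0, 0.0
e0 = energy(q, p)
for _ in range(1000):
    q, p = implicit_midpoint_step(q, p, 0.1)
print(abs(energy(q, p) - e0))  # energy drift stays at round-off level
```

For nonlinear force fields the stage equations no longer have a closed-form solution, which is where an iterative solver, or a neural network as in the abstract, comes in.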

[LG-40] Uncertainty Representations in State-Space Layers for Deep Reinforcement Learning under Partial Observability

Link: https://arxiv.org/abs/2409.16824
Authors: Carlos E. Luis, Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters
Keywords: Kalman filter, environment hidden state, partial observability requires, Kalman filter layer, Optimal decision-making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Optimal decision-making under partial observability requires reasoning about the uncertainty of the environment’s hidden state. However, most reinforcement learning architectures handle partial observability with sequence models that have no internal mechanism to incorporate uncertainty in their hidden state representation, such as recurrent neural networks, deterministic state-space models and transformers. Inspired by advances in probabilistic world models for reinforcement learning, we propose a standalone Kalman filter layer that performs closed-form Gaussian inference in linear state-space models and train it end-to-end within a model-free architecture to maximize returns. Similar to efficient linear recurrent layers, the Kalman filter layer processes sequential data using a parallel scan, which scales logarithmically with the sequence length. By design, Kalman filter layers are a drop-in replacement for other recurrent layers in standard model-free architectures, but importantly they include an explicit mechanism for probabilistic filtering of the latent state representation. Experiments in a wide variety of tasks with partial observability show that Kalman filter layers excel in problems where uncertainty reasoning is key for decision-making, outperforming other stateful models.
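
The closed-form Gaussian inference mentioned in this abstract is the standard Kalman predict/update recursion. Below is a minimal scalar sketch of that recursion, run sequentially rather than with the parallel scan the paper describes; the function name `kalman_step` and the default parameters `a`, `q`, `r` are illustrative assumptions.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.0, r=1.0):
    """One predict/update step of a scalar Kalman filter.

    Model (assumed for illustration): x' = a * x + N(0, q), obs = x' + N(0, r).
    """
    # Predict: push the Gaussian belief through the linear dynamics.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Update: condition on the observation in closed form.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (obs - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


mean, var = 0.0, 1.0
for obs in [2.0, 2.0, 2.0]:
    mean, var = kalman_step(mean, var, obs)
    print(mean, var)  # variance shrinks as evidence accumulates
```

The variance term is the explicit uncertainty signal that deterministic recurrent layers lack: a downstream policy can read it to tell a confident state estimate from a guess.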

[LG-41] A parametric framework for kernel-based dynamic mode decomposition using deep learning

Link: https://arxiv.org/abs/2409.16817
Authors: Konstantinos Kevopoulos, Dongwei Ye
Keywords: Surrogate modelling, many-query scenarios, design optimisation, large-scale computational models, modelling is widely
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Surrogate modelling is widely applied in computational science and engineering to mitigate computational efficiency issues for the real-time simulations of complex and large-scale computational models or for many-query scenarios, such as uncertainty quantification and design optimisation. In this work, we propose a parametric framework for the kernel-based dynamic mode decomposition method based on the linear and nonlinear disambiguation optimization (LANDO) algorithm. The proposed parametric framework consists of two stages, offline and online. The offline stage prepares the essential component for prediction, namely a series of LANDO models that emulate the dynamics of the system with particular parameters from a training dataset. The online stage leverages those LANDO models to generate new data at a desired time instant, and approximates the mapping between parameters and the state with the data using deep learning techniques. Moreover, a dimensionality reduction technique is applied to high-dimensional dynamical systems to reduce the computational cost of training. Three numerical examples, including the Lotka-Volterra model, the heat equation and the reaction-diffusion equation, are presented to demonstrate the efficiency and effectiveness of the proposed framework.

[LG-42] Accelerating TinyML Inference on Microcontrollers through Approximate Kernels

Link: https://arxiv.org/abs/2409.16815
Authors: Giorgos Armeniakos, Georgios Mentzos, Dimitrios Soudris
Keywords: microcontroller-based IoT devices, Tiny Machine Learning, numerous applications, personalized healthcare, rapid growth
Categories: Machine Learning (cs.LG)

Abstract:The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework firstly unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto optimal solutions can feature on average 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements, the corresponding reduction becomes even more pronounced.
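
The idea of skipping computations based on operand significance can be sketched with a single dot product. The abstract does not say how significance is computed, so the sketch simply assumes weight magnitude as a stand-in; the function name `skipping_dot`, the threshold, and the toy values are all hypothetical.

```python
def skipping_dot(weights, inputs, threshold):
    """Dot product that skips multiply-accumulates for low-significance weights.

    Significance is ASSUMED to be |weight| here; the paper determines it offline
    by its own criterion. Returns (approximate result, number of skipped MACs).
    """
    kept = [(w, x) for w, x in zip(weights, inputs) if abs(w) >= threshold]
    result = sum(w * x for w, x in kept)
    return result, len(weights) - len(kept)


weights = [0.5, 0.01, -0.3]
inputs = [2.0, 1.0, 1.0]
print(skipping_dot(weights, inputs, 0.0))   # exact: no operand skipped
print(skipping_dot(weights, inputs, 0.05))  # approximate: one MAC skipped
```

On a microcontroller the skipped set would be fixed offline and baked into the kernel, so the saving comes from never emitting those instructions rather than from a run-time check.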

[LG-43] Large Language Model Predicts Above Normal All India Summer Monsoon Rainfall in 2024

Link: https://arxiv.org/abs/2409.16799
Authors: Ujjawal Sharma, Madhav Biyani, Akhil Dev Suresh, Debi Prasad Bhuyan, Saroj Kanta Mishra, Tanmoy Chakraborty
Keywords: India Summer Monsoon, India Summer, Summer Monsoon Rainfall, Reliable prediction, Summer Monsoon
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 3 figures

Abstract:Reliable prediction of the All India Summer Monsoon Rainfall (AISMR) is pivotal for informed policymaking for the country, impacting the lives of billions of people. However, accurate simulation of AISMR has been a persistent challenge due to the complex interplay of various multi-scale factors and the inherent variability of the monsoon system. This research focuses on adapting and fine-tuning the latest LLM model, PatchTST, to accurately predict AISMR with a lead time of three months. The fine-tuned PatchTST model, trained with historical AISMR data, the Niño3.4 index, and categorical Indian Ocean Dipole values, outperforms several popular neural network models and statistical models. This fine-tuned LLM model exhibits an exceptionally low RMSE percentage of 0.07% and a Spearman correlation of 0.976. This is particularly impressive, since it is nearly 80% more accurate than the best-performing NN models. The model predicts an above-normal monsoon for the year 2024, with an accumulated rainfall of 921.6 mm in the months of June-September for the entire country.

[LG-44] Scalable Ensemble Diversification for OOD Generalization and Detection

Link: https://arxiv.org/abs/2409.16797
Authors: Alexander Rubinstein, Luca Scimeca, Damien Teney, Seong Joon Oh
Keywords: Bayesian principles, OOD, OOD samples, practical applications, providing candidates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Training a diverse ensemble of models has several practical applications such as providing candidates for model selection with better out-of-distribution (OOD) generalization, and enabling the detection of OOD samples via Bayesian principles. An existing approach to diverse ensemble training encourages the models to disagree on provided OOD samples. However, the approach is computationally expensive and it requires well-separated ID and OOD examples, such that it has only been demonstrated in small-scale settings. Method. This work presents a method for Scalable Ensemble Diversification (SED) applicable to large-scale settings (e.g. ImageNet) that does not require OOD samples. Instead, SED identifies hard training samples on the fly and encourages the ensemble members to disagree on these. To improve scaling, we show how to avoid the expensive computations in existing methods of exhaustive pairwise disagreements across models. Results. We evaluate the benefits of diversification with experiments on ImageNet. First, for OOD generalization, we observe large benefits from the diversification in multiple settings including output-space (classical) ensembles and weight-space ensembles (model soups). Second, for OOD detection, we turn the diversity of ensemble hypotheses into a novel uncertainty score estimator that surpasses a large number of OOD detection baselines. Code is available here: this https URL.

[LG-45] Symbolic State Partition for Reinforcement Learning

Link: https://arxiv.org/abs/2409.16791
Authors: Mohsen Ghaffari, Mahsa Varshosaz, Einar Broch Johnsen, Andrzej Wąsowski
Keywords: Tabular reinforcement learning, Tabular reinforcement, state space, methods cannot operate, operate directly
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Tabular reinforcement learning methods cannot operate directly on continuous state spaces. One solution for this problem is to partition the state space. A good partitioning enables generalization during learning and more efficient exploitation of prior experiences. Consequently, the learning process becomes faster and produces more reliable policies. However, partitioning introduces approximation, which is particularly harmful in the presence of nonlinear relations between state components. An ideal partition should be as coarse as possible, while capturing the key structure of the state space for the given problem. This work extracts partitions from the environment dynamics by symbolic execution. We show that symbolic partitioning improves state space coverage with respect to environmental behavior and allows reinforcement learning to perform better for sparse rewards. We evaluate symbolic state space partitioning with respect to precision, scalability, learning agent performance and state space coverage for the learnt policies.
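
For context, the simplest way to partition a continuous state space for a tabular method is a uniform grid, sketched below. This is the naive baseline that such work improves on, not the symbolic-execution partitioning of the paper; the function name `grid_cell` and the bounds and bin counts are hypothetical.

```python
def grid_cell(state, lows, highs, bins):
    """Map a continuous state to the index tuple of its uniform grid cell."""
    cell = []
    for x, lo, hi, n in zip(state, lows, highs, bins):
        frac = (x - lo) / (hi - lo)
        idx = int(frac * n)
        # Clamp so that out-of-range or boundary states land in an edge cell.
        cell.append(min(max(idx, 0), n - 1))
    return tuple(cell)


lows, highs, bins = (0.0, -1.0), (1.0, 1.0), (4, 4)
q_table = {}  # tabular values keyed by (cell, action)
cell = grid_cell((0.25, -0.5), lows, highs, bins)
q_table[(cell, "left")] = 0.0
print(cell, grid_cell((1.0, 1.0), lows, highs, bins))
```

A uniform grid ignores the dynamics entirely, so cells can lump together states that behave very differently, which is the approximation problem the abstract attributes to nonlinear relations between state components.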

[LG-46] Enhancing Feature Selection and Interpretability in AI Regression Tasks Through Feature Attribution

Link: https://arxiv.org/abs/2409.16787
Authors: Alexander Hinterleitner, Thomas Bartz-Beielstein, Richard Schulz, Sebastian Spengler, Thomas Winter, Christoph Leitenmeier
Keywords: Explainable Artificial Intelligence, Research in Explainable, Artificial Intelligence, Explainable Artificial, make deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Research in Explainable Artificial Intelligence (XAI) is increasing, aiming to make deep learning models more transparent. Most XAI methods focus on justifying the decisions made by Artificial Intelligence (AI) systems in security-relevant applications. However, relatively little attention has been given to using these methods to improve the performance and robustness of deep learning algorithms. Additionally, much of the existing XAI work primarily addresses classification problems. In this study, we investigate the potential of feature attribution methods to filter out uninformative features in input data for regression problems, thereby improving the accuracy and stability of predictions. We introduce a feature selection pipeline that combines Integrated Gradients with k-means clustering to select an optimal set of variables from the initial data space. To validate the effectiveness of this approach, we apply it to a real-world industrial problem - blade vibration analysis in the development process of turbo machinery.
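
Integrated Gradients, the attribution method named in this abstract, can be sketched without any deep learning library. The version below approximates gradients by central finite differences and the path integral by a midpoint sum, whereas a real pipeline would use automatic differentiation; the function names, the toy model, and the step counts are assumptions, and the k-means selection step is not shown.

```python
def integrated_gradients(f, x, baseline, steps=50, h=1e-5):
    """Attribution per feature: (x_i - b_i) * average gradient along the
    straight path from the baseline b to the input x."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = list(point)
            up[i] += h
            down = list(point)
            down[i] -= h
            # Central finite difference stands in for autodiff here.
            avg_grad[i] += (f(up) - f(down)) / (2.0 * h) / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_model(v):
    # Hypothetical regression model; the third feature is uninformative.
    return v[0] ** 2 + 3.0 * v[1] + 0.0 * v[2]


attributions = integrated_gradients(toy_model, [2.0, 1.0, 5.0], [0.0, 0.0, 0.0])
print(attributions)
```

The attributions sum to the difference between the model output at the input and at the baseline, and a feature the model ignores receives an attribution of zero, which is what makes the scores usable for filtering uninformative inputs.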

[LG-47] World Model-based Perception for Visual Legged Locomotion

Link: https://arxiv.org/abs/2409.16784
Authors: Hang Lai, Jiahang Cao, Jiafeng Xu, Hongtao Wu, Yunfeng Lin, Tao Kong, Yong Yu, Weinan Zhang
Keywords: Legged locomotion, requires precise perception, proprioception and vision, challenging and requires, requires precise
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: under review

Abstract:Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher’s behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: this https URL.

[LG-48] Super Level Sets and Exponential Decay: A Synergistic Approach to Stable Neural Network Training

Link: https://arxiv.org/abs/2409.16769
Authors: Jatin Chaudhary, Dipak Nidhi, Jukka Heikkonen, Haari Merisaari, Rajiv Kanth
Keywords: advanced anti-overfitting strategies, integrates exponential decay, effectively integrates exponential, dynamic learning rate, anti-overfitting strategies
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The objective of this paper is to enhance the optimization process for neural networks by developing a dynamic learning rate algorithm that effectively integrates exponential decay and advanced anti-overfitting strategies. Our primary contribution is the establishment of a theoretical framework where we demonstrate that the optimization landscape, under the influence of our algorithm, exhibits unique stability characteristics defined by Lyapunov stability principles. Specifically, we prove that the superlevel sets of the loss function, as influenced by our adaptive learning rate, are always connected, ensuring consistent training dynamics. Furthermore, we establish the “equiconnectedness” property of these superlevel sets, which maintains uniform stability across varying training conditions and epochs. This paper contributes to the theoretical understanding of dynamic learning rate mechanisms in neural networks and also paves the way for the development of more efficient and reliable neural optimization techniques. This study intends to formalize and validate the equiconnectedness of the superlevel sets of the loss function in the context of neural network training, opening new avenues for future research in adaptive machine learning algorithms. We leverage previous theoretical discoveries to propose training mechanisms that can effectively handle complex and high-dimensional data landscapes, particularly in applications requiring high precision and reliability.

[LG-49] Interpreting Deep Neural Network-Based Receiver Under Varying Signal-To-Noise Ratios

Link: https://arxiv.org/abs/2409.16768
Authors: Marko Tuononen, Dani Korpi, Ville Hautamäki
Keywords: convolutional neural network-based, focusing on convolutional, network-based receiver model, interpreting neural networks, neural network-based receiver
Categories: Machine Learning (cs.LG)
Comments: 7+1 pages, 8 figures

Abstract:We propose a novel method for interpreting neural networks, focusing on a convolutional neural network-based receiver model. The method identifies which unit or units of the model contain the most (or least) information about the channel parameter(s) of interest, providing insights at both global and local levels, with global explanations aggregating local ones. Experiments on link-level simulations demonstrate the method’s effectiveness in identifying units that contribute most (and least) to signal-to-noise ratio processing. Although we focus on a radio receiver model, the method generalizes to other neural network architectures and applications, offering robust estimation even in high-dimensional settings.

[LG-50] Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training

Link: https://arxiv.org/abs/2409.16767
Authors: Kun Song, Zhiquan Tan, Bochao Zou, Jiansheng Chen, Huimin Ma, Weiran Huang
Keywords: classification head weights, analyze supervised learning, information content, classification head, information
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2406.03999

Abstract:In this paper, we utilize information-theoretic metrics like matrix entropy and mutual information to analyze supervised learning. We explore the information content of data representations and classification head weights and their information interplay during supervised training. Experiments show that matrix entropy cannot solely describe the interaction of the information content of data representation and classification head weights but it can effectively reflect the similarity and clustering behavior of the data. Inspired by this, we propose a cross-modal alignment loss to improve the alignment between the representations of the same class from different modalities. Moreover, in order to assess the interaction of the information content of data representation and classification head weights more accurately, we utilize new metrics like matrix mutual information ratio (MIR) and matrix information entropy difference ratio (HDR). Through theory and experiment, we show that HDR and MIR can not only effectively describe the information interplay of supervised training but also improve the performance of supervised and semi-supervised learning.

[LG-51] MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features

Link: https://arxiv.org/abs/2409.16765
Authors: Katharina Anderer, Andreas Reich, Matthias Wölfel
Keywords: paper presents, presents a benchmark, benchmark dataset, dataset for aligning, multimodal algorithm leveraging
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper presents a benchmark dataset for aligning lecture videos with corresponding slides and introduces a novel multimodal algorithm leveraging features from speech, text, and images. It achieves an average accuracy of 0.82 in comparison to SIFT (0.56) while being approximately 11 times faster. Using dynamic programming the algorithm tries to determine the optimal slide sequence. The results show that penalizing slide transitions increases accuracy. Features obtained via optical character recognition (OCR) contribute the most to a high matching accuracy, followed by image features. The findings highlight that audio transcripts alone provide valuable information for alignment and are beneficial if OCR data is lacking. Variations in matching accuracy across different lectures highlight the challenges associated with video quality and lecture style. The novel multimodal algorithm demonstrates robustness to some of these challenges, underscoring the potential of the approach.
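
The dynamic-programming step with a slide-transition penalty can be sketched as a Viterbi-style search over a frame-by-slide similarity matrix. The abstract does not give the exact formulation, so the scoring rule, the function name `align_slides`, and the toy scores below are assumptions for illustration.

```python
def align_slides(scores, switch_penalty):
    """Pick one slide per video frame, maximizing total similarity minus a
    fixed penalty for every change of slide between consecutive frames.

    scores[t][s] is the (assumed) similarity of frame t to slide s.
    """
    n_frames, n_slides = len(scores), len(scores[0])
    best = [list(scores[0])]          # best[t][s]: best total ending in slide s
    back = [[0] * n_slides]           # back[t][s]: predecessor slide on that path
    for t in range(1, n_frames):
        row, ptr = [], []
        for s in range(n_slides):
            cands = [
                best[t - 1][p] - (switch_penalty if p != s else 0.0)
                for p in range(n_slides)
            ]
            p_best = max(range(n_slides), key=cands.__getitem__)
            row.append(scores[t][s] + cands[p_best])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final slide.
    s = max(range(n_slides), key=best[-1].__getitem__)
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]


scores = [[0.9, 0.1], [0.4, 0.5], [0.9, 0.1]]
print(align_slides(scores, 0.0))  # follows the per-frame maximum
print(align_slides(scores, 0.5))  # penalty suppresses the brief jump
```

In the toy example the middle frame matches the second slide only marginally better, so a penalty removes the spurious back-and-forth jump, which is consistent with the abstract's observation that penalizing transitions increases accuracy.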

[LG-52] Offline and Distributional Reinforcement Learning for Radio Resource Management

Link: https://arxiv.org/abs/2409.16764
Authors: Eslam Eldeeb, Hirley Alves
Keywords: intelligent wireless networks, future intelligent wireless, Reinforcement learning, wireless networks, future intelligent
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Reinforcement learning (RL) has proved to have a promising role in future intelligent wireless networks. Online RL has been adopted for radio resource management (RRM), taking over traditional schemes. However, due to its reliance on online interaction with the environment, its role becomes limited in practical, real-world problems where online interaction is not feasible. In addition, traditional RL falls short in the face of the uncertainties and risks in real-world stochastic environments. To address these issues, we propose an offline and distributional RL scheme for the RRM problem, enabling offline training using a static dataset without any interaction with the environment and considering the sources of uncertainties using the distributions of the return. Simulation results demonstrate that the proposed scheme outperforms conventional resource management models. In addition, it is the only scheme that surpasses online RL, achieving a 16% gain over it.

[LG-53] GB-RVFL: Fusion of Randomized Neural Network and Granular Ball Computing

Link: https://arxiv.org/abs/2409.16735
Authors: M. Sajid, A. Quadir, M. Tanveer
Keywords: vector functional link, strong generalization ability, random vector functional, prominent classification model, functional link
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The random vector functional link (RVFL) network is a prominent classification model with strong generalization ability. However, RVFL treats all samples uniformly, ignoring whether they are pure or noisy, and its scalability is limited due to the need for inverting the entire training matrix. To address these issues, we propose the granular ball RVFL (GB-RVFL) model, which uses granular balls (GBs) as inputs instead of training samples. This approach enhances scalability by requiring only the inverse of the GB center matrix and improves robustness against noise and outliers through the coarse granularity of GBs. Furthermore, RVFL overlooks the dataset’s geometric structure. To address this, we propose the graph embedding GB-RVFL (GE-GB-RVFL) model, which fuses granular computing and graph embedding (GE) to preserve the topological structure of GBs. The proposed GB-RVFL and GE-GB-RVFL models are evaluated on KEEL, UCI, NDC and biomedical datasets, demonstrating superior performance compared to baseline models.

[LG-54] Verified Relative Safety Margins for Neural Network Twins

Link: https://arxiv.org/abs/2409.16726
Authors: Anahita Baninajjar, Kamran Hosseini, Ahmed Rezine, Amir Aminifar
Keywords: Deep Neural Network, Deep Neural, Relative Safety Margins, Neural Network, DNN
Categories: Machine Learning (cs.LG)

Abstract:Given two Deep Neural Network (DNN) classifiers with the same input and output domains, our goal is to quantify the robustness of the two networks in relation to each other. Towards this, we introduce the notion of Relative Safety Margins (RSMs). Intuitively, given two classes and a common input, the RSM of one classifier with respect to another reflects the relative margins with which decisions are made. The proposed notion is relevant in the context of several application domains, including comparing a trained network with its corresponding compact network (e.g., a pruned, quantized, or distilled network). Not only can RSMs establish whether decisions are preserved, but they can also quantify their qualities. We also propose a framework to establish safe bounds on RSM gains or losses given an input and a family of perturbations. We evaluate our approach using MNIST, CIFAR10, and two real-world medical datasets to show the relevance of our results.

[LG-55] PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning

Link: https://arxiv.org/abs/2409.16722
Authors: Qibin Wang, Xiaolin Hu, Weikai Xu, Wei Liu, Jian Luan, Bin Wang
Keywords: avoid excessive inference, excessive inference costs, Low-rank adaptation, Matrices Skeleton Selection, variants have recently
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Low-rank adaptation (LoRA) and its variants have recently gained much interest due to their ability to avoid excessive inference costs. However, LoRA still encounters the following challenges: (1) the limitation of the low-rank assumption; and (2) its initialization method may be suboptimal. To this end, we propose PMSS (Pre-trained Matrices Skeleton Selection), which enables high-rank updates at low cost while leveraging the semantic and linguistic information inherent in pre-trained weights. It achieves this by selecting skeletons from the pre-trained weight matrix and only learning a small matrix instead. Experiments demonstrate that PMSS outperforms LoRA and other fine-tuning methods across tasks with far fewer trainable parameters. We demonstrate its effectiveness, especially in handling complex tasks such as the DROP benchmark (+3.4%/+5.9% on LLaMA2-7B/13B) and math reasoning (+12.89%/+5.61%/+3.11% on LLaMA2-7B, Mistral-7B and Gemma-7B on GSM8K). The code and model will be released soon.

[LG-56] Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2409.16720
Authors: Xian Wang, Jin Zhou, Yuanli Feng, Jiahao Mei, Jiming Chen, Shuo Li
Keywords: Recent innovations, innovations in autonomous, autonomous drones, drones have facilitated, configurations and enhanced
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures

Abstract:Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations and enhanced maneuverability in multi-drone systems through the application of optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network for time-optimal multi-drone flight using multi-agent reinforcement learning. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision penalty inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training, while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade-offs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with low collision rates. Real-world experiments validate our method, with two quadrotors using the same network as simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m * 5.5 m * 2.0 m space across various tracks, relying entirely on onboard computation.

[LG-57] Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification EMNLP2024

Link: https://arxiv.org/abs/2409.16718
Authors: Ming Li, Jike Zhong, Chenxin Li, Liuzhuozheng Li, Nie Lin, Masashi Sugiyama
Keywords: Recent advances, classic model fine-tuning, fine-tuning Vision-Language Models, prompt tuning, adapter tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: EMNLP 2024 Main Conference

Abstract:Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while the classic model fine-tuning on inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with few-shot samples corrupts the pre-trained knowledge since fine-tuning the CLIP model even degrades performance. In this paper, we revisit this viewpoint, and propose a new perspective: fine-tuning the specific parameters instead of all will uncover the power of classic model fine-tuning on VLMs. Through our meticulous study, we propose ClipFit, a simple yet effective method to fine-tune CLIP without introducing any overhead of extra parameters. We demonstrate that by only fine-tuning the specific bias terms and normalization layers, ClipFit can improve the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that low-level text bias layers and the first layer normalization layer change much more than other layers. The code is available at this https URL.

[LG-58] Numerical Approximation Capacity of Neural Networks with Bounded Parameters: Do Limits Exist and How Can They Be Measured?

Link: https://arxiv.org/abs/2409.16697
Authors: Li Liu,Tengchao Yu,Heng Yong
Keywords: Approximation Theorem posits, Universal Approximation Theorem, Theorem posits, possess unlimited approximation, Approximation Theorem
Subjects: Machine Learning (cs.LG)
Comments: Universal Approximation; Bounded Weights; Analytic Function; Numerical Span Dimension; Infinite Width Neural Network

Abstract:The Universal Approximation Theorem posits that neural networks can theoretically possess unlimited approximation capacity with a suitable activation function and a freely chosen or trained set of parameters. However, a more practical scenario arises when these neural parameters, especially the nonlinear weights and biases, are bounded. This leads us to question: Does the approximation capacity of a neural network remain universal, or does it have a limit when the parameters are practically bounded? And if it has a limit, how can it be measured? Our theoretical study indicates that while universal approximation is theoretically feasible, in practical numerical scenarios, Deep Neural Networks (DNNs) with any analytic activation functions (such as Tanh and Sigmoid) can only be approximated by a finite-dimensional vector space under a bounded nonlinear parameter space (NP space), whether in a continuous or discrete sense. Based on this study, we introduce the concepts of ε outer measure and Numerical Span Dimension (NSdim) to quantify the approximation capacity limit of a family of networks both theoretically and practically. Furthermore, drawing on our new theoretical study and adopting a fresh perspective, we strive to understand the relationship between back-propagation neural networks and random parameter networks (such as the Extreme Learning Machine (ELM)) with both finite and infinite width. We also aim to provide fresh insights into regularization, the trade-off between width and depth, parameter space, width redundancy, condensation, and other related important issues.

[LG-59] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Link: https://arxiv.org/abs/2409.16694
Authors: Ruihao Gong,Yifu Ding,Zining Wang,Chengtao Lv,Xingyu Zheng,Jinyang Du,Haotong Qin,Jinyang Guo,Michele Magno,Xianglong Liu
Keywords: Large language models, natural language processing, showcasing exceptional performance, Large language, achieved remarkable advancements
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Ruihao Gong leads the overall organization of the survey, with Yifu Ding and Jinyang Du contributing to Sections 2 and 3. Xingyu Zheng is responsible for authoring Section 4, while Chengtao Lv and Zining Wang collaborate on Section 5. Haotong Qin, Jinyang Guo, Michele Magno, and Xianglong Liu provide guidance during the whole process and assist in refining the final manuscript

Abstract:Large language models (LLMs) have achieved remarkable advancements in natural language processing, showcasing exceptional performance across various tasks. However, the expensive memory and computational requirements present significant challenges for their practical deployment. Low-bit quantization has emerged as a critical approach to mitigate these challenges by reducing the bit-width of model parameters, activations, and gradients, thus decreasing memory usage and computational demands. This paper presents a comprehensive survey of low-bit quantization methods tailored for LLMs, covering the fundamental principles, system implementations, and algorithmic strategies. An overview of basic concepts and new data formats specific to low-bit LLMs is first introduced, followed by a review of frameworks and systems that facilitate low-bit LLMs across various hardware platforms. Then, we categorize and analyze techniques and toolkits for efficient low-bit training and inference of LLMs. Finally, we conclude with a discussion of future trends and potential advancements of low-bit LLMs. Our systematic overview from basic, system, and algorithm perspectives can offer valuable insights and guidelines for future works to enhance the efficiency and applicability of LLMs through low-bit quantization.
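
To make "reducing the bit-width" concrete, here is a minimal sketch of symmetric uniform quantization, one common scheme chosen for illustration; the abstract does not single out any particular scheme. The function names and the sample weights are assumptions. Each float is mapped to a signed integer code and recovered by multiplying by a shared scale, so fewer bits mean a coarser grid and a larger rounding error.

```python
# Minimal sketch of symmetric uniform quantization (an illustrative scheme,
# not one singled out by the survey).

def quantize(values, num_bits):
    """Map floats to signed integers in [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, for demonstration only.
weights = [0.82, -0.31, 0.05, -1.20, 0.47]
for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max_err={max_err:.4f}")
```

The round-trip error of each value is at most half the scale, which is why the 4-bit run shows a visibly larger error than the 8-bit run on the same weights.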

[LG-60] Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model ECCV2024

Link: https://arxiv.org/abs/2409.16689
Authors: Shoma Iwai,Atsuki Osanai,Shunsuke Kitada,Shinichiro Omachi
Keywords: task to synthesize, synthesize a harmonious, Layout, harmonious layout, characterized by attributes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Accepted by ECCV2024, Project Page: this https URL

Abstract:Layout generation is a task to synthesize a harmonious layout with elements characterized by attributes such as category, position, and size. Human designers experiment with the placement and modification of elements to create aesthetic layouts; however, we observed that current discrete diffusion models (DDMs) struggle to correct inharmonious layouts after they have been generated. In this paper, we first provide novel insights into the layout sticking phenomenon in DDMs and then propose a simple yet effective layout-assessment module, Layout-Corrector, which works in conjunction with existing DDMs to address the layout sticking problem. We present a learning-based module capable of identifying inharmonious elements within layouts, considering overall layout harmony characterized by complex composition. During the generation process, Layout-Corrector evaluates the correctness of each token in the generated layout, reinitializing those with low scores to the ungenerated state. The DDM then uses the high-scored tokens as clues to regenerate the harmonized tokens. Layout-Corrector, tested on common benchmarks, consistently boosts layout-generation performance when used in conjunction with various state-of-the-art DDMs. Furthermore, our extensive analysis demonstrates that the Layout-Corrector (1) successfully identifies erroneous tokens, (2) facilitates control over the fidelity-diversity trade-off, and (3) significantly mitigates the performance drop associated with fast sampling.
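
The "score each token, then reset the low-scoring ones" step can be pictured with the small sketch below. It is only an illustration under stated assumptions: in the paper the scores come from a learned module, whereas here they are plain numbers, and the `MASK` symbol, the fixed threshold rule, and the token strings are all hypothetical.

```python
# Hypothetical sketch of the "score, then reset" step. MASK, the threshold,
# and the token/score values are illustrative assumptions.

MASK = "<mask>"  # stands in for the ungenerated state


def reset_low_score_tokens(tokens, scores, threshold):
    """Replace tokens whose score is below `threshold` with MASK."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [tok if s >= threshold else MASK for tok, s in zip(tokens, scores)]


# Made-up layout tokens and correctness scores for demonstration.
layout_tokens = ["text", "x=10", "y=20", "image", "x=12", "y=21"]
corrector_scores = [0.95, 0.90, 0.88, 0.91, 0.12, 0.20]

partially_reset = reset_low_score_tokens(layout_tokens, corrector_scores, threshold=0.5)
print(partially_reset)
```

After this step, a generator would fill in the masked positions again while conditioning on the tokens that were kept.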

[LG-61] Erase then Rectify: A Training-Free Parameter Editing Approach for Cost-Effective Graph Unlearning

Link: https://arxiv.org/abs/2409.16684
Authors: Zhe-Rui Yang,Jindong Han,Chang-Dong Wang,Hao Liu
Keywords: Graph Neural Network, trained Graph Neural, Neural Network, Graph Neural, Graph unlearning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Graph unlearning, which aims to eliminate the influence of specific nodes, edges, or attributes from a trained Graph Neural Network (GNN), is essential in applications where privacy, bias, or data obsolescence is a concern. However, existing graph unlearning techniques often necessitate additional training on the remaining data, leading to significant computational costs, particularly with large-scale graphs. To address these challenges, we propose a two-stage training-free approach, Erase then Rectify (ETR), designed for efficient and scalable graph unlearning while preserving the model utility. Specifically, we first build a theoretical foundation showing that masking parameters critical for unlearned samples enables effective unlearning. Building on this insight, the Erase stage strategically edits model parameters to eliminate the impact of unlearned samples and their propagated influence on intercorrelated nodes. To further ensure the GNN’s utility, the Rectify stage devises a gradient approximation method to estimate the model’s gradient on the remaining dataset, which is then used to enhance model performance. Overall, ETR achieves graph unlearning without additional training or full training data access, significantly reducing computational overhead and preserving data privacy. Extensive experiments on seven public datasets demonstrate the consistent superiority of ETR in model utility, unlearning efficiency, and unlearning effectiveness, establishing it as a promising solution for real-world graph unlearning challenges.

[LG-62] CryptoTrain: Fast Secure Training on Encrypted Dataset CCS

Link: https://arxiv.org/abs/2409.16675
Authors: Jiaqi Xue,Yancheng Zhang,Yanshan Wang,Xueqiang Wang,Hao Zheng,Qian Lou
Keywords: typically incurs significant, Fully Homomorphic Encryption, incurs significant training, Traditional Fully Homomorphic, typically incurs
Subjects: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted by CCS-LAMPS 2024

Abstract:Secure training, while protecting the confidentiality of both data and model weights, typically incurs significant training overhead. Traditional Fully Homomorphic Encryption (FHE)-based non-interactive training models are heavily burdened by computationally demanding bootstrapping. To develop an efficient secure training system, we established a foundational framework, CryptoTrain-B, utilizing a hybrid cryptographic protocol that merges FHE with Oblivious Transfer (OT) for handling linear and non-linear operations, respectively. This integration eliminates the need for costly bootstrapping. Although CryptoTrain-B sets a new baseline in performance, reducing its training overhead remains essential. We found that ciphertext-ciphertext multiplication (CCMul) is a critical bottleneck in operations involving encrypted inputs and models. Our solution, the CCMul-Precompute technique, involves precomputing CCMul offline and resorting to the less resource-intensive ciphertext-plaintext multiplication (CPMul) during private training. Furthermore, conventional polynomial convolution in FHE systems tends to encode irrelevant and redundant values into polynomial slots, necessitating additional polynomials and ciphertexts for input representation and leading to extra multiplications. Addressing this, we introduce correlated polynomial convolution, which encodes only related input values into polynomials, thus drastically reducing the number of computations and overheads. By integrating CCMul-Precompute and correlated polynomial convolution into CryptoTrain-B, we facilitate a rapid and efficient secure training framework, CryptoTrain. Extensive experiments demonstrate that CryptoTrain achieves a ~5.3× training time reduction compared to prior methods.

[LG-63] SWE2: SubWord Enriched and Significant Word Emphasized Framework for Hate Speech Detection CIKM2020

Link: https://arxiv.org/abs/2409.16673
Authors: Guanyi Mou,Pengyi Ye,Kyumin Lee
Keywords: emerging hot topics, online social networks, Hate speech detection, Hate speech, hate speech makes
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in CIKM 2020

Abstract:Hate speech detection on online social networks has become one of the emerging hot topics in recent years. With its broad spread and fast propagation across online social networks, hate speech has a significant impact on society by increasing prejudice and hurting people. It has therefore attracted attention and concern from both industry and academia. In this paper, we address the hate speech problem and propose a novel hate speech detection framework called SWE2, which relies only on the content of messages and automatically identifies hate speech. In particular, our framework exploits both word-level semantic information and sub-word knowledge. It is intuitively persuasive and also performs well in practice both with and without character-level adversarial attack. Experimental results show that our proposed model achieves 0.975 accuracy and 0.953 macro F1, outperforming 7 state-of-the-art baselines under no adversarial attack. Our model also performed robustly under extreme adversarial attack (manipulation of 50% of messages), achieving 0.967 accuracy and 0.934 macro F1.

[LG-64] Wildlife Product Trading in Online Social Networks: A Case Study on Ivory-Related Product Sales Promotion Posts

Link: https://arxiv.org/abs/2409.16671
Authors: Guanyi Mou,Yun Yue,Kyumin Lee,Ziming Zhang
Keywords: utilizing e-commerce websites, online social networks, social networks, wildlife product, online social
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: ICWSM 2024

Abstract:Wildlife trafficking (WLT) has emerged as