本篇博文主要展示 2024-09-23 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上10:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天10:30左右邮件定时自动发送。【邮箱发送异常,暂不增加!!!!!】

目录

概览 (2024-09-23)

今日共更新388篇论文,其中:

  • 自然语言处理56篇(Computation and Language (cs.CL))
  • 人工智能94篇(Artificial Intelligence (cs.AI))
  • 计算机视觉103篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习115篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] he Impact of Large Language Models in Academia: from Writing to Speaking
该论文试图解决的问题是大型语言模型(LLMs)如何影响人类社会中的文本信息交流,特别是写作和口语表达中的词汇使用。解决方案的关键在于通过分析超过30,000篇论文和1,000场机器学习会议的演讲,比较写作和口语中使用的词汇,首次大规模研究LLMs对同一群体内两种主要交流方式的影响。实证结果表明,LLM风格的词汇如“significant”在摘要和口头报告中使用频率增加,暗示LLMs对口语表达的影响正在显现,并可能在未来进一步扩大,强调了LLMs对人类社会的隐性影响和涟漪效应。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13686
作者: Mingmeng Geng,Caixi Chen,Yanru Wu,Dongping Chen,Yao Wan,Pan Zhou
关键词-EN: Large language models, increasingly impacting human, Large language, impacting human society, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注: 16 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly impacting human society, particularly in textual information. Based on more than 30,000 papers and 1,000 presentations from machine learning conferences, we examined and compared the words used in writing and speaking, representing the first large-scale investigating study of how LLMs influence the two main modes of verbal communication and expression within the same group of people. Our empirical results show that LLM-style words such as “significant” have been used more frequently in abstracts and oral presentations. The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.
摘要:大语言模型 (LLMs) 正日益影响着人类社会,尤其是在文本信息领域。基于来自机器学习会议的超过 30,000 篇论文和 1,000 个演讲,我们研究并比较了写作和演讲中使用的词汇,这代表了首次大规模研究 LLMs 如何影响同一群体内口头交流和表达的两种主要模式。我们的实证结果显示,诸如 “significant” 等 LLM 风格的词汇在摘要和口头演讲中使用频率更高。对演讲的影响已经开始显现,并可能在将来进一步扩大,这引起了人们对 LLMs 对人类社会隐性影响和涟漪效应的关注。

[NLP-1] ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation
该论文试图解决机器人长时间在复杂环境中导航时,如何有效处理和回答与历史行为相关的问题,如事件发生的位置、时间或持续时间。解决方案的关键是引入了一个名为ReMEmbR的系统,该系统通过构建和查询增强记忆来实现长时程视频问答。ReMEmbR采用结构化的方法,包括记忆构建和查询阶段,利用时间信息、空间信息和图像来高效处理不断增长的机器人历史记录。实验结果表明,ReMEmbR在长时程推理中表现优异,且具有低延迟特性,能够处理多样化的查询。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13682
作者: Abrar Anwar,John Welsh,Joydeep Biswas,Soha Pouya,Yan Chang
关键词-EN: understanding complex environments, Navigating and understanding, understanding complex, complex environments, environments over extended
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Navigating and understanding complex environments over extended periods of time is a significant challenge for robots. People interacting with the robot may want to ask questions like where something happened, when it occurred, or how long ago it took place, which would require the robot to reason over a long history of their deployment. To address this problem, we introduce a Retrieval-augmented Memory for Embodied Robots, or ReMEmbR, a system designed for long-horizon video question answering for robot navigation. To evaluate ReMEmbR, we introduce the NaVQA dataset where we annotate spatial, temporal, and descriptive questions to long-horizon robot navigation videos. ReMEmbR employs a structured approach involving a memory building and a querying phase, leveraging temporal information, spatial information, and images to efficiently handle continuously growing robot histories. Our experiments demonstrate that ReMEmbR outperforms LLM and VLM baselines, allowing ReMEmbR to achieve effective long-horizon reasoning with low latency. Additionally, we deploy ReMEmbR on a robot and show that our approach can handle diverse queries. The dataset, code, videos, and other material can be found at the following link: this https URL
摘要:在长时间内导航和理解复杂环境对机器人来说是一个重大挑战。与机器人互动的人可能会提出诸如某事发生在哪里、何时发生或已经过去多久等问题,这要求机器人能够对长时间的部署历史进行推理。为了解决这个问题,我们引入了用于具身机器人的检索增强记忆系统,即 ReMEmbR,这是一个为机器人导航的长时视频问答设计的系统。为了评估 ReMEmbR,我们引入了 NaVQA 数据集,在该数据集中我们对长时机器人导航视频标注了空间、时间和描述性问题。ReMEmbR 采用了一种结构化的方法,包括记忆构建和查询阶段,利用时间信息、空间信息和图像来高效处理不断增长的机器人历史。我们的实验表明,ReMEmbR 优于大语言模型 (LLM) 和视觉语言模型 (VLM) 基线,使得 ReMEmbR 能够以低延迟实现有效的长时推理。此外,我们将 ReMEmbR 部署在机器人上,并展示了我们的方法能够处理多样化的查询。数据集、代码、视频和其他材料可以在以下链接中找到:this https URL

[NLP-2] Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning EMNLP2024
该论文试图解决当前大型语言模型(LLMs)在训练过程中依赖于高成本、复杂且资源密集的标准交叉熵损失、大量数据和人工反馈的问题。解决方案的关键在于引入语义分割损失函数(如Focal Loss和Lovász Loss)来替代传统的交叉熵损失,以实现更高效、可扩展且实用的微调方法。研究表明,使用这些替代损失函数在数学问题求解和问答任务中,无需额外数据或人工反馈,即可显著提升模型性能,平均精确匹配率提高了42%。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13641
作者: Daniele Rege Cambrin,Giuseppe Gallipoli,Irene Benedetto,Luca Cagliero,Paolo Garza
关键词-EN: Large Language Models, demonstrated impressive performance, Large Language, demonstrated impressive, Large
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in EMNLP 2024 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance across various tasks. However, current training approaches combine standard cross-entropy loss with extensive data, human feedback, or ad hoc methods to enhance performance. These solutions are often not scalable or feasible due to their associated costs, complexity, or resource requirements. This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.
摘要:大语言模型 (LLMs) 在各种任务中展示了令人印象深刻的表现。然而,当前的训练方法结合了标准的交叉熵损失 (cross-entropy loss) 与大量数据、人类反馈或临时方法来提升性能。这些解决方案由于其相关的成本、复杂性或资源需求,往往不具备可扩展性或可行性。本研究探讨了在自然语言生成中使用已建立的语义分割损失函数,以创建一种多功能、实用且可扩展的微调不同架构的解决方案。我们评估了这些方法在解决数学应用题和跨不同规模模型的问题回答中的有效性。对于分析的任务,我们发现传统的交叉熵损失代表了一个次优选择,而通过最小化替代(任务依赖)损失,如 Focal 或 Lovász,模型在精确匹配上实现了平均 +42% 的提升,且无需额外数据或人类反馈。这些发现为更高效和更易获取的训练过程提供了有前景的路径。

[NLP-3] Advancing Event Causality Identification via Heuristic Semantic Dependency Inquiry Network
该论文试图解决事件因果关系识别(ECI)中因果特征不明确和外部知识引入偏差的问题。解决方案的关键在于提出了一种名为SemDI的语义依赖查询网络,该网络通过统一的编码器捕捉上下文中的语义依赖关系,并利用Cloze分析器生成填充标记,以此来查询两个事件之间的因果关系。这种方法在三个广泛使用的基准测试中表现优异,超越了现有的最先进方法。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13621
作者: Haoran Li,Qiang Gao,Hongmei Wu,Li Huang
关键词-EN: Event Causality Identification, Causality Identification, Event Causality, focuses on extracting, Dependency Inquiry Network
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Event Causality Identification (ECI) focuses on extracting causal relations between events in texts. Existing methods for ECI primarily rely on causal features and external knowledge. However, these approaches fall short in two dimensions: (1) causal features between events in a text often lack explicit clues, and (2) external knowledge may introduce bias, while specific problems require tailored analyses. To address these issues, we propose SemDI - a simple and effective Semantic Dependency Inquiry Network for ECI. SemDI captures semantic dependencies within the context using a unified encoder. Then, it utilizes a Cloze Analyzer to generate a fill-in token based on comprehensive context understanding. Finally, this fill-in token is used to inquire about the causal relation between two events. Extensive experiments demonstrate the effectiveness of SemDI, surpassing state-of-the-art methods on three widely used benchmarks. Code is available at this https URL.
摘要:事件因果关系识别 (Event Causality Identification, ECI) 专注于从文本中提取事件之间的因果关系。现有的 ECI 方法主要依赖于因果特征和外部知识。然而,这些方法在两个维度上存在不足:(1) 文本中事件之间的因果特征往往缺乏明确的线索;(2) 外部知识可能引入偏见,而特定问题需要定制化的分析。为了解决这些问题,我们提出了 SemDI——一种简单且有效的语义依赖查询网络 (Semantic Dependency Inquiry Network),用于 ECI。SemDI 通过统一的编码器捕捉上下文中的语义依赖关系。然后,它利用 Cloze 分析器基于全面的上下文理解生成填空 Token。最后,这个填空 Token 用于查询两个事件之间的因果关系。大量实验表明,SemDI 的有效性,在三个广泛使用的基准测试中超越了最先进的方法。代码可在以下链接获取:https URL。

[NLP-4] MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension EMNLP2024
该论文试图解决在引用表达理解(REC)任务中,通过全量微调预训练模型导致计算成本高且破坏了预训练模型中嵌入的丰富先验知识的问题。解决方案的关键在于提出了一种名为MaPPER的新框架,该框架结合了多模态先验引导的参数高效调优方法。具体来说,MaPPER包括动态先验适配器(Dynamic Prior Adapters)和局部卷积适配器(Local Convolution Adapters),前者利用对齐的先验知识,后者用于提取精确的局部语义以增强视觉感知。此外,还提出了先验引导的文本模块(Prior-Guided Text module),以进一步利用先验知识促进跨模态对齐。实验结果表明,MaPPER在仅调整1.41%的可调参数的情况下,达到了比全量微调和其它参数高效调优方法更高的准确率。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13609
作者: Ting Liu,Zunnan Xu,Yue Hu,Liangtao Shi,Zhiqiang Wang,Quanjun Yin
关键词-EN: Referring Expression Comprehension, Referring Expression, Expression Comprehension, local visual region, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024

点击查看摘要

Abstract:Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by a aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.
摘要:指称表达理解 (Referring Expression Comprehension, REC) 旨在通过自然语言定位局部视觉区域,是一项高度依赖多模态对齐的任务。现有的大多数方法利用强大的预训练模型通过全量微调来转移视觉/语言知识。然而,全量微调整个骨干网络不仅会破坏预训练中嵌入的丰富先验知识,还会带来显著的计算成本。受近期出现的参数高效迁移学习 (Parameter-Efficient Transfer Learning, PETL) 方法的启发,我们旨在以有效且高效的方式解决 REC 任务。直接将这些 PETL 方法应用于 REC 任务是不合适的,因为它们缺乏针对精确局部视觉感知和视觉-语言对齐的特定领域能力。因此,我们提出了一种新的多模态先验引导的参数高效调优框架,即 MaPPER。具体来说,MaPPER 包括由对齐先验引导的动态先验适配器,以及用于提取精确局部语义以增强视觉感知的局部卷积适配器。此外,我们还提出了先验引导的文本模块,以进一步利用先验知识促进跨模态对齐。在三个广泛使用的基准测试上的实验结果表明,MaPPER 在仅调整 1.41% 的可调骨干参数的情况下,达到了比全量微调和其它 PETL 方法更高的准确率。

[NLP-5] Cross-Target Stance Detection: A Survey of Techniques Datasets and Challenges
该论文试图解决跨目标立场检测问题,即在模型已针对某些目标进行训练后,如何将其应用于新的、未见过的目标上。解决方案的关键在于从基本的统计方法演进到现代的神经网络和大型语言模型(LLM),通过引入主题分组注意力机制、对抗学习、零样本检测、微调技术以及外部知识的整合,显著提升了模型的准确性和适应性。此外,提示调优方法的运用进一步优化了模型性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13594
作者: Parisa Jamadi Khiabani,Arkaitz Zubiaga
关键词-EN: cross-target stance detection, Stance detection, cross-target stance, Stance, viewpoint expressed
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Stance detection is the task of determining the viewpoint expressed in a text towards a given target. A specific direction within the task focuses on cross-target stance detection, where a model trained on samples pertaining to certain targets is then applied to a new, unseen target. With the increasing need to analyze and mining viewpoints and opinions online, the task has recently seen a significant surge in interest. This review paper examines the advancements in cross-target stance detection over the last decade, highlighting the evolution from basic statistical methods to contemporary neural and LLM-based models. These advancements have led to notable improvements in accuracy and adaptability. Innovative approaches include the use of topic-grouped attention and adversarial learning for zero-shot detection, as well as fine-tuning techniques that enhance model robustness. Additionally, prompt-tuning methods and the integration of external knowledge have further refined model performance. A comprehensive overview of the datasets used for evaluating these models is also provided, offering valuable insights into the progress and challenges in the field. We conclude by highlighting emerging directions of research and by suggesting avenues for future work in the task.
摘要:立场检测(Stance Detection)是指确定文本中对给定目标所表达的观点的任务。该任务的一个特定方向关注跨目标立场检测,即在某些目标样本上训练的模型随后应用于新的、未见过的目标。随着在线分析和挖掘观点与意见的需求日益增加,该任务近年来引起了极大的关注。本文回顾了过去十年中跨目标立场检测的进展,突出了从基本统计方法到当代基于神经网络和大语言模型(LLM)的模型的演变。这些进展显著提高了准确性和适应性。创新方法包括使用主题分组注意力(Topic-grouped Attention)和对抗学习(Adversarial Learning)进行零样本检测(Zero-shot Detection),以及增强模型鲁棒性的微调技术(Fine-tuning Techniques)。此外,提示微调方法(Prompt-tuning Methods)和外部知识的整合进一步提升了模型性能。本文还提供了用于评估这些模型的数据集的全面概述,为该领域的进展和挑战提供了宝贵的见解。最后,我们强调了研究的新兴方向,并提出了该任务未来工作的途径。

[NLP-6] YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models EMNLP2024
该论文试图解决视觉语言模型在理解和识别讽刺图像方面的挑战。解决方案的关键在于提出了三个具体的任务:讽刺图像检测(判断图像是否为讽刺)、讽刺理解(生成图像讽刺的原因)和讽刺图像补全(根据一半图像从两个选项中选择另一半,使完整图像具有讽刺性)。论文通过发布高质量的数据集YesBut(包含2547张图像,其中1084张为讽刺图像,1463张为非讽刺图像)来评估这些任务,并展示了当前视觉语言模型在零样本设置下在这些任务上的表现不佳。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13592
作者: Abhilash Nandy,Yash Agarwal,Ashish Patwa,Millon Madhur Das,Aman Bansal,Ankit Raj,Pawan Goyal,Niloy Ganguly
关键词-EN: Understanding satire, Satirical Image Detection, current Vision-Language models, satire and humor, Image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2024 Main (Long), 18 pages, 14 figures, 12 tables

点击查看摘要

Abstract:Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at this https URL.
摘要:理解讽刺和幽默对于当前的视觉-语言模型来说是一项极具挑战性的任务。本文提出了三个具有挑战性的任务:讽刺图像检测(检测图像是否具有讽刺意味)、理解(生成图像具有讽刺意味的原因)以及补全(给定图像的一半,从两个选项中选择另一半,使得完整图像具有讽刺意味),并发布了一个高质量的数据集 YesBut,该数据集包含 2547 张图像,其中 1084 张为讽刺图像,1463 张为非讽刺图像,涵盖了不同的艺术风格,用于评估这些任务。数据集中的每张讽刺图像都描绘了一个正常场景,同时伴随着一个冲突的场景,这个冲突场景既有趣又具有讽刺意味。尽管当前的视觉-语言模型在多模态任务如视觉问答和图像描述生成方面取得了成功,但我们的基准测试实验表明,这些模型在 YesBut 数据集上的零样本设置下,对于所提出的任务在自动化评估和人工评估方面表现不佳。此外,我们还发布了一个包含 119 张真实讽刺照片的数据集,以供进一步研究。数据集和代码可通过以下链接获取:https URL。

[NLP-7] Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis
该论文试图解决在线服务系统中日志异常检测后的人工故障诊断问题,即如何自动化地从日志中提取故障指示信息以辅助故障诊断。解决方案的关键在于提出了LoFI方法,该方法通过两个阶段实现自动化信息提取:首先进行粗粒度过滤,基于语义相似性收集与故障相关的日志;然后利用预训练语言模型结合新颖的提示调优方法,从收集的日志中提取细粒度的故障指示信息。实验结果表明,LoFI在识别故障指示信息方面显著优于现有方法,有效提升了故障诊断的效率和准确性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13561
作者: Junjie Huang,Zhihan Jiang,Jinyang Liu,Yintong Huo,Jiazhen Gu,Zhuangbin Chen,Cong Feng,Hui Dong,Zengyin Yang,Michael R. Lyu
关键词-EN: effective failure mitigation, online service systems, encompass important information, failure mitigation, maintenance of online
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: This paper has been accepted by the 35th IEEE International Symposium on Software Reliability Engineering (ISSRE’2024)

点击查看摘要

Abstract:Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such faultindicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. In the first stage, LoFI performs coarse-grained filtering to collect logs related to the faults based on semantic similarity. In the second stage, LoFI leverages a pre-trained language model with a novel prompt-based tuning method to extract fine-grained information of interest from the collected logs. We evaluate LoFI on logs collected from Apache Spark and an industrial dataset from CloudA. The experimental results demonstrate that LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8~37.9 in F1 over the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in recognizing fault-indicating information. Furthermore, the successful deployment of LoFI at CloudA and user studies validate the utility of our method. The code and data are available at this https URL.
摘要:日志在在线服务系统的维护中至关重要,通常包含有效故障缓解的重要信息。尽管现有的异常检测方法有助于在大量运行时数据中识别异常日志,但工程师手动检查日志消息以理解故障仍然是必要的,这一过程既耗时又容易出错。通过对 CloudA 基于日志的故障排查实践进行研究,我们发现工程师通常优先考虑两类日志信息进行诊断。这些信息包括故障指示描述,记录异常系统事件,以及故障指示参数,指定相关实体。基于这一发现,我们提出了一种自动从日志中提取此类故障指示信息以进行故障诊断的方法,命名为 LoFI。LoFI 包含两个关键阶段。在第一阶段,LoFI 基于语义相似性进行粗粒度过滤,收集与故障相关的日志。在第二阶段,LoFI 利用预训练语言模型结合一种新颖的基于提示的调优方法,从收集的日志中提取细粒度的感兴趣信息。我们在从 Apache Spark 和 CloudA 的工业数据集中收集的日志上评估了 LoFI。实验结果表明,LoFI 显著优于所有基线方法,在 F1 分数上相对于最佳基线方法 ChatGPT 实现了 25.8~37.9 的绝对提升,这突显了 LoFI 在识别故障指示信息方面的有效性。此外,LoFI 在 CloudA 的成功部署和用户研究验证了我们方法的实用性。代码和数据可在以下链接获取:https URL。

[NLP-8] Generating Visual Stories with Grounded and Coreferent Characters
该论文试图解决视觉叙事方法在生成故事时缺乏具体角色描述和角色一致性的问题。解决方案的关键在于提出了以角色为中心的故事生成任务,并开发了一个新的模型,该模型能够在预测视觉故事时持续地生成具有一致性和连贯性的角色描述。为此,研究者构建了一个基于VIST基准的新数据集,并通过自动化流程增强了数据集中的视觉和文本角色共指链。此外,论文还提出了新的评估指标来衡量故事中角色的丰富性和共指性。实验结果表明,该模型生成的故事在角色一致性和共指性方面优于基线和最先进的系统。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13555
作者: Danyang Liu,Mirella Lapata,Frank Keller
关键词-EN: Characters, stories, create emotional connections, Abstract, character mentions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story’s themes. Visual storytelling methods focus more on the plot and events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce the new task of character-centric story generation and present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to larger extent compared to baselines and state-of-the-art systems.
摘要:角色在叙事中至关重要。它们推动情节发展,建立情感联系,并体现故事的主题。视觉叙事方法更侧重于情节及其相关事件,而不围绕特定角色构建叙事。因此,生成的故事显得通用,角色的提及要么缺失,要么模糊,甚至错误。为了缓解这些问题,我们引入了以角色为中心的故事生成这一新任务,并提出了首个能够预测视觉故事并始终保持角色提及一致性和共指性的模型。我们的模型在一个新数据集上进行了微调,该数据集基于广泛使用的 VIST 基准构建。具体而言,我们开发了一个自动化流程,以视觉和文本角色共指链丰富 VIST。我们还提出了新的评估指标,用于衡量故事中角色的丰富性和共指性。实验结果表明,与基线和最先进的系统相比,我们的模型生成的故事中,角色重复出现,且在更大程度上保持一致性和共指性。

[NLP-9] Contextualized Data-Wrangling Code Generation in Computational Notebooks
该论文旨在解决数据科学中数据整理(data wrangling)过程自动化的问题,特别是如何准确生成数据整理代码以减少分析师的工作负担。解决方案的关键在于构建一个高质量的数据集(CoCoNote),该数据集包含了58,221个具有清晰多模态上下文依赖的数据整理代码生成示例。通过自动化工具CoCoMine,论文提出了一种方法来挖掘和提取这些示例,并利用数据流分析和笔记本回放技术来确保上下文的完整性。此外,论文还提出了DataCoder模型,该模型分别编码数据上下文和代码文本上下文,以增强代码生成的准确性。实验结果表明,结合数据上下文对数据整理代码生成具有重要意义,并且所提出的模型在性能上表现出色。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13551
作者: Junjie Huang,Daya Guo,Chenglong Wang,Jiazhen Gu,Shuai Lu,Jeevana Priya Inala,Cong Yan,Jianfeng Gao,Nan Duan,Michael R. Lyu
关键词-EN: Code, Code generation, Data wrangling, Data, preparing raw data
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Databases (cs.DB)
备注: To appear at ASE 2024

点击查看摘要

Abstract:Data wrangling, the process of preparing raw data for further analysis in computational notebooks, is a crucial yet time-consuming step in data science. Code generation has the potential to automate the data wrangling process to reduce analysts’ overhead by translating user intents into executable code. Precisely generating data wrangling code necessitates a comprehensive consideration of the rich context present in notebooks, including textual context, code context and data context. However, notebooks often interleave multiple non-linear analysis tasks into linear sequence of code blocks, where the contextual dependencies are not clearly reflected. Directly training models with source code blocks fails to fully exploit the contexts for accurate wrangling code generation. To bridge the gap, we aim to construct a high quality datasets with clear and rich contexts to help training models for data wrangling code generation tasks. In this work, we first propose an automated approach, CoCoMine to mine data-wrangling code generation examples with clear multi-modal contextual dependency. It first adopts data flow analysis to identify the code blocks containing data wrangling codes. Then, CoCoMine extracts the contextualized datawrangling code examples through tracing and replaying notebooks. With CoCoMine, we construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks. To demonstrate the effectiveness of our dataset, we finetune a range of pretrained code models and prompt various large language models on our task. Furthermore, we also propose DataCoder, which encodes data context and codetextual contexts separately to enhance code generation. Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation and the effectiveness of our model. We release code and data at url… Comments: To appear at ASE 2024 Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Databases (cs.DB) Cite as: arXiv:2409.13551 [cs.SE] (or arXiv:2409.13551v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2409.13551 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3691620.3695503 Focus to learn more DOI(s) linking to related resources
摘要:数据整理 (Data wrangling) 是将原始数据准备为可在计算笔记本中进行进一步分析的过程,是数据科学中至关重要但耗时的步骤。代码生成 (Code generation) 有潜力通过将用户意图转化为可执行代码来自动化数据整理过程,从而减少分析师的开销。精确生成数据整理代码需要全面考虑笔记本中存在的丰富上下文,包括文本上下文、代码上下文和数据上下文。然而,笔记本通常将多个非线性分析任务交织成线性代码块序列,其中上下文依赖关系并不明显。直接使用源代码块训练模型无法充分利用上下文来准确生成整理代码。为了弥合这一差距,我们旨在构建一个高质量的数据集,该数据集具有清晰且丰富的上下文,以帮助训练模型进行数据整理代码生成任务。在这项工作中,我们首先提出了一种自动化方法 CoCoMine,用于挖掘具有清晰多模态上下文依赖关系的数据整理代码生成示例。它首先采用数据流分析来识别包含数据整理代码的代码块。然后,CoCoMine 通过追踪和重放笔记本提取上下文化的数据整理代码示例。通过 CoCoMine,我们构建了 CoCoNote 数据集,该数据集包含 58,221 个用于笔记本中上下文化数据整理代码生成的示例。为了展示我们数据集的有效性,我们对一系列预训练代码模型进行了微调,并在我们的任务上提示了各种大语言模型。此外,我们还提出了 DataCoder,它分别编码数据上下文和代码文本上下文,以增强代码生成。实验结果表明,在数据整理代码生成中结合数据上下文的重要性以及我们模型的有效性。我们将在 url… 发布代码和数据。

评论:将出现在 ASE 2024 上 主题:软件工程 (cs.SE); 计算与语言 (cs.CL); 数据库 (cs.DB) 引用为:arXiv:2409.13551 [cs.SE] (或 arXiv:2409.13551v1 [cs.SE] 用于此版本) https://doi.org/10.48550/arXiv.2409.13551 聚焦以了解更多 arXiv 发布的 DOI 通过 DataCite (待注册) 相关 DOI: https://doi.org/10.1145/3691620.3695503 聚焦以了解更多 DOI 链接到相关资源

[NLP-10] ShizishanGPT: An Agricultural Large Language Model Integrating Tools and Resources
该论文试图解决当前大型语言模型(LLMs)在处理农业等专业领域知识时的局限性问题。解决方案的关键在于提出了ShizishanGPT,这是一个基于检索增强生成(RAG)框架和代理架构的智能问答系统。ShizishanGPT通过五个关键模块的集成,包括通用GPT-4模块、搜索引擎模块、农业知识图谱模块、检索模块和农业代理模块,实现了对农业领域复杂问题的更准确和详细的回答。特别是通过RAG框架和农业知识图谱的结合,有效补充了LLMs在专业领域知识的不足,显著提升了问答系统的性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13537
作者: Shuting Yang,Zehui Liu,Wolfgang Mayer
关键词-EN: handle complex inquiries, Recent developments, intelligent dialogue systems’ability, complex inquiries, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages,3 figures, WISE2024

点击查看摘要

Abstract:Recent developments in large language models (LLMs) have led to significant improvements in intelligent dialogue systems’ability to handle complex inquiries. However, current LLMs still exhibit limitations in specialized domain knowledge, particularly in technical fields such as agriculture. To address this problem, we propose ShizishanGPT, an intelligent question answering system for agriculture based on the Retrieval Augmented Generation (RAG) framework and agent architecture. ShizishanGPT consists of five key modules: including a generic GPT-4 based module for answering general questions; a search engine module that compensates for the problem that the large language model’s own knowledge cannot be updated in a timely manner; an agricultural knowledge graph module for providing domain facts; a retrieval module which uses RAG to supplement domain knowledge; and an agricultural agent module, which invokes specialized models for crop phenotype prediction, gene expression analysis, and so on. We evaluated the ShizishanGPT using a dataset containing 100 agricultural questions specially designed for this study. The experimental results show that the tool significantly outperforms general LLMs as it provides more accurate and detailed answers due to its modular design and integration of different domain knowledge sources. Our source code, dataset, and model weights are publicly available at this https URL.
摘要:近年来,大语言模型 (LLM) 的发展显著提升了智能对话系统处理复杂查询的能力。然而,当前的 LLM 在专业领域知识方面仍存在局限性,特别是在农业等技术领域。为解决这一问题,我们提出了 ShizishanGPT,这是一个基于检索增强生成 (RAG) 框架和智能体架构的农业智能问答系统。ShizishanGPT 包含五个关键模块:包括一个基于 GPT-4 的通用模块,用于回答一般性问题;一个搜索引擎模块,弥补了大语言模型自身知识无法及时更新的问题;一个农业知识图谱模块,提供领域事实;一个检索模块,使用 RAG 补充领域知识;以及一个农业智能体模块,调用专用模型进行作物表型预测、基因表达分析等。我们使用包含 100 个专门为此研究设计的农业问题的数据集对 ShizishanGPT 进行了评估。实验结果表明,该工具由于其模块化设计和不同领域知识源的整合,显著优于一般 LLM,提供了更准确和详细的答案。我们的源代码、数据集和模型权重已在以下链接公开:https URL。

[NLP-11] EMMeTT: Efficient Multimodal Machine Translation Training ICASSP2025
该论文试图解决多模态基础语言模型的有效和高效训练问题,特别是在神经机器翻译(NMT)和自动语音翻译(AST)领域。解决方案的关键在于提出了一种名为EMMeTT的新型训练框架,该框架通过平衡语言、数据集和模态的采样,高效地迭代序列数据,并引入了一种新颖的2D分桶方案和批量大小优化器(OOMptimizer),从而提高了训练效率。实验结果表明,这种多模态训练方法在两种基础模型架构(GPT和T5)上均表现出色,不仅保留了原有的NMT能力,还在多个语言子集的FLORES和FLEURS基准测试中超越了AST基线。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13523
作者: Piotr Żelasko,Zhehuai Chen,Mengru Wang,Daniel Galvez,Oleksii Hrinchuk,Shuoyang Ding,Ke Hu,Jagadeesh Balam,Vitaly Lavrukhin,Boris Ginsburg
关键词-EN: models warrants discussion, rising interest, modality extension, warrants discussion, multimodal training approach
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 4 pages, submitted to ICASSP 2025

点击查看摘要

Abstract:A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST). We investigate two different foundation model architectures, decoder-only GPT and encoder-decoder T5, extended with Canary-1B’s speech encoder. To handle joint multimodal training, we propose a novel training framework called EMMeTT. EMMeTT improves training efficiency with the following: balanced sampling across languages, datasets, and modalities; efficient sequential data iteration; and a novel 2D bucketing scheme for multimodal data, complemented by a batch size optimizer (OOMptimizer). We show that a multimodal training consistently helps with both architectures. Moreover, SALM-T5 trained with EMMeTT retains the original NMT capability while outperforming AST baselines on four-language subsets of FLORES and FLEURS. The resultant Multimodal Translation Model produces strong text and speech translation results at the same time.
摘要:随着对基础语言模型模态扩展的兴趣日益增长,探讨最有效且高效的多模态训练方法显得尤为重要。本文聚焦于神经机器翻译 (NMT),并提出了一种联合多模态训练机制——语音大语言模型 (Speech-LLM),以纳入自动语音翻译 (AST)。我们研究了两种不同的基础模型架构:仅解码器的 GPT 和编码器-解码器的 T5,并结合了 Canary-1B 的语音编码器进行扩展。为处理联合多模态训练,我们提出了一种名为 EMMeTT 的新型训练框架。EMMeTT 通过以下方式提升训练效率:跨语言、数据集和模态的平衡采样;高效的顺序数据迭代;以及一种新颖的二维分桶方案用于多模态数据,并辅以批量大小优化器 (OOMptimizer)。实验结果表明,多模态训练对两种架构均有持续的促进作用。此外,经过 EMMeTT 训练的 SALM-T5 不仅保留了原有的 NMT 能力,还在 FLORES 和 FLEURS 的四语言子集上超越了 AST 基线。最终的多模态翻译模型同时实现了强大的文本和语音翻译效果。

[NLP-12] A Survey on Moral Foundation Theory and Pre-Trained Language Models: Current Advances and Challenges
该论文试图解决如何利用预训练语言模型(PLMs)从文本数据中提取和分析道德维度的问题。解决方案的关键在于结合道德基础理论(MFT)框架,通过分析PLMs中的道德倾向,并将其应用于MFT的背景下,从而实现对道德心理学的深入理解,并为创建具有道德意识的AI系统铺平道路。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13521
作者: Lorenzo Zangari,Candida M. Greco,Davide Picca,Andrea Tagarelli
关键词-EN: regulated societal order, Moral Foundation Theory, early civilizations, codified within norms, common good
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Moral values have deep roots in early civilizations, codified within norms and laws that regulated societal order and the common good. They play a crucial role in understanding the psychological basis of human behavior and cultural orientation. The Moral Foundation Theory (MFT) is a well-established framework that identifies the core moral foundations underlying the manner in which different cultures shape individual and social lives. Recent advancements in natural language processing, particularly Pre-trained Language Models (PLMs), have enabled the extraction and analysis of moral dimensions from textual data. This survey presents a comprehensive review of MFT-informed PLMs, providing an analysis of moral tendencies in PLMs and their application in the context of the MFT. We also review relevant datasets and lexicons and discuss trends, limitations, and future directions. By providing a structured overview of the intersection between PLMs and MFT, this work bridges moral psychology insights within the realm of PLMs, paving the way for further research and development in creating morally aware AI systems.
摘要:道德价值观深深植根于早期文明,被编码在规范和法律中,这些规范和法律调节着社会秩序和公共利益。它们在理解人类行为的心理基础和文化取向方面起着至关重要的作用。道德基础理论 (Moral Foundation Theory, MFT) 是一个成熟的框架,它识别了不同文化塑造个人和社会生活的核心道德基础。自然语言处理的最新进展,特别是预训练语言模型 (Pre-trained Language Models, PLMs),使得从文本数据中提取和分析道德维度成为可能。本调查报告对基于 MFT 的 PLMs 进行了全面回顾,分析了 PLMs 中的道德倾向及其在 MFT 背景下的应用。我们还回顾了相关数据集和词典,并讨论了趋势、局限性和未来方向。通过提供 PLMs 与 MFT 之间交叉点的结构化概述,这项工作在 PLMs 领域内架起了道德心理学的桥梁,为创建具有道德意识的 AI 系统进一步的研究和发展铺平了道路。

[NLP-13] LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR ICASSP2025
该论文试图解决自动语音识别(ASR)中识别特殊稀有词汇和词汇外词汇的挑战,以及快速领域适应的问题。解决方案的关键在于提出了一种轻量级的实时方法,通过结合命名实体偏置列表与基于Aho-Corasick字符串匹配算法的词级n-gram语言模型,采用浅融合技术来提升ASR性能。Aho-Corasick算法的高效性使得快速上下文适应成为可能,而n-gram语言模型则通过引入失败和输出弧的图结构,动态调整弧权重以适应n-gram概率,从而在保持整体性能的同时,增强对关键字和词汇外实体的识别能力。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13514
作者: Iuliia Thorbecke,Juan Zuluaga-Gomez,Esaú Villatoro-Tello,Andres Carofilis,Shashi Kumar,Petr Motlicek,Karthik Pandia,Aravind Ganapathiraju
关键词-EN: recognizing special rare, automatic speech recognition, language model, n-gram language model, recent success
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP2025

点击查看摘要

Abstract:Despite the recent success of end-to-end models for automatic speech recognition, recognizing special rare and out-of-vocabulary words, as well as fast domain adaptation with text, are still challenging. It often happens that biasing to the special entities leads to a degradation in the overall performance. We propose a light on-the-fly method to improve automatic speech recognition performance by combining a bias list of named entities with a word-level n-gram language model with the shallow fusion approach based on the Aho-Corasick string matching algorithm. The Aho-Corasick algorithm has proved to be more efficient than other methods and allows fast context adaptation. An n-gram language model is introduced as a graph with fail and output arcs, where the arc weights are adapted from the n-gram probabilities. The language model is used as an additional support to keyword biasing when the language model is combined with bias entities in a single context graph to take care of the overall performance. We demonstrate our findings on 4 languages, 2 public and 1 private datasets including performance on named entities and out-of-vocabulary entities. We achieve up to 21.6% relative improvement in the general word error rate with no practical difference in the inverse real-time factor.
摘要:尽管端到端模型在自动语音识别方面取得了近期成功,但识别特殊罕见词汇和词汇表外词汇,以及通过文本进行快速领域适应,仍然具有挑战性。通常情况下,偏向于特殊实体会导致整体性能下降。我们提出了一种轻量级的实时方法,通过结合命名实体的偏置列表与基于Aho-Corasick字符串匹配算法的浅层融合方法的词级n-gram语言模型,来提高自动语音识别性能。Aho-Corasick算法已被证明比其他方法更高效,并允许快速上下文适应。引入的n-gram语言模型被构建为一个带有失败和输出弧的图,其中弧权重从n-gram概率中调整。当语言模型与偏置实体结合在单一上下文图中时,语言模型作为对关键词偏置的额外支持,以照顾整体性能。我们在4种语言、2个公开数据集和1个私有数据集上展示了我们的发现,包括命名实体和词汇表外实体的性能。我们在通用词错误率上实现了高达21.6%的相对改进,而在逆实时因子方面没有实际差异。

[NLP-14] Sketching With Your Voice: “Non-Phonorealistic” Rendering of Sounds via Vocal Imitation SIGGRAPH
该论文试图解决自动生成人类语音模仿声音的问题,即通过声音而非视觉的方式进行“素描”。解决方案的关键在于结合感知显著的听觉特征匹配和基于认知理论的沟通策略,以更好地模拟人类对听众的直觉理解。通过这种方式,生成的语音模仿不仅在听觉特征上与目标声音匹配,还能更准确地反映人类在沟通中的直觉和策略,从而在实验和用户研究中表现出更高的准确性和人类直觉的一致性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13507
作者: Matthew Caren,Kartik Chandra,Joshua B. Tenenbaum,Jonathan Ragan-Kelley,Karima Ma
关键词-EN: automatically producing human-like, producing human-like vocal, human-like vocal imitations, visual representation, automatically producing
类目: Graphics (cs.GR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: SIGGRAPH Asia 2024

点击查看摘要

Abstract:We present a method for automatically producing human-like vocal imitations of sounds: the equivalent of “sketching,” but for auditory rather than visual representation. Starting with a simulated model of the human vocal tract, we first try generating vocal imitations by tuning the model’s control parameters to make the synthesized vocalization match the target sound in terms of perceptually-salient auditory features. Then, to better match human intuitions, we apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners. Finally, we show through several experiments and user studies that when we add this type of communicative reasoning to our method, it aligns with human intuitions better than matching auditory features alone does. This observation has broad implications for the study of depiction in computer graphics.
摘要:我们提出了一种自动生成类人声音模仿的方法:这是一种“素描”的等价物,但针对的是听觉而非视觉表现。我们从模拟人类声道的模型开始,首先尝试通过调整模型的控制参数来生成声音模仿,使合成的发声在感知上显著的听觉特征上与目标声音匹配。然后,为了更好地符合人类的直觉,我们应用了一种认知交流理论,考虑人类说话者如何战略性地推理其听众。最后,通过多项实验和用户研究,我们证明,当在我们的方法中加入这种交流推理时,它比仅匹配听觉特征更能与人类直觉相符。这一观察结果对计算机图形学中的描绘研究具有广泛的影响。

[NLP-15] HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation
该论文试图解决预训练语言模型在下游任务微调时参数规模过大导致微调不切实际的问题。解决方案的关键在于提出了直接更新变换(UT)范式,通过构建从原始参数到更新参数的直接变换,确保原始参数与更新参数之间的高相关性,并在此基础上提出了Hadamard更新变换(HUT)方法。HUT利用Hadamard变换和两个低秩矩阵高效更新原始权重矩阵,提供了一种更具表达力和灵活性的更新机制,能够在降低计算复杂度的同时保持或提升模型质量。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13501
作者: Geyuan Zhang,Xiaofei Zhou,Chuheng Chen
关键词-EN: Fine-tuning pre-trained language, pre-trained language models, achieved impressive results, Parameter Efficient Fine-Tuning, pre-trained language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP. However, fine-tuning all parameters becomes impractical due to the rapidly increasing size of model parameters. To address this, Parameter Efficient Fine-Tuning (PEFT) methods update only a subset of parameters. Most PEFT methods, such as LoRA, use incremental updates, which involve adding learned weight matrix increments to the original parameters. Although effective, these methods face limitations in capturing complex parameter dynamics and do not maintain a strong correlation between the original and updated parameters. To overcome these challenges, we propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters. This approach ensures that the correlation between the original and updated parameters is preserved, leveraging the semantic features learned during pre-training. Building on this paradigm, we present the Hadamard Updated Transformation (HUT) method. HUT efficiently updates the original weight matrix using the Hadamard transformation with two low-rank matrices, offering a more expressive and flexible update mechanism. This allows HUT to capture richer parameter features through functional transformations, reducing computational complexity while maintaining or improving model quality. Theoretical analysis and extensive experiments on RoBERTa and GPT-2 validate the effectiveness of HUT. Results show that HUT performs on par with or better than other PEFT methods in terms of model quality, while significantly reducing computational complexity.
摘要:针对下游任务微调预训练语言模型在自然语言处理 (NLP) 领域取得了显著成果。然而,由于模型参数规模的迅速增长,微调所有参数变得不切实际。为此,参数高效微调 (Parameter Efficient Fine-Tuning, PEFT) 方法仅更新参数的子集。大多数 PEFT 方法,如 LoRA,采用增量更新,即通过添加学习到的权重矩阵增量到原始参数中。尽管这些方法有效,但它们在捕捉复杂参数动态方面存在局限性,并且无法保持原始参数与更新参数之间的高度相关性。为解决这些问题,我们提出了直接更新变换 (Updated Transformation, UT) 范式,该范式直接从原始参数构建到更新参数的变换。这种方法确保了原始参数与更新参数之间的相关性得以保留,充分利用了预训练过程中学习到的语义特征。在此范式基础上,我们提出了哈达玛更新变换 (Hadamard Updated Transformation, HUT) 方法。HUT 利用哈达玛变换和两个低秩矩阵高效更新原始权重矩阵,提供了一种更具表现力和灵活性的更新机制。这使得 HUT 能够通过函数变换捕捉更丰富的参数特征,同时降低计算复杂度,并保持或提升模型质量。理论分析和在 RoBERTa 和 GPT-2 上的广泛实验验证了 HUT 的有效性。结果表明,HUT 在模型质量方面与其它 PEFT 方法相当或更优,同时显著降低了计算复杂度。

[NLP-16] Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper EMNLP
该论文试图解决在几乎没有监督数据的情况下训练自动语音识别(ASR)模型的问题。解决方案的关键在于利用伪标签(PL)语音数据,通过基础语音模型(FSM)生成伪标签,并在消费级GPU上从头开始训练流式Transformer-Transducer(TT)模型。这种方法避免了传统两阶段训练(预训练和微调)所需的庞大数据和计算资源,实现了单阶段训练的鲁棒ASR模型。论文通过全面的消融实验,验证了伪标签语音数据在不同方面的影响,包括n-gram语言模型的浅融合、命名实体的上下文偏置、低延迟流式应用的分块解码以及FSM规模对TT模型性能的影响。实验结果表明,即使伪标签数据非常嘈杂,TT模型也可以在没有监督数据的情况下从头开始训练。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13499
作者: Iuliia Thorbecke,Juan Zuluaga-Gomez,Esaú Villatoro-Tello,Shashi Kumar,Pradeep Rangappa,Sergio Burdisso,Petr Motlicek,Karthik Pandia,Aravind Ganapathiraju
关键词-EN: automatic speech recognition, open question, remains an open, supervised data remains, speech recognition
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to EMNLP Findings 2024

点击查看摘要

Abstract:The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model just in one stage and does not require large data and computational budget compared to the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) TT overall performance as the function of the FSM size. Our results demonstrate that TT can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.
摘要:在几乎没有监督数据的情况下训练自动语音识别 (ASR) 仍然是一个开放的问题。在这项工作中,我们展示了流式 Transformer-Transducer (TT) 模型可以在消费级和可访问的 GPU 上从头开始完全训练,使用基础语音模型 (FSM) 生成的伪标签 (PL) 语音数据。这使得只需一个阶段即可训练出稳健的 ASR 模型,并且与需要预训练和微调的两步场景相比,不需要大量的数据和计算预算。我们对基于 PL 的流式 TT 模型的不同方面进行了全面的消融实验,包括:(1) n-gram 语言模型的浅融合影响,(2) 命名实体的上下文偏置,(3) 低延迟流式应用的分块解码,以及 (4) TT 整体性能作为 FSM 大小的函数。我们的结果表明,TT 可以在没有监督数据的情况下从头开始训练,即使在伪标签非常嘈杂的情况下也是如此。我们在 CommonVoice 的 6 种语言上验证了所提出的框架,并提出了多种启发式方法来过滤掉幻觉生成的伪标签。

[NLP-17] Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models PRICAI2024
该论文试图解决大型语言模型(LLMs)在理论心灵(Theory-of-Mind, ToM)能力上的局限性问题,特别是在复杂ToM推理任务和非叙事性上下文中的表现不佳。解决方案的关键在于提出了一种名为Constrained Chain-of-ToM (CCoToM)的零样本提示方法,该方法通过利用领域知识和ToM维度之间的因果关系,引导LLMs构建显式的推理链。具体步骤包括首先提示LLMs推断相关的ToM维度(如信念),然后基于生成的相关ToM维度和相应的因果关系推断查询的ToM维度。此外,CCoToM通过自适应地施加约束来引入归纳偏置,以提高ToM维度之间的一致性。该方法不仅适用于叙事性上下文,还能处理非叙事性上下文如对话。实验结果表明,CCoToM在所有使用的LLMs和数据集上均显著优于先前的最先进方法。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13490
作者: Zizheng Lin,Chunkit Chan,Yangqiu Song,Xin Liu
关键词-EN: Large Language Models, Language Models, Large Language, ability possessed, ToM
类目: Computation and Language (cs.CL)
备注: Accepted by PRICAI 2024

点击查看摘要

Abstract:Theory-of-Mind (ToM) ability possessed by Large Language Models (LLMs) has been shown to be limited. Most existing methods for improving ToM in LLMs adopt zero-shot prompting, and they face challenges including poor performance in complex ToM reasoning tasks and an inability to handle non-narrative contexts. We propose a zero-shot prompting method named Constrained Chain-of-ToM (CCoToM) that leverages domain knowledge and the causal relations between ToM dimensions to address these limitations. Specifically, CCoToM guides LLMs to construct explicit reasoning chains by first prompting LLMs to infer related ToM dimensions (e.g., belief). Afterward, CCoToM prompts LLMs to infer the queried ToM dimension based on the generated related ToM dimensions and corresponding causal relations. Additionally, CCoToM adaptively imposes constraints on prompts to introduce inductive biases and improve consistency between ToM dimensions. Besides narratives, CCoToM can also handle non-narrative contexts like conversations. Extensive experiments show that CCoToM consistently outperforms previous state-of-the-art methods by large margins across all LLMs and datasets used. We also conduct in-depth analyses to gain deeper insights into CCoToM. We have made our code publicly available.
摘要:大语言模型 (LLM) 所具备的心智理论 (Theory-of-Mind, ToM) 能力已被证明存在局限性。大多数现有的提升 LLM 中 ToM 能力的方法采用零样本提示 (zero-shot prompting),这些方法在复杂的 ToM 推理任务中表现不佳,且无法处理非叙事性上下文。我们提出了一种名为约束链式心智理论 (Constrained Chain-of-ToM, CCoToM) 的零样本提示方法,该方法利用领域知识和 ToM 维度间的因果关系来解决这些局限。具体而言,CCoToM 通过首先提示 LLM 推断相关 ToM 维度(例如,信念),然后基于生成的相关 ToM 维度和相应的因果关系推断查询的 ToM 维度,从而引导 LLM 构建显式的推理链。此外,CCoToM 自适应地在提示中施加约束,以引入归纳偏置并提高 ToM 维度间的一致性。除了叙事性内容,CCoToM 还能处理对话等非叙事性上下文。广泛的实验表明,在所有使用的 LLM 和数据集上,CCoToM 均大幅超越了之前的最先进方法。我们还进行了深入的分析,以更深入地理解 CCoToM。我们的代码已公开发布。

[NLP-18] Since Lawyers are Males…: Examining Implicit Gender Bias in Hindi Language Generation by LLMs
该论文试图解决大型语言模型(LLMs)在生成文本时存在的性别偏见问题,特别是在英语以外的语言(如印地语)中性别偏见的显著性。解决方案的关键在于开发了受WinoBias启发的印地语数据集,用于检测和比较GPT-4o和Claude-3等模型在生成印地语和英语文本时的性别偏见模式。研究发现,印地语文本生成中的性别偏见高达87.8%,远高于英语的33.4%,这表明不同语言间的性别偏见存在显著差异,并强调了在生成式AI系统中应对这些偏见的重要性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13484
作者: Ishika Joshi,Ishita Gupta,Adrita Dey,Tapan Parikh
关键词-EN: Large Language Models, Large Language, customer support, Hindi, gender biases
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being used to generate text across various languages, for tasks such as translation, customer support, and education. Despite these advancements, LLMs show notable gender biases in English, which become even more pronounced when generating content in relatively underrepresented languages like Hindi. This study explores implicit gender biases in Hindi text generation and compares them to those in English. We developed Hindi datasets inspired by WinoBias to examine stereotypical patterns in responses from models like GPT-4o and Claude-3 sonnet. Our results reveal a significant gender bias of 87.8% in Hindi, compared to 33.4% in English GPT-4o generation, with Hindi responses frequently relying on gender stereotypes related to occupations, power hierarchies, and social class. This research underscores the variation in gender biases across languages and provides considerations for navigating these biases in generative AI systems.
摘要:大语言模型 (LLMs) 正越来越多地被用于生成多种语言的文本,应用于翻译、客户支持、教育等任务。尽管取得了这些进展,LLMs 在英语中表现出显著的性别偏见,当生成像印地语这样相对较少使用的语言内容时,这种偏见变得更加明显。本研究探讨了印地语文本生成中的隐性性别偏见,并将其与英语中的偏见进行比较。我们开发了受 WinoBias 启发的印地语数据集,以检查 GPT-4o 和 Claude-3 sonnet 等模型在响应中的刻板模式。我们的结果显示,印地语中的性别偏见高达 87.8%,而英语 GPT-4o 生成中的偏见为 33.4%,印地语响应频繁依赖于与职业、权力等级和社会阶层相关的性别刻板印象。这项研究强调了不同语言间性别偏见的差异,并为在生成式 AI 系统中应对这些偏见提供了考量。

[NLP-19] A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering
该论文试图解决语音驱动的开放领域问答系统中,依赖自动语音识别(ASR)模型进行问题转录后再进行文本检索的局限性问题。解决方案的关键在于提出了一种无需ASR的端到端多模态密集检索器,该检索器能够直接处理语音问题,从而避免了ASR模型在低资源语言和特定领域中数据不足的问题,并减少了ASR错误对检索性能的影响。实验结果表明,在处理较短问题时,该检索器在ASR可能出现重要词汇误识别或高词错误率的情况下,表现出更好的检索性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13483
作者: Georgios Sidiropoulos,Evangelos Kanoulas
关键词-EN: Speech-based open-domain, open-domain question answering, Speech-based open-domain question, large corpus, increasing number
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Speech-based open-domain question answering (QA over a large corpus of text passages with spoken questions) has emerged as an important task due to the increasing number of users interacting with QA systems via speech interfaces. Passage retrieval is a key task in speech-based open-domain QA. So far, previous works adopted pipelines consisting of an automatic speech recognition (ASR) model that transcribes the spoken question before feeding it to a dense text retriever. Such pipelines have several limitations. The need for an ASR model limits the applicability to low-resource languages and specialized domains with no annotated speech data. Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. Our experimental results showed that, on shorter questions, our retriever is a promising alternative to the \textitASR and Retriever pipeline, achieving better retrieval performance in cases where ASR would have mistranscribed important words in the question or have produced a transcription with a high word error rate.
摘要:基于语音的开放领域问答系统(QA over a large corpus of text passages with spoken questions)由于越来越多的用户通过语音接口与问答系统互动,已成为一项重要任务。段落检索是基于语音的开放领域问答中的关键任务。迄今为止,以往的研究采用了由自动语音识别(ASR)模型组成的流水线,该模型在将语音问题输入密集文本检索器之前进行转录。这种流水线存在若干局限性。ASR模型的需求限制了其在低资源语言和无标注语音数据的专门领域中的适用性。此外,ASR模型将其错误传播到检索器中。在本研究中,我们试图通过提出一种无需ASR、端到端训练的多模态密集检索器来缓解这些局限性,该检索器能够直接处理语音问题。我们的实验结果表明,在较短的问题上,我们的检索器是ASR和检索器流水线的一个有前景的替代方案,在ASR可能错误转录问题中的重要词汇或产生高字错误率的转录的情况下,实现了更好的检索性能。

[NLP-20] Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models
该论文试图解决大型语言模型(LLMs)在执行数据遗忘(machine unlearning)时,仅依赖负面反馈抑制特定训练数据(forget set)的影响,导致模型输出不连贯或不一致,降低模型效用并可能引发隐私风险的问题。解决方案的关键在于提出了一种名为Alternate Preference Optimization (AltPO)的新方法,该方法结合了负面反馈与遗忘集上的领域内正面反馈,以更有效地消除特定数据的影响,同时避免模型行为的负面变化,并保持整体模型性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13474
作者: Anmol Mekala,Vineeth Dorna,Shreya Dubey,Abhishek Lalwani,David Koleczek,Mukund Rungta,Sadid Hasan,Elita Lobo
关键词-EN: Machine unlearning aims, specific training data, forget set, Large Language Models, Machine unlearning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine unlearning aims to efficiently eliminate the influence of specific training data, known as the forget set, from the model. However, existing unlearning methods for Large Language Models (LLMs) face a critical challenge: they rely solely on negative feedback to suppress responses related to the forget set, which often results in nonsensical or inconsistent outputs, diminishing model utility and posing potential privacy risks. To address this limitation, we propose a novel approach called Alternate Preference Optimization (AltPO), which combines negative feedback with in-domain positive feedback on the forget set. Additionally, we introduce new evaluation metrics to assess the quality of responses related to the forget set. Extensive experiments show that our approach not only enables effective unlearning but also avoids undesirable model behaviors while maintaining overall model performance.
摘要:机器遗忘旨在高效地消除特定训练数据(称为遗忘集)对模型的影响。然而,现有的大语言模型 (LLM) 遗忘方法面临一个关键挑战:它们仅依赖负面反馈来抑制与遗忘集相关的响应,这往往导致输出无意义或不一致,降低了模型效用并带来潜在的隐私风险。为解决这一局限,我们提出了一种名为交替偏好优化 (Alternate Preference Optimization, AltPO) 的新方法,该方法结合了针对遗忘集的负面反馈和领域内正面反馈。此外,我们引入了新的评估指标来评估与遗忘集相关的响应质量。大量实验表明,我们的方法不仅实现了有效的遗忘,还避免了不理想的模型行为,同时保持了整体模型性能。

[NLP-21] Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts
该论文试图解决非AI专家在利用大型语言模型(LLMs)时面临的提示工程难题,即如何设计高质量的提示以提升LLMs的性能。解决方案的关键在于提出了LangGPT框架,这是一个结构化的提示设计框架,借鉴了可重用编程语言的结构化思想,使得提示设计更加系统化和易于迭代更新。此外,论文还引入了Minstrel系统,这是一个具有反思能力的多生成代理系统,能够自动化生成结构化提示,从而显著提升LLMs的性能,并通过用户调查验证了结构化提示的易用性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13449
作者: Ming Wang,Yuanzhong Liu,Xiaoyu Liang,Yijie Huang,Daling Wang,Xiaocui Yang,Sijia Shen,Shi Feng,Xiaoming Zhang,Chaofeng Guan,Yifei Zhang
关键词-EN: demonstrated commendable performance, diverse domains, demonstrated commendable, non-AI experts, structural
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2402.16929

点击查看摘要

Abstract:LLMs have demonstrated commendable performance across diverse domains. Nevertheless, formulating high-quality prompts to assist them in their work poses a challenge for non-AI experts. Existing research in prompt engineering suggests somewhat scattered optimization principles and designs empirically dependent prompt optimizers. Unfortunately, these endeavors lack a structural design, incurring high learning costs and it is not conducive to the iterative updating of prompts, especially for non-AI experts. Inspired by structured reusable programming languages, we propose LangGPT, a structural prompt design framework. Furthermore, we introduce Minstrel, a multi-generative agent system with reflection to automate the generation of structural prompts. Experiments and the case study illustrate that structural prompts generated by Minstrel or written manually significantly enhance the performance of LLMs. Furthermore, we analyze the ease of use of structural prompts through a user survey in our online community.
摘要:大语言模型 (LLM) 在多个领域展示了卓越的表现。然而,对于非 AI 专家而言,制定高质量的提示以辅助其工作仍是一个挑战。现有的提示工程研究提出了一些零散的优化原则和依赖经验设计的提示优化器。遗憾的是,这些努力缺乏结构化设计,导致学习成本高昂,且不利于提示的迭代更新,尤其是对非 AI 专家而言。受结构化可重用编程语言的启发,我们提出了 LangGPT,一个结构化提示设计框架。此外,我们引入了 Minstrel,一个具有反思功能的多生成式 AI 智能体系统,以自动化生成结构化提示。实验和案例研究表明,由 Minstrel 生成或手动编写的结构化提示显著提升了大语言模型的性能。此外,我们通过在线社区的用户调查分析了结构化提示的易用性。

[NLP-22] AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit
该论文试图解决的问题是如何根据不同类型的问题动态选择最合适的问答策略,以提高问答系统的效率和效果。解决方案的关键在于将适应性问答策略的选择建模为一个上下文多臂赌博机问题,并通过训练线性上置信界模型来学习不同问题类型与其最优多语言模型通信图配置之间的映射关系。这种方法能够在复杂策略表现优越时采用它们,而在简单策略足够时避免其成本,从而实现高效的模块化问答系统自适应协调。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13447
作者: Mohanna Hoveyda,Arjen P. de Vries,Harrie Oosterhuis,Maarten de Rijke,Faegheh Hasibi
关键词-EN: effectively addressed, question answering, answering strategies, answering, question
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In question answering (QA), different questions can be effectively addressed with different answering strategies. Some require a simple lookup, while others need complex, multi-step reasoning to be answered adequately. This observation motivates the development of a dynamic method that adaptively selects the most suitable QA strategy for each question, enabling more efficient and effective systems capable of addressing a broader range of question types. To this aim, we build on recent advances in the orchestration of multiple large language models (LLMs) and formulate adaptive QA as a dynamic orchestration challenge. We define this as a contextual multi-armed bandit problem, where the context is defined by the characteristics of the incoming question and the action space consists of potential communication graph configurations among the LLM agents. We then train a linear upper confidence bound model to learn an optimal mapping between different question types and their corresponding optimal multi-LLM communication graph representation. Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules, as it combines the superior performance of more complex strategies while avoiding their costs when simpler strategies suffice.%
摘要:在问答 (QA) 系统中,不同的问题可以通过不同的回答策略得到有效解决。有些问题只需简单的查找即可回答,而另一些问题则需要复杂的、多步骤的推理才能充分解答。这一观察结果促使我们开发一种动态方法,能够自适应地为每个问题选择最合适的 QA 策略,从而构建出更高效、更有效的系统,能够处理更广泛的问题类型。为此,我们借鉴了近期在多个大语言模型 (LLM) 协同调度方面的进展,并将自适应 QA 问题形式化为一个动态调度挑战。我们将此定义为一个上下文多臂赌博机问题,其中上下文由传入问题的特征定义,动作空间由 LLM 智能体之间潜在的通信图配置组成。随后,我们训练了一个线性上置信界模型,以学习不同问题类型与其对应的最优多 LLM 通信图表示之间的最优映射。我们的实验表明,所提出的解决方案对于具有多个模块的 QA 系统的自适应调度是可行的,因为它在结合了更复杂策略的优越性能的同时,避免了在简单策略足够时产生的成本。

[NLP-23] Selective Exploration and Information Gathering in Search and Rescue Using Hierarchical Learning Guided by Natural Language Input
该论文试图解决在搜索与救援(SAR)操作中,传统机器人系统无法有效利用人类提供的地面真实信息来优化搜索策略的问题。解决方案的关键在于结合大型语言模型(LLMs)进行社交互动,并将这些信息整合到分层强化学习(HRL)框架中,从而使机器人能够根据人类输入调整搜索策略,提高学习效率和决策能力,特别是在长周期和稀疏奖励的环境中。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13445
作者: Dimitrios Panagopoulos,Adoldo Perrusquia,Weisi Guo
关键词-EN: recent years, daily lives, offering solutions, increasingly integral, solutions to complex
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: Pre-print version of the accepted paper to appear in IEEE International Conference on Systems, Man and Cybernetics (SMC) 2024

点击查看摘要

Abstract:In recent years, robots and autonomous systems have become increasingly integral to our daily lives, offering solutions to complex problems across various domains. Their application in search and rescue (SAR) operations, however, presents unique challenges. Comprehensively exploring the disaster-stricken area is often infeasible due to the vastness of the terrain, transformed environment, and the time constraints involved. Traditional robotic systems typically operate on predefined search patterns and lack the ability to incorporate and exploit ground truths provided by human stakeholders, which can be the key to speeding up the learning process and enhancing triage. Addressing this gap, we introduce a system that integrates social interaction via large language models (LLMs) with a hierarchical reinforcement learning (HRL) framework. The proposed system is designed to translate verbal inputs from human stakeholders into actionable RL insights and adjust its search strategy. By leveraging human-provided information through LLMs and structuring task execution through HRL, our approach not only bridges the gap between autonomous capabilities and human intelligence but also significantly improves the agent’s learning efficiency and decision-making process in environments characterised by long horizons and sparse rewards.
摘要:近年来,机器人和自主系统在我们的日常生活中变得越来越重要,为各个领域的复杂问题提供了解决方案。然而,它们在搜索与救援 (SAR) 操作中的应用带来了独特的挑战。由于地形广阔、环境变化以及时间限制,全面探索受灾区域通常是不可行的。传统的机器人系统通常按照预定义的搜索模式运行,缺乏整合和利用人类利益相关者提供的地面真实数据的能力,而这种数据可能是加速学习过程和提高分类效率的关键。针对这一差距,我们引入了一个系统,该系统通过大语言模型 (LLM) 与分层强化学习 (HRL) 框架相结合,实现了社会互动。所提出的系统旨在将人类利益相关者的口头输入转化为可操作的强化学习洞察,并调整其搜索策略。通过利用 LLM 获取人类提供的信息,并通过 HRL 结构化任务执行,我们的方法不仅弥合了自主能力与人类智能之间的差距,还显著提高了智能体在长周期和稀疏奖励环境中学习效率和决策过程。

[NLP-24] Contextual Compression in Retrieval-Augmented Generation for Large Language Models : A Survey
该论文试图解决大型语言模型(LLMs)在幻觉、知识过时、推理不透明等方面的局限性问题。解决方案的关键在于引入检索增强生成(RAG)技术,通过利用外部数据库来提升生成内容的连贯性和一致性,特别是在复杂、知识密集型任务中。RAG通过结合LLMs的内在知识和外部数据库的动态信息,实现了协同效应,但也面临上下文窗口有限、无关信息干扰和高处理开销等挑战。论文进一步探讨了上下文压缩范式的演进,并提出了未来研究方向,以推动该领域的进一步发展。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13385
作者: Sourav Verma
关键词-EN: Large Language Models, showcase remarkable abilities, Large Language, Language Models, showcase remarkable
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Ongoing Work

点击查看摘要

Abstract:Large Language Models (LLMs) showcase remarkable abilities, yet they struggle with limitations such as hallucinations, outdated knowledge, opacity, and inexplicable reasoning. To address these challenges, Retrieval-Augmented Generation (RAG) has proven to be a viable solution, leveraging external databases to improve the consistency and coherence of generated content, especially valuable for complex, knowledge-rich tasks, and facilitates continuous improvement by leveraging domain-specific insights. By combining the intrinsic knowledge of LLMs with the vast, dynamic repositories of external databases, RAG achieves a synergistic effect. However, RAG is not without its limitations, including a limited context window, irrelevant information, and the high processing overhead for extensive contextual data. In this comprehensive work, we explore the evolution of Contextual Compression paradigms, providing an in-depth examination of the field. Finally, we outline the current challenges and suggest potential research and development directions, paving the way for future advancements in this area.
摘要:大语言模型 (LLMs) 展示了显著的能力,但它们也面临着诸如幻觉、过时知识、不透明性和不可解释的推理等局限性。为了应对这些挑战,检索增强生成 (RAG) 已被证明是一种可行的解决方案,它利用外部数据库来提高生成内容的一致性和连贯性,尤其对于复杂的、知识密集型任务尤为宝贵,并通过利用领域特定的洞察力促进持续改进。通过结合 LLMs 的内在知识和外部数据库的庞大、动态资源,RAG 实现了协同效应。然而,RAG 并非没有局限性,包括有限的上下文窗口、无关信息以及处理大量上下文数据的高处理开销。在这项全面的工作中,我们探讨了上下文压缩范式的演变,对该领域进行了深入的考察。最后,我们概述了当前的挑战,并提出了潜在的研究和开发方向,为该领域的未来进展铺平了道路。

[NLP-25] LLMs Still Cant Plan; Can LRMs? A Preliminary Evaluation of OpenAIs o1 on PlanBench
该论文试图解决的问题是评估当前大型语言模型(LLMs)和新型的“大推理模型”(LRMs)在规划能力方面的表现,特别是通过PlanBench基准测试。解决方案的关键在于开发和训练一种新型模型o1(Strawberry),该模型被设计为超越传统自回归LLMs的局限性,从而在PlanBench上展现出显著的性能提升。尽管o1在基准测试中表现出色,但仍未达到饱和状态,这引发了对模型准确性、效率和部署前需考虑的保障措施的深入探讨。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13373
作者: Karthik Valmeekam,Kaya Stechly,Subbarao Kambhampati
关键词-EN: ability to plan, action that achieves, achieves a desired, desired state, state of affairs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.
摘要:规划行动以达到期望状态的能力长期以来被认为是智能体(AI Agent)的核心能力,并且自人工智能(AI)研究初期以来一直是其重要组成部分。随着大语言模型(LLM)的出现,人们对其是否具备这种规划能力产生了浓厚的兴趣。PlanBench 是我们于 2022 年 GPT3 发布后不久开发的一个可扩展基准测试,至今仍是评估 LLM 规划能力的重要工具。尽管自 GPT3 以来涌现了大量新的私有和开源 LLM,但在此基准上的进展却出乎意料地缓慢。OpenAI 声称,他们最近推出的 o1(Strawberry)模型专门构建和训练,以突破自回归 LLM 的常规限制,从而成为一种新型模型:大推理模型(LRM)。以此发展为契机,本文全面审视了当前 LLM 和新 LRM 在 PlanBench 上的表现。正如我们将看到的,尽管 o1 在基准测试中的表现有了质的提升,超越了竞争对手,但仍远未达到饱和状态。这一改进也引发了关于准确性、效率和保障措施的问题,这些问题在部署此类系统之前必须予以考虑。

[NLP-26] EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models ACL2024
该论文试图解决现有大型语言模型(LLMs)在情感智能评估方面的不足,特别是针对情感识别任务的局限性。解决方案的关键在于提出了一个名为EmotionQueen的新框架,该框架通过四个独特的任务(关键事件识别、混合事件识别、隐含情感识别和意图识别)来全面评估LLMs的情感智能。此外,论文还设计了两个评估指标,用于衡量LLMs在情感相关陈述中的识别和响应能力,从而揭示了LLMs在情感智能方面的能力和局限性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13359
作者: Yuyan Chen,Hao Wang,Songzhou Yan,Sijia Liu,Yueze Li,Yi Zhao,Yanghua Xiao
关键词-EN: Natural Language Processing, large language models, Language Processing, Natural Language, importance in Natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:Emotional intelligence in large language models (LLMs) is of great importance in Natural Language Processing. However, the previous research mainly focus on basic sentiment analysis tasks, such as emotion recognition, which is not enough to evaluate LLMs’ overall emotional intelligence. Therefore, this paper presents a novel framework named EmotionQueen for evaluating the emotional intelligence of LLMs. The framework includes four distinctive tasks: Key Event Recognition, Mixed Event Recognition, Implicit Emotional Recognition, and Intention Recognition. LLMs are requested to recognize important event or implicit emotions and generate empathetic response. We also design two metrics to evaluate LLMs’ capabilities in recognition and response for emotion-related statements. Experiments yield significant conclusions about LLMs’ capabilities and limitations in emotion intelligence.
摘要:大语言模型 (LLM) 中的情感智能在自然语言处理中具有重要意义。然而,以往的研究主要集中在基本的情感分析任务上,如情感识别,这不足以评估 LLM 的整体情感智能。因此,本文提出了一种名为 EmotionQueen 的新框架,用于评估 LLM 的情感智能。该框架包括四个独特的任务:关键事件识别、混合事件识别、隐性情感识别和意图识别。LLM 被要求识别重要事件或隐性情感,并生成同理心回应。我们还设计了两个指标来评估 LLM 在情感相关陈述中的识别和回应能力。实验得出了关于 LLM 在情感智能方面的能力和局限性的重要结论。

[NLP-27] Recent Advancement of Emotion Cognition in Large Language Models
该论文试图解决大型语言模型(LLMs)在情感认知方面的关键问题,特别是在情感分类、情感丰富的响应生成和心智理论评估等应用中的性能提升。解决方案的关键在于深入研究现有的研究进展,包括情感认知的方法、成果和资源,并将其与Ulric Neisser的认知阶段理论相结合。论文还提出了未来研究方向,如无监督学习方法和开发更复杂、可解释的情感认知LLMs,以及采用对比学习等先进方法来提升LLMs的情感认知能力。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13354
作者: Yuyan Chen,Yanghua Xiao
关键词-EN: large language models, mental health assessment, human-computer interaction, language models, social media
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emotion cognition in large language models (LLMs) is crucial for enhancing performance across various applications, such as social media, human-computer interaction, and mental health assessment. We explore the current landscape of research, which primarily revolves around emotion classification, emotionally rich response generation, and Theory of Mind assessments, while acknowledge the challenges like dependency on annotated data and complexity in emotion processing. In this paper, we present a detailed survey of recent progress in LLMs for emotion cognition. We explore key research studies, methodologies, outcomes, and resources, aligning them with Ulric Neisser’s cognitive stages. Additionally, we outline potential future directions for research in this evolving field, including unsupervised learning approaches and the development of more complex and interpretable emotion cognition LLMs. We also discuss advanced methods such as contrastive learning used to improve LLMs’ emotion cognition capabilities.
摘要:大语言模型 (LLM) 中的情感认知对于提升社交媒体、人机交互和心理健康评估等应用的性能至关重要。我们探讨了当前研究领域,主要围绕情感分类、情感丰富的响应生成和心智理论评估,同时承认了依赖标注数据和情感处理复杂性等挑战。本文详细综述了近期在 LLM 情感认知方面的进展。我们探讨了关键研究、方法、成果和资源,并将其与 Ulric Neisser 的认知阶段相联系。此外,我们概述了该领域未来研究的可能方向,包括无监督学习方法和开发更复杂且可解释的情感认知 LLM。我们还讨论了如对比学习等高级方法,以提升 LLM 的情感认知能力。

[NLP-28] me Awareness in Large Language Models : Benchmarking Fact Recall Across Time
该论文试图解决大型语言模型(LLMs)在处理时间敏感事实时表现不足的问题。解决方案的关键在于引入了一个新的数据集,该数据集旨在严格测试LLMs在不同时间背景下对事实的准确性。通过这一基准,论文提供了一种系统的方法来衡量LLMs的知识与正确时间上下文的对齐程度,填补了当前评估方法中的关键空白,并为未来模型在实际应用中的改进提供了有价值的工具。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13338
作者: David Herel,Vojtech Bartek,Tomas Mikolov
关键词-EN: President, Abstract, question is asked, models, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. In this paper, we introduce a novel dataset designed to rigorously test LLMs’ ability to handle time-sensitive facts. Our benchmark offers a systematic way to measure how well LLMs align their knowledge with the correct time context, filling a key gap in current evaluation methods and offering a valuable tool for improving real-world applicability in future models.
摘要:谁是美国总统?这个答案会根据提问的时间而变化。尽管大语言模型 (LLMs) 在各种推理任务中得到了评估,但它们往往忽视了一个关键维度:时间。在现实场景中,答案的正确性常常与时间上下文紧密相关。本文中,我们引入了一个新的数据集,旨在严格测试 LLMs 处理时间敏感事实的能力。我们的基准提供了一种系统的方法来衡量 LLMs 的知识与正确时间上下文的对齐程度,填补了当前评估方法中的一个关键空白,并为未来模型在实际应用中的改进提供了宝贵的工具。

[NLP-29] Beyond the binary: Limitations and possibilities of gender-related speech technology research
该论文旨在解决语音与性别或性别研究领域中术语使用不明确和与社会科学主流观点脱节的问题。论文指出,当前研究中对“性别”一词的使用往往模糊不清,且未能反映性别是社会建构且具有多样性的观点,这可能导致边缘化群体面临的问题被忽视。解决方案的关键在于研究人员应自我反思并提出相关问题,以确保研究术语的准确性和与社会科学共识的一致性,从而更好地理解和解决语音与性别相关的问题。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13335
作者: Ariadna Sanchez,Alice Ross,Nina Markl
关键词-EN: ISCA Interspeech publications, research papers relating, ISCA Interspeech, Interspeech publications, research papers
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Spoken Language Technology (SLT) Workshop 2024

点击查看摘要

Abstract:This paper presents a review of 107 research papers relating to speech and sex or gender in ISCA Interspeech publications between 2013 and 2023. We note the scarcity of work on this topic and find that terminology, particularly the word \textitgender, is used in ways that are underspecified and often out of step with the prevailing view in social sciences that gender is socially constructed and is a spectrum as opposed to a binary category. We draw attention to the potential problems that this can cause for already marginalised groups, and suggest some questions for researchers to ask themselves when undertaking work on speech and gender.
摘要:本文对 2013 年至 2023 年间在 ISCA Interspeech 出版物中涉及语音与性别或性别的 107 篇研究论文进行了综述。我们注意到该领域研究的稀缺性,并发现术语,特别是“性别”一词,在使用上存在不明确之处,且往往与社会科学中普遍认为性别是社会构建的、是一个连续谱而非二元类别的观点不一致。我们强调了这可能对已经处于边缘化的群体造成的潜在问题,并建议研究人员在进行语音与性别相关研究时,应自问一些问题。

[NLP-30] Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection
该论文旨在解决大型语言模型(LLMs)中恶意提示注入攻击的有效检测和缓解问题。解决方案的关键在于利用多语言BERT(如多语言BERT和DistilBert)对提示文本进行分类,通过生成嵌入向量并结合多种机器学习方法(如高斯朴素贝叶斯、随机森林、支持向量机和逻辑回归)来提高恶意提示的检测准确率。特别是,多语言BERT嵌入方法显著提升了分类性能,使逻辑回归模型在检测恶意提示时达到了96.55%的准确率。此外,论文还分析了模型的错误预测,以揭示其局限性,并为研究人员在调整BERT模型以应对不同LLMs漏洞时提供指导。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13331
作者: Md Abdur Rahman,Hossain Shahriar,Fan Wu,Alfredo Cuzzocrea
关键词-EN: Large language models, Large language, exceptional capabilities, malicious prompt injection, wide range
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications. However, this widespread use brings significant vulnerabilities. Also, it is well observed that there are huge gap which lies in the need for effective detection and mitigation strategies against malicious prompt injection attacks in large language models, as current approaches may not adequately address the complexity and evolving nature of these vulnerabilities in real-world applications. Therefore, this work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications. It examines to apply various BERT (Bidirectional Encoder Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts. Also, we observed how tokenizing the prompt texts and generating embeddings using multilingual BERT contributes to improve the performance of various machine learning methods: Gaussian Naive Bayes, Random Forest, Support Vector Machine, and Logistic Regression. The performance of each model is rigorously analyzed with various parameters to improve the binary classification to discover malicious prompts. Multilingual BERT approach to embed the prompts significantly improved and outperformed the existing works and achieves an outstanding accuracy of 96.55% by Logistic regression. Additionally, we investigated the incorrect predictions of the model to gain insights into its limitations. The findings can guide researchers in tuning various BERT for finding the most suitable model for diverse LLMs vulnerabilities.
摘要:大语言模型 (LLMs) 以其卓越的能力而闻名,并被广泛应用于各种领域。然而,这种广泛的应用也带来了显著的脆弱性。同时,人们普遍观察到,在需要有效检测和缓解针对大语言模型的恶意提示注入攻击方面存在巨大差距,因为当前的方法可能无法充分应对这些漏洞在实际应用中的复杂性和演变性。因此,本研究聚焦于恶意提示注入攻击这一最危险的漏洞对实际 LLMs 应用的影响。研究中采用了多种 BERT (Bidirectional Encoder Representations from Transformers) 模型,如多语言 BERT 和 DistilBert,用于从合法提示中分类出恶意提示。此外,我们还观察了如何通过使用多语言 BERT 对提示文本进行 Token 化并生成嵌入,从而提升高斯朴素贝叶斯、随机森林、支持向量机和逻辑回归等多种机器学习方法的性能。通过对各种参数的严格分析,我们改进了二分类模型以发现恶意提示。多语言 BERT 方法在嵌入提示方面显著提升了性能,并超越了现有工作,通过逻辑回归实现了 96.55% 的卓越准确率。此外,我们还研究了模型的错误预测,以深入了解其局限性。这些发现可以指导研究人员在调整各种 BERT 模型时,找到最适合不同 LLMs 漏洞的模型。

[NLP-31] SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation
该论文试图解决在医疗领域中使用大型语言模型(LLMs)时面临的隐私问题和计算资源限制问题。解决方案的关键在于提出了一个开源的小型语言和视觉助手(SLaVA-CXR),专门用于胸部X光报告的自动化生成。通过引入Re³训练方法模拟放射科医生的认知发展过程,以及RADEX数据合成方法生成符合隐私法规的高质量多样化训练语料,SLaVA-CXR在2.7B参数的模型上实现了比现有最先进模型更高的性能和6倍的推理效率提升。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13321
作者: Jinge Wu,Yunsoo Kim,Daqian Shi,David Cliffton,Fenglin Liu,Honghan Wu
关键词-EN: growing research interest, assist clinicians, large language models, success of large, growing research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re ^3 Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.
摘要:受大语言模型 (Large Language Models, LLMs) 成功应用的启发,越来越多的研究兴趣集中在开发医疗领域的 LLMs 以辅助临床医生。然而,对于医院而言,使用闭源的商业 LLMs 涉及隐私问题,而开发开源的公共 LLMs 则需要大规模的计算资源,这在资源高效利用的地区和低收入国家通常是有限的。我们提出了一种开源的小型语言与视觉助手 (Small Language and Vision Assistant, SLaVA-CXR),可用于胸部 X 光报告的自动化。为了高效地训练一个小型助手,我们首先提出了 Re^3 训练方法,该方法模拟了放射科医生的认知发展,并以识别、推理和报告的方式优化模型。随后,我们引入了一种数据合成方法 RADEX,该方法能够生成高质量且多样化的训练语料库,同时符合隐私法规。广泛的实验表明,基于 2.7B 骨干网络构建的 SLaVA-CXR 不仅性能优于之前的最新模型,而且在推理效率上达到了 6 倍的速度提升。

[NLP-32] JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models
该论文试图解决日本大型语言模型(LLMs)在生物医学领域应用的不足问题,特别是缺乏全面、大规模的基准测试和评估资源。解决方案的关键在于提出一个新的基准测试,包括八个LLMs和20个日本生物医学数据集,涵盖五个任务,并通过实验结果揭示了LLMs在理解和应用生物医学知识方面的性能差异。此外,论文还提供了评估工具和数据集,以促进该领域的未来研究和发展。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13317
作者: Junfeng Jiang,Jiahao Huang,Akiko Aizawa
关键词-EN: large language models, Japanese large language, Japanese biomedical, Japanese biomedical tasks, Japanese
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark as well as the datasets are publicly available in this https URL to facilitate future research.
摘要:近期日本大语言模型 (LLM) 的发展主要集中在通用领域,而在日本生物医学 LLM 方面的进展较少。一个主要障碍是缺乏一个全面、大规模的基准用于比较。此外,用于评估日本生物医学 LLM 的资源不足。为了推动这一领域的发展,我们提出了一项新的基准,包括四个类别中的八个 LLM 和五个任务中的 20 个日本生物医学数据集。实验结果表明:(1) 对日语理解更深入且具备更丰富生物医学知识的 LLM 在日本生物医学任务中表现更佳,(2) 非主要为日本生物医学领域设计的 LLM 仍能表现出色,(3) 在某些日本生物医学任务中,现有 LLM 仍有很大的改进空间。此外,我们提供了一些见解,可能进一步促进该领域的发展。我们为基准量身定制的评估工具以及数据集已在 https URL 公开,以促进未来的研究。

[NLP-33] GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification COLING2025
该论文试图解决预训练语言模型(LMs)在文本分类任务中缺乏解释性的问题。解决方案的关键在于引入GAProtoNet,这是一种基于多头图注意力机制的原型网络,通过将输入向量和原型视为图中的节点,并利用多头图注意力机制来选择性地构建输入节点与原型节点之间的边,从而学习可解释的原型表示。在推理过程中,模型根据注意力分数加权的激活原型的线性组合做出决策,使得模型的选择可以通过注意力权重和投影到最匹配训练样本的原型来透明地解释。实验结果表明,该方法在不牺牲原始黑箱LMs准确性的前提下,实现了更好的解释性和分类性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13312
作者: Ximing Wen,Wenjuan Tan,Rosina O. Weber
关键词-EN: Pretrained transformer-based Language, transformer-based Language Models, powerful word embeddings, text classification tasks, Pretrained transformer-based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figues, submitted to COLING 2025

点击查看摘要

Abstract:Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on text classification tasks with their powerful word embeddings, but their black-box nature, which leads to a lack of interpretability, has been a major concern. In this work, we introduce GAProtoNet, a novel white-box Multi-head Graph Attention-based Prototypical Network designed to explain the decisions of text classification models built with LM encoders. In our approach, the input vector and prototypes are regarded as nodes within a graph, and we utilize multi-head graph attention to selectively construct edges between the input node and prototype nodes to learn an interpretable prototypical representation. During inference, the model makes decisions based on a linear combination of activated prototypes weighted by the attention score assigned for each prototype, allowing its choices to be transparently explained by the attention weights and the prototypes projected into the closest matching training examples. Experiments on multiple public datasets show our approach achieves superior results without sacrificing the accuracy of the original black-box LMs. We also compare with four alternative prototypical network variations and our approach achieves the best accuracy and F1 among all. Our case study and visualization of prototype clusters also demonstrate the efficiency in explaining the decisions of black-box models built with LMs.
摘要:基于预训练 Transformer 的语言模型 (Language Models, LMs) 以其强大的词嵌入能力在文本分类任务中取得了显著的改进,但其黑箱特性导致缺乏可解释性,一直是主要关注的问题。本文中,我们提出了 GAProtoNet,一种新颖的白箱多头部图注意力 (Multi-head Graph Attention) 基于原型网络 (Prototypical Network),旨在解释由 LM 编码器构建的文本分类模型的决策过程。在我们的方法中,输入向量和原型被视为图中的节点,并利用多头部图注意力机制选择性地在输入节点和原型节点之间构建边,以学习可解释的原型表示。在推理阶段,模型基于激活原型的线性组合进行决策,每个原型的权重由注意力分数决定,使得其选择可以通过注意力权重和投影到最匹配训练样本的原型来透明地解释。在多个公开数据集上的实验表明,我们的方法在不牺牲原始黑箱 LMs 准确性的前提下,取得了优越的结果。我们还与四种替代原型网络变体进行了比较,我们的方法在所有比较中达到了最佳的准确率和 F1 分数。我们的案例研究和原型簇的可视化也展示了在解释由 LMs 构建的黑箱模型决策方面的效率。

[NLP-34] Unsupervised Domain Adaptation for Keyphrase Generation using Citation Contexts EMNLP2024
该论文试图解决领域自适应中关键短语生成模型需要大量标注数据的难题。解决方案的关键在于提出了一种无监督方法——SILK,通过从引用上下文中提取银标准关键短语,生成合成标注数据,从而在无需专家标注的情况下实现领域自适应。实验结果表明,该方法在多个领域中显著提升了模型在目标领域的表现。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13266
作者: Florian Boudin,Akiko Aizawa
关键词-EN: Adapting keyphrase generation, typically involves few-shot, involves few-shot fine-tuning, keyphrase generation models, Adapting keyphrase
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2024 Findings

点击查看摘要

Abstract:Adapting keyphrase generation models to new domains typically involves few-shot fine-tuning with in-domain labeled data. However, annotating documents with keyphrases is often prohibitively expensive and impractical, requiring expert annotators. This paper presents silk, an unsupervised method designed to address this issue by extracting silver-standard keyphrases from citation contexts to create synthetic labeled data for domain adaptation. Extensive experiments across three distinct domains demonstrate that our method yields high-quality synthetic samples, resulting in significant and consistent improvements in in-domain performance over strong baselines.
摘要:将关键词生成模型适应到新领域通常需要使用领域内的标注数据进行少样本微调。然而,为文档标注关键词往往成本高昂且不切实际,需要专家标注者。本文提出了 silk,一种无监督方法,旨在通过从引用上下文中提取银标准关键词来创建合成标注数据,以解决这一问题。在三个不同领域的广泛实验表明,我们的方法能够生成高质量的合成样本,从而在领域内性能上显著且一致地优于强大的基线模型。

[NLP-35] owards LifeSpan Cognitive Systems
该论文试图解决构建一个能够持续与复杂环境(如模拟数字世界或人类社会)进行高频交互的人类化系统(LifeSpan Cognitive System, LSCS)所面临的关键挑战。解决方案的关键在于实现两个主要功能:(1) 抽象与经验融合,即系统能够高效地整合新旧经验;(2) 长期保留与准确回忆,确保系统在快速更新信息的同时,能够准确地回忆和利用过去的经验。论文提出了一种新的范式,通过整合四种基于存储复杂度的现有技术,实现这两个核心过程:吸收经验与生成响应,从而构建一个能够持续学习并适应环境的LSCS。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13265
作者: Yu Wang,Chi Han,Tongtong Wu,Xiaoxin He,Wangchunshu Zhou,Nafis Sadeq,Xiusi Chen,Zexue He,Wei Wang,Gholamreza Haffari,Heng Ji,Julian McAuley
关键词-EN: simulated digital worlds, Building a human-like, human society, presents several key, continuously interacts
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Building a human-like system that continuously interacts with complex environments – whether simulated digital worlds or human society – presents several key challenges. Central to this is enabling continuous, high-frequency interactions, where the interactions are termed experiences. We refer to this envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature of LSCS is its ability to engage in incremental and rapid updates while retaining and accurately recalling past experiences. We identify two major challenges in achieving this: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential of solving the above two major challenges can be classified into four classes based on a conceptual metric called Storage Complexity, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations. Given that none of the existing technologies can achieve LSCS alone, we propose a novel paradigm for LSCS that integrates all four classes of technologies. The new paradigm operates through two core processes: Absorbing Experiences and Generating Responses.
摘要:构建一个能够持续与复杂环境(无论是模拟的数字世界还是人类社会)交互的人类化系统,面临着几个关键挑战。核心在于实现连续、高频的交互,这些交互被称为经验。我们将这一设想中的系统称为生命周期认知系统 (LifeSpan Cognitive System, LSCS)。LSCS 的一个关键特征是它能够在进行增量和快速更新的同时,保留并准确回忆过去的经验。我们识别了实现这一目标的两个主要挑战:(1) 抽象与经验融合,以及 (2) 长期保留与准确回忆。这些特性对于存储新经验、组织过去的经验以及利用相关历史数据对环境做出反应至关重要。与依赖大型语料库进行微调并专注于特定领域或任务性能提升的持续学习语言模型不同,LSCS 必须能够以高频率快速且增量地更新来自其环境的新信息。现有技术中,具有解决上述两大挑战潜力的技术可以根据一个称为存储复杂度的概念性指标进行分类,该指标衡量存储过去经验所需的相对空间。这四类技术各有其优势和局限性。鉴于现有技术无法单独实现 LSCS,我们提出了一种新的 LSCS 范式,该范式整合了所有四类技术。新范式通过两个核心过程运作:吸收经验与生成响应。

[NLP-36] Large Language Model Should Understand Pinyin for Chinese ASR Error Correction
该论文试图解决中文自动语音识别(ASR)系统中的错误纠正问题,提出了一种拼音增强的生成错误纠正(PY-GEC)方法。解决方案的关键在于利用拼音(Pinyin)作为补充信息,通过多任务训练将拼音与文本的特征空间对齐,从而提高错误纠正的准确性。具体来说,该方法在训练时仅使用合成错误数据,并在推理时采用最优假设,通过增加对拼音特征的关注权重和特征空间的对齐,显著提升了中文ASR错误纠正的效果。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13262
作者: Yuang Li,Xiaosong Qiao,Xiaofeng Zhao,Huan Zhao,Wei Tang,Min Zhang,Hao Yang
关键词-EN: Large language models, enhance automatic speech, automatic speech recognition, speech recognition systems, Large language
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large language models can enhance automatic speech recognition systems through generative error correction. In this paper, we propose Pinyin-enhanced GEC, which leverages Pinyi, the phonetic representation of Mandarin Chinese, as supplementary information to improve Chinese ASR error correction. Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference. Additionally, we introduce a multitask training approach involving conversion tasks between Pinyin and text to align their feature spaces. Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input. More importantly, we provide intuitive explanations for the effectiveness of PY-GEC and multitask training from two aspects: 1) increased attention weight on Pinyin features; and 2) aligned feature space between Pinyin and text hidden states.
摘要:大语言模型可以通过生成式错误修正来增强自动语音识别系统。本文提出了一种拼音增强的生成式错误修正 (Pinyin-enhanced GEC),利用拼音 (Pinyi) 作为普通话的音标表示,作为补充信息来改进中文 ASR 错误修正。我们的方法仅使用合成错误进行训练,并在推理过程中采用最佳假设。此外,我们引入了一种多任务训练方法,涉及拼音与文本之间的转换任务,以对齐它们的特征空间。在 Aishell-1 和 Common Voice 数据集上的实验表明,我们的方法在文本输入的 GEC 基础上持续表现更优。更重要的是,我们从两个方面为拼音增强生成式错误修正 (PY-GEC) 和多任务训练的有效性提供了直观解释:1) 增加对拼音特征的关注权重;2) 对齐拼音与文本隐藏状态的特征空间。

[NLP-37] RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion
该论文试图解决强化学习从人类反馈(RLHF)训练系统中GPU利用率低的问题。解决方案的关键在于将传统的RLHF工作流程中的任务细分为更细粒度的子任务,并通过阶段融合(stage fusion)来提高GPU利用率。具体来说,RLHFuse将生成和推理任务分解为样本级别的子任务,以实现高效的跨阶段融合,从而缓解生成阶段的数据偏斜问题;同时,将训练任务分解为微批次子任务,并通过融合的流水线调度在训练阶段并行执行这些子任务,减少流水线气泡,从而显著提高训练吞吐量。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13221
作者: Yinmin Zhong,Zili Zhang,Bingyang Wu,Shengyu Liu,Yukun Chen,Changyi Wan,Hanpeng Hu,Lei Xia,Ranchen Ming,Yibo Zhu,Xin Jin
关键词-EN: Human Feedback, pivotal post-training technique, human preference, RLHF, Reinforcement Learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal post-training technique to enhance the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems. Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2409.13221 [cs.LG] (or arXiv:2409.13221v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13221 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
摘要:人类反馈强化学习 (Reinforcement Learning from Human Feedback, RLHF) 是一种关键的训练后技术,用于增强大语言模型 (Large Language Models, LLMs) 与人类偏好之间的对齐。RLHF 的工作流程通常涉及一系列不同阶段的多个模型和任务。现有的 RLHF 训练系统将每个任务视为最小的执行单元,从而忽略了子任务级别优化的机会。由于 RLHF 训练的内在特性,即生成阶段的数据偏斜和训练阶段的流水线气泡,现有的 RLHF 系统在生产部署中存在 GPU 利用率低的问题。RLHFuse 打破了将 RLHF 工作流程视为独立任务组合的传统观点,将每个任务细分为更细粒度的子任务,并执行阶段融合以提高 GPU 利用率。RLHFuse 包含两个关键思想。首先,对于生成和推理任务,RLHFuse 将其拆分为样本级别的子任务,通过高效的阶段间融合来缓解由长尾样本主导的原始生成瓶颈。其次,对于训练任务,RLHFuse 将其分解为微批次的子任务。通过利用流水线执行可以由另一个流水线本质互补的直觉,RLHFuse 在训练阶段执行阶段内融合,通过融合的流水线调度同时执行这些子任务,从而减少流水线气泡。此外,RLHFuse 还包含了一系列针对 RLHF 各阶段的系统优化,使其在我们的内部产品使用中高效且可扩展。我们在多种流行的大语言模型上评估了 RLHFuse,结果显示,与现有的最先进系统相比,RLHFuse 将训练吞吐量提高了多达 3.7 倍。

主题:机器学习 (cs.LG); 计算与语言 (cs.CL); 分布式、并行与集群计算 (cs.DC)
引用为:arXiv:2409.13221 [cs.LG] (或 arXiv:2409.13221v1 [cs.LG] 用于此版本)
https://doi.org/10.48550/arXiv.2409.13221
通过 DataCite 发布的 arXiv DOI (待注册)

[NLP-38] Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks
该论文试图解决小规模语言模型(SLMs)在复杂推理任务中表现不佳的问题,特别是在这些任务需要专业知识和稀有知识的情况下。解决方案的关键在于提出了一种名为Neural-Symbolic Collaborative Distillation(NesyCD)的新型知识蒸馏方法。NesyCD通过将大规模语言模型(LLMs)中的通用能力和专业知识分别进行蒸馏,实现了对SLMs的性能提升。具体来说,通用能力通过参数化的神经网络进行蒸馏,而专业知识和稀有知识则通过符号知识蒸馏方法存储在符号知识库(KB)中。这种方法不仅提高了SLMs在复杂推理任务中的表现,还使得专业知识库具有良好的泛化能力,并且易于人类理解和操作。实验结果表明,NesyCD显著提升了SLMs在多个数据集上的复杂推理性能,甚至在某些情况下超越了更大规模的模型。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13203
作者: Huanxuan Liao,Shizhu He,Yao Xu,Yuanzhe Zhang,Kang Liu,Jun Zhao
关键词-EN: Large Language Models, Small Language Models, textbf, Large Language, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose \textbfNe ural- \textbfSy mbolic \textbfC ollaborative \textbfD istillation ( \textbfNesyCD ), a novel knowledge distillation method for learning the complex reasoning abilities of Large Language Models (LLMs, e.g., \textgreater 13B). We argue that complex reasoning tasks are difficult for Small Language Models (SLMs, e.g., \leq 7B), as these tasks demand not only general cognitive abilities but also specialized knowledge, which is often sparse and difficult for these neural-based SLMs to effectively capture. Therefore, NesyCD distills the general capabilities and specialized knowledge in LLMs using different manners. On the one hand, we distill only general abilities from teacher LLMs into the student SLMs of parameterized neural networks. On the other hand, for the specialized abilities and uncommon knowledge of a complex reasoning task, we employ a symbolic knowledge distillation approach to obtain and store the specialized knowledge within a symbolic knowledge base (KB). By decoupling general and specialized capabilities, the proposed NesyCD can achieve superior performance cost-effectively, utilizing smaller models and blending parameterized neural networks with symbolic KB. Moreover, the specialized KB generalizes well and is comprehended and manipulated by humans. Our experiments show that NesyCD significantly boosts SLMs’ complex reasoning performance on in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets. Notably, our approach enabled the LLaMA3-8B and Qwen2-7B to surpass GPT-3.5-turbo in performance and come close to matching LLaMA3-70B, despite the latter having nine times more parameters. Our code will be available at this https URL.
摘要:本文提出了一种名为 神经-符号协同蒸馏 (Neural-Symbolic Collaborative Distillation, NesyCD) 的新型知识蒸馏方法,用于学习大语言模型 (Large Language Models, LLMs) 的复杂推理能力 (例如,参数数量大于 13B)。我们认为,复杂推理任务对于小语言模型 (Small Language Models, SLMs) 来说较为困难 (例如,参数数量小于等于 7B),因为这些任务不仅需要一般的认知能力,还需要专门的领域知识,而这些知识通常是稀疏且难以被基于神经网络的 SLMs 有效捕捉的。因此,NesyCD 通过不同的方式从 LLMs 中蒸馏出一般能力和专门知识。一方面,我们仅将教师 LLMs 中的一般能力蒸馏到学生 SLMs 的参数化神经网络中。另一方面,对于复杂推理任务中的专门能力和不常见知识,我们采用符号知识蒸馏方法,将这些专门知识获取并存储在符号知识库 (Knowledge Base, KB) 中。通过将一般能力和专门能力解耦,所提出的 NesyCD 能够以更高的成本效益实现卓越的性能,利用更小的模型,并将参数化神经网络与符号 KB 相结合。此外,专门 KB 具有良好的泛化性,并且可以被人类理解和操作。我们的实验表明,NesyCD 显著提升了 SLMs 在领域内 (BBH, GSM8K) 和领域外 (AGIEval, ARC) 数据集上的复杂推理性能。值得注意的是,我们的方法使得 LLaMA3-8B 和 Qwen2-7B 在性能上超越了 GPT-3.5-turbo,并接近匹配 LLaMA3-70B,尽管后者的参数数量是前者的九倍。我们的代码将在以下链接中提供:[https URL]。

[NLP-39] CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance
该论文试图解决的问题是:在提升大型语言模型(LLMs)的工具利用能力时,过度调整模型以适应特定工具调用模式,导致模型整体性能受损的问题。解决方案的关键在于提出了一种基于组件重要性的工具利用能力注入方法(CITI),通过分析模型组件的梯度重要性得分,对不同组件采用不同的训练策略,如对重要组件应用Mixture-Of-LoRA(MOLoRA),对次要组件进行微调,同时保持其他参数冻结,从而在增强工具利用能力的同时,有效避免对模型整体性能的过度损害。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13202
作者: Yupu Hao,Pengfei Cao,Zhuoran Jin,Huanxuan Liao,ubo Chen,Kang Liu,Jun Zhao
关键词-EN: Large Language Models, Large Language, Tool learning enables, enables the Large, Language Models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tool learning enables the Large Language Models (LLMs) to interact with the external environment by invoking tools, enriching the accuracy and capability scope of LLMs. However, previous works predominantly focus on improving model’s tool-utilizing accuracy and the ability to generalize to new, unseen tools, excessively forcing LLMs to adjust specific tool-invoking pattern without considering the harm to model’s general performance. This deviates from the actual applications and original intention of integrating tools to enhance model. To tackle this problem, we dissect the capability trade-offs by examining the hidden representation changes and the gradient-based importance score of model’s components. Based on the analysis result, we propose a Component Importance-based Tool-utilizing ability Injection method (CITI). According to the gradient-based importance score of different components, it alleviates the capability conflicts caused by fine-tuning process by applying distinct training strategies to different components. CITI applies Mixture-Of-LoRA (MOLoRA) for important components. Meanwhile, it fine-tunes the parameters of few components deemed less important in the backbone of the LLM, while keeping other parameters frozen. CITI can effectively enhance the model’s tool-utilizing capability without excessively compromising its general performance. Experimental results demonstrate that our approach achieves outstanding performance across a range of evaluation metrics.
摘要:工具学习使大语言模型 (LLMs) 能够通过调用工具与外部环境互动,从而丰富了 LLMs 的准确性和能力范围。然而,以往的研究主要集中在提高模型工具利用的准确性和对新工具的泛化能力上,过度迫使 LLMs 调整特定的工具调用模式,而未考虑这对模型整体性能的损害。这偏离了实际应用中通过工具增强模型的初衷。为解决这一问题,我们通过分析隐藏表示变化和基于梯度的重要性评分来剖析能力权衡。基于分析结果,我们提出了一种基于组件重要性的工具利用能力注入方法 (CITI)。根据不同组件的基于梯度的重要性评分,CITI 通过为不同组件应用不同的训练策略来缓解微调过程中引起的能力冲突。CITI 对重要组件应用了混合 LoRA (MOLoRA),同时对 LLM 主干中被认为不太重要的少数组件进行参数微调,而保持其他参数冻结。CITI 能够有效增强模型的工具利用能力,而不过度损害其整体性能。实验结果表明,我们的方法在多种评估指标上表现出色。

[NLP-40] CFSP: An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information
该论文试图解决大型语言模型(LLMs)在实际应用中因参数庞大和计算开销高而面临的挑战。解决方案的关键在于提出了一种高效的结构化剪枝框架CFSP,通过利用粗粒度(块间)和细粒度(块内)激活信息作为重要性准则来指导剪枝过程。CFSP仅需一次前向传播即可计算特征激活,从而实现高效的剪枝。具体步骤包括根据块的重要性分配稀疏预算,并在每个块内保留重要权重,同时引入基于粗粒度重要性的自适应微调策略以进一步提高性能。实验结果表明,CFSP在不同模型和稀疏预算下均优于现有方法。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13199
作者: Yuxin Wang,Minghua Ma,Zekun Wang,Jingchang Chen,Huiming Fan,Liping Shan,Qing Yang,Dongliang Xu,Ming Liu,Bing Qin
关键词-EN: Large Language Models, Large Language, Language Models, real-world applications, pruning
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:The colossal parameters and computational overhead of Large Language Models (LLMs) challenge their real-world applications. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Existing LLM pruning works focus on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, it remains a challenge to perform structured pruning efficiently and maintain performance, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse (interblock) and Fine-grained (intrablock) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it only requires one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at this https URL.
摘要:大语言模型 (LLM) 的庞大参数和计算开销对其在实际应用中的可行性提出了挑战。网络剪枝 (Network pruning) 通过移除冗余参数来实现非结构化或结构化稀疏性,最近被探索用于 LLM 加速。现有的 LLM 剪枝工作主要集中在非结构化剪枝上,这通常需要专用硬件支持才能实现实际的速度提升。相比之下,结构化剪枝可以在通用设备上减少延迟。然而,高效地进行结构化剪枝并保持性能,尤其是在高稀疏比率下,仍然是一个挑战。为此,我们引入了一种高效的结构化剪枝框架,名为 CFSP,该框架利用粗粒度 (interblock) 和细粒度 (intrablock) 激活信息作为重要性准则来指导剪枝。剪枝过程非常高效,因为它只需要一次前向传播来计算特征激活。具体来说,我们首先根据各块的重要性分配稀疏预算,然后在每个块内保留重要权重。此外,我们引入了一种恢复微调策略,该策略根据粗粒度重要性自适应地分配训练开销,以进一步提高性能。实验结果表明,CFSP 在各种稀疏预算下的多种模型上优于现有方法。我们的代码将在此 https URL 上提供。

[NLP-41] Exploring Scaling Laws for Local SGD in Large Language Model Training
该论文探讨了在大规模语言模型(LLM)训练中使用本地随机梯度下降(Local SGD)的扩展规律。解决方案的关键在于通过分布式优化算法,使得在松散连接的设备上进行训练成为可能,并在多集群设置和边缘计算环境中验证了Local SGD的有效性。研究表明,在同等模型参数、数据集和计算资源条件下,Local SGD能够达到与传统方法相媲美的效果,为替代单一大型集群训练提供了可行性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13198
作者: Qiaozhi He,Xiaomin Zhuang,Zhihua Wu
关键词-EN: loosely connected devices, paper investigates scaling, investigates scaling laws, distributed optimization algorithm, local SGD
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Technical Report

点击查看摘要

Abstract:This paper investigates scaling laws for local SGD in LLM training, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. This demonstrates its viability as an alternative to single large-cluster training.
摘要:本文研究了在大语言模型 (LLM) 训练中局部随机梯度下降 (Local SGD) 的缩放规律,这是一种促进在松散连接设备上进行训练的分布式优化算法。通过广泛的实验,我们展示了在给定相同模型参数、数据集和计算资源的情况下,局部 SGD 能够取得与传统方法相媲美的结果。此外,我们还探讨了局部 SGD 在多种实际场景中的应用,包括多集群设置和边缘计算环境。我们的研究阐明了有效进行多集群 LLM 训练的必要条件,并考察了在 LLM 训练过程中利用边缘计算资源的潜力与局限性。这表明局部 SGD 作为一种替代单一大型集群训练的可行性。

[NLP-42] ChemDFM-X: Towards Large Multimodal Model for Chemistry
该论文试图解决化学研究中多模态数据处理和任务覆盖不足的问题。解决方案的关键在于开发了一个跨模态的化学对话基础模型(ChemDFM-X),通过利用近似计算和任务特定模型预测生成多样化的多模态数据,构建了一个包含7.6M数据的指令微调数据集。这一策略不仅降低了成本,还显著提升了模型对多模态和跨模态知识的理解能力,使其在多种化学任务和数据模态中表现出色,标志着向化学通用智能(CGI)迈出了重要一步。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13194
作者: Zihan Zhao,Bo Chen,Jingpiao Li,Lu Chen,Liyang Wen,Pengyu Wang,Zichen Zhu,Danyang Zhang,Ziping Wan,Yansi Li,Zhongyang Dai,Xin Chen,Kai Yu
关键词-EN: offer unprecedented assistance, natural science including, Rapid developments, science including chemistry, tools are expected
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 19 pages, 7 figures, 11 tables

点击查看摘要

Abstract:Rapid developments of AI tools are expected to offer unprecedented assistance to the research of natural science including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMM) can cover the wide range of chemical data modality and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, which serves as a truly practical and useful research assistant utilizing the great potential of LMMs, is in great need. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora, while significantly reducing excessive expense, resulting in an instruction-tuning dataset containing 7.6M data. After instruction finetuning, ChemDFM-X is evaluated on extensive experiments of different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.
摘要:AI 工具的快速发展有望为包括化学在内的自然科学研究提供前所未有的帮助。然而,现有的单模态任务特定专家模型和新兴的多模态大模型 (LMM) 都无法覆盖化学数据模态和任务类别的广泛范围。为了满足化学家的实际需求,迫切需要一个跨模态的化学通用智能 (CGI) 系统,该系统能够充分利用 LMM 的巨大潜力,成为真正实用且有用的研究助手。在本研究中,我们介绍了首个用于化学的跨模态对话基础模型 (ChemDFM-X)。通过近似计算和任务特定模型预测,从初始模态生成多样化的多模态数据。这一策略不仅创建了充足的化学训练语料库,还显著降低了过高的成本,最终形成了一个包含 760 万数据的指令微调数据集。经过指令微调后,ChemDFM-X 在不同化学任务和多种数据模态的广泛实验中进行了评估。结果表明,ChemDFM-X 具备多模态和跨模态知识理解的能力。ChemDFM-X 标志着在化学领域对齐所有模态方面迈出了重要一步,更接近 CGI 的目标。

[NLP-43] An adapted large language model facilitates multiple medical tasks in diabetes care
该论文试图解决糖尿病管理中的多任务处理问题,特别是如何利用大型语言模型(LLMs)来优化糖尿病相关的任务处理。解决方案的关键在于开发了一个全面的糖尿病数据处理框架,包括数据收集、过滤、增强和精炼,从而创建了一个高质量的糖尿病专用数据集。基于此数据集,论文进一步微调了一系列糖尿病专用LLMs,这些模型在理解和处理各种糖尿病任务方面表现出色,并展示了在个性化医疗、医学教育和临床任务简化等方面的潜在应用。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13191
作者: Lai Wei,Zhen Ying,Muyang He,Yutong Chen,Qian Yang,Yanzhe Hong,Jiaping Lu,Xiaoying Li,Weiran Huang,Ying Chen
关键词-EN: global health burden, requires multi-stakeholder collaboration, significant global health, management requires multi-stakeholder, optimizing diabetes management
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diabetes is a chronic disease that poses a significant global health burden, and optimizing diabetes management requires multi-stakeholder collaboration. Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across a diverse range of diabetes tasks remains unproven. In this study, we introduced a framework to train and validate diabetes-specific LLMs. We first developed a comprehensive data processing pipeline that includes data collection, filtering, augmentation and refinement. This approach contributes to creating a high-quality, diabetes-specific dataset, and several evaluation benchmarks entirely from scratch. Utilizing the collected training dataset, we fine-tuned a diabetes-specific LLM family that demonstrated state-of-the-art proficiency in understanding and processing various diabetes tasks compared to other LLMs. Furthermore, clinical studies showed the potential applications of our models in diabetes care, including providing personalized healthcare, assisting medical education, and streamlining clinical tasks. In conclusion, our study introduced a framework to develop and evaluate a diabetes-specific LLM family, and highlighted its potential to enhance clinical practice and provide personalized, data-driven support for diabetes support when facing different end users. The code is provided via GitHub at this https URL.
摘要:糖尿病是一种慢性疾病,对全球健康构成重大负担,优化糖尿病管理需要多方协作。大语言模型 (LLM) 在多种医疗场景中显示出潜力,但其在广泛糖尿病任务中的有效性尚未得到验证。在本研究中,我们提出了一种框架来训练和验证糖尿病专用 LLM。我们首先开发了一个全面的数据处理流程,包括数据收集、过滤、增强和精炼。这种方法有助于创建高质量的糖尿病专用数据集,并从零开始构建多个评估基准。利用收集的训练数据集,我们微调了一系列糖尿病专用 LLM,这些模型在理解和处理各种糖尿病任务方面表现出色,优于其他 LLM。此外,临床研究表明,我们的模型在糖尿病护理中具有潜在应用,包括提供个性化医疗、辅助医学教育和简化临床任务。总之,本研究介绍了一种开发和评估糖尿病专用 LLM 系列的框架,并强调了其在增强临床实践和为不同终端用户提供个性化、数据驱动支持方面的潜力。代码通过 GitHub 提供,链接为 https URL。

[NLP-44] textitSKIntern: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models
该论文试图解决小型语言模型(SLMs)在知识记忆、推理能力和域外泛化方面的局限性问题。解决方案的关键在于引入了一种名为 SKIntern 的创新方法,通过渐进式微调过程,使SLMs逐步内化符号知识和少量示例,并由课程学习中的预定义线性衰减调度引导。这种方法有效减少了计算开销,加速了推理过程,并在多个SLMs上在域内和域外任务中显著优于现有技术基准,推理成本(以FLOPs衡量)降低了多达4倍。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13183
作者: Huanxuan Liao,Shizhu He,Yupu Hao,Xiang Li,Yuanzhe Zhang,Kang Liu,Jun Zhao
关键词-EN: Small Language Models, Large Language Models, Language Models, Small Language, Large Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Small Language Models (SLMs) are attracting attention due to the high computational demands and privacy concerns of Large Language Models (LLMs). Some studies fine-tune SLMs using Chains of Thought (CoT) data distilled from LLMs, aiming to enhance their reasoning ability. Furthermore, Some CoT distillation methods introduce external symbolic knowledge into the generation process to improve the limited knowledge memory, reasoning ability and out-of-domain (OOD) generalization of SLMs. However, the introduction of symbolic knowledge increases computational overhead and introduces potential noise. In this paper, we introduce \textitSKIntern , an innovative approach that empowers SLMs to internalize symbolic knowledge and few-shot examples gradually through a progressive fine-tuning process, guided by a predefined linear decay schedule under curriculum learning. By efficiently internalizing knowledge, \textitSKIntern reduces computational overhead and speeds up the reasoning process by focusing solely on the question during inference. It outperforms state-of-the-art baselines by over 5%, while reducing inference costs (measured in FLOPs) by up to 4\times across a wide range of SLMs in both in-domain (ID) and out-of-domain (OOD) tasks. Our code will be available at \urlthis https URL.
摘要:小型语言模型 (Small Language Models, SLMs) 由于大语言模型 (Large Language Models, LLMs) 的高计算需求和隐私问题而受到关注。一些研究通过使用从 LLMs 中提取的思维链 (Chains of Thought, CoT) 数据对 SLMs 进行微调,旨在增强其推理能力。此外,一些 CoT 蒸馏方法在生成过程中引入外部符号知识,以改善 SLMs 有限的记忆、推理能力和域外 (Out-of-Domain, OOD) 泛化能力。然而,引入符号知识增加了计算开销并引入了潜在的噪声。在本文中,我们提出了 \textit{SKIntern},这是一种创新方法,通过课程学习下的预定义线性衰减计划,逐步引导 SLMs 通过渐进式微调过程内化符号知识和少样本示例。通过高效地内化知识,\textit{SKIntern} 减少了计算开销,并通过在推理过程中仅关注问题来加速推理过程。在域内 (In-Domain, ID) 和域外 (OOD) 任务中,\textit{SKIntern} 在广泛的 SLMs 上表现优于最先进的基线模型超过 5%,同时将推理成本(以 FLOPs 衡量)降低了多达 4 倍。我们的代码将在 \url{https://this} 上提供。

[NLP-45] RRM: Robust Reward Model Training Mitigates Reward Hacking
该论文试图解决传统奖励模型(RM)在训练过程中难以区分上下文信号与无关特征(如响应长度和格式)的问题。解决方案的关键在于引入因果框架,学习与这些无关特征独立的偏好,并通过一种新颖的数据增强技术来消除这些特征的影响。实验结果表明,这种方法能够有效过滤掉不相关的特征,显著提升奖励模型的鲁棒性和性能。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13156
作者: Tianqi Liu,Wei Xiong,Jie Ren,Lichang Chen,Junru Wu,Rishabh Joshi,Yang Gao,Jiaming Shen,Zhen Qin,Tianhe Yu,Daniel Sohn,Anastasiia Makarova,Jeremiah Liu,Yuan Liu,Bilal Piot,Abe Ittycheriah,Aviral Kumar,Mohammad Saleh
关键词-EN: aligning large language, large language models, play a pivotal, pivotal role, role in aligning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on RewardBench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.
摘要:奖励模型 (Reward Models, RMs) 在大语言模型 (Large Language Models, LLMs) 与人类偏好对齐中扮演着关键角色。然而,传统的 RM 训练依赖于与特定提示相关的响应对,难以区分由提示驱动的偏好与提示无关的特征,如响应长度和格式。本文揭示了当前 RM 训练方法的一个根本性局限,即 RMs 在确定偏好时无法有效区分上下文信号与无关特征。为解决这一问题,我们引入了一个因果框架,该框架学习与这些特征无关的偏好,并提出了一种新颖的数据增强技术,旨在消除这些特征。大量实验表明,我们的方法成功过滤了不理想的特征,生成了一个更鲁棒的奖励模型 (Robust Reward Model, RRM)。我们的 RRM 在 Gemma-2-9b-it 上训练的成对奖励模型的性能在 RewardBench 上有所提升,准确率从 80.61% 提高到 84.15%。此外,我们使用 RM 和 RRM 分别训练了两个 DPO 策略,结果显示 RRM 显著增强了 DPO 对齐策略,MT-Bench 评分从 7.27 提升至 8.31,AlpacaEval-2 中的长度控制胜率从 33.46% 提高到 52.49%。

[NLP-46] Are Large Language Models Good Essay Graders?
该论文试图解决大语言模型(LLMs)在自动作文评分(AES)任务中与人类评分的一致性问题。解决方案的关键在于评估ChatGPT和Llama在零样本和少样本学习环境下的表现,并通过不同的提示方法进行测试。研究结果表明,尽管LLMs在评分上普遍低于人类评分且相关性不强,但Llama 3的表现相对较好,显示出LLMs在未来可能作为辅助人类评分的工具的潜力。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13120
作者: Anindita Kundu,Denilson Barbosa
关键词-EN: Large Language Models, Language Models, Large Language, effectiveness of Large, assessing essay quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.
摘要:我们评估了大语言模型 (LLM) 在评估作文质量方面的有效性,特别关注其与人类评分的对齐情况。更具体地说,我们评估了 ChatGPT 和 Llama 在自动作文评分 (AES) 任务中的表现,这是教育领域中一个重要的自然语言处理 (NLP) 应用。我们考虑了零样本和少样本学习以及不同的提示方法。我们比较了 LLM 提供的数值评分与 ASAP 数据集中人类评分者提供的分数,ASAP 数据集是 AES 任务中一个著名的基准。我们的研究表明,与人类评分者提供的分数相比,LLM 通常会给出较低的分数;此外,这些分数与人类评分者的分数相关性不佳。特别是,ChatGPT 往往比 Llama 更为严格,并且与人类评价的偏差更大。我们还尝试了多种以往 AES 方法常用的作文特征,包括长度、连接词和过渡词的使用情况以及可读性指标,如拼写和语法错误的数量。我们发现,通常这些特征与人类或 LLM 的评分没有很强的相关性。最后,我们报告了 Llama 3 的结果,这些结果总体上表现更好,符合预期。总的来说,虽然 LLM 似乎不足以完全替代人类评分,但我们的结果在一定程度上鼓励了它们未来作为辅助工具用于书面作文评分的潜力。

[NLP-47] Personalized Speech Recognition for Children with Test-Time Adaptation
该论文试图解决儿童语音识别中的数据域偏移问题,即现有的基于成人数据预训练的自动语音识别(ASR)模型在应用于儿童语音时表现不佳。解决方案的关键在于提出了一种新的ASR管道,采用无监督的测试时适应(TTA)方法,使得预训练于成人语音的ASR模型能够在测试时持续适应每个儿童说话者,而无需额外的人工标注。这种方法显著提升了ASR模型在儿童语音识别上的性能,并揭示了儿童说话者之间及内部存在的显著数据域偏移,进一步强调了测试时适应的必要性。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13095
作者: Zhonghao Shi,Harshvardhan Srivastava,Xuan Shi,Shrikanth Narayanan,Maja J. Matarić
关键词-EN: Accurate automatic speech, real-time child-AI interaction, effective real-time child-AI, Accurate automatic, ASR
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Accurate automatic speech recognition (ASR) for children is crucial for effective real-time child-AI interaction, especially in educational applications. However, off-the-shelf ASR models primarily pre-trained on adult data tend to generalize poorly to children’s speech due to the data domain shift from adults to children. Recent studies have found that supervised fine-tuning on children’s speech data can help bridge this domain shift, but human annotations may be impractical to obtain for real-world applications and adaptation at training time can overlook additional domain shifts occurring at test time. We devised a novel ASR pipeline to apply unsupervised test-time adaptation (TTA) methods for child speech recognition, so that ASR models pre-trained on adult speech can be continuously adapted to each child speaker at test time without further human annotations. Our results show that ASR models adapted with TTA methods significantly outperform the unadapted off-the-shelf ASR baselines both on average and statistically across individual child speakers. Our analysis also discovered significant data domain shifts both between child speakers and within each child speaker, which further motivates the need for test-time adaptation.
摘要:儿童的自动语音识别 (ASR) 对于实现有效的实时儿童与 AI 互动至关重要,特别是在教育应用中。然而,现成的 ASR 模型主要基于成人数据进行预训练,由于从成人到儿童的数据域转移,这些模型在儿童语音上的泛化能力较差。最近的研究发现,对儿童语音数据进行监督微调可以帮助弥合这一数据域转移,但在现实应用中获取人工标注可能不切实际,并且在训练时进行的适应可能会忽略测试时出现的额外数据域转移。我们设计了一种新的 ASR 流程,应用无监督的测试时适应 (TTA) 方法进行儿童语音识别,使得预训练于成人语音的 ASR 模型能够在测试时持续适应每个儿童说话者,而无需进一步的人工标注。我们的结果表明,通过 TTA 方法适应的 ASR 模型在平均水平和统计上均显著优于未适应的现成 ASR 基线模型,无论是针对个体儿童说话者还是整体表现。我们的分析还发现了儿童说话者之间以及每个儿童说话者内部存在显著的数据域转移,这进一步强调了测试时适应的必要性。

[NLP-48] Guided Profile Generation Improves Personalization with LLMs EMNLP2024
该论文试图解决大型语言模型(LLMs)在处理稀疏和复杂个性化上下文时难以有效解析和利用的问题。解决方案的关键在于提出了一种名为Guided Profile Generation(GPG)的通用方法,通过生成自然语言的个人简介,帮助LLMs从个性化上下文中总结和提取重要且独特的特征,从而生成更贴近个人习惯和偏好的描述性句子。实验结果表明,GPG显著提升了LLMs在个性化任务中的表现,例如在预测个人偏好时,准确率提高了37%。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13093
作者: Jiarui Zhang
关键词-EN: Large Language Models, modern commercial systems, improving customer experiences, including Recommendation, input into Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:In modern commercial systems, including Recommendation, Ranking, and E-Commerce platforms, there is a trend towards improving customer experiences by incorporating Personalization context as input into Large Language Models (LLMs). However, LLMs often struggle to effectively parse and utilize sparse and complex personal context without additional processing or contextual enrichment, underscoring the need for more sophisticated context understanding mechanisms. In this work, we propose Guided Profile Generation (GPG), a general method designed to generate personal profiles in natural language. As is observed, intermediate guided profile generation enables LLMs to summarize, and extract the important, distinctive features from the personal context into concise, descriptive sentences, precisely tailoring their generation more closely to an individual’s unique habits and preferences. Our experimental results show that GPG improves LLM’s personalization ability across different tasks, for example, it increases 37% accuracy in predicting personal preference compared to directly feeding the LLMs with raw personal context.
摘要:在现代商业系统中,包括推荐系统、排序系统和电子商务平台,通过将个性化上下文作为输入融入大语言模型 (LLM) 来提升客户体验的趋势日益明显。然而,LLM 在处理和利用稀疏且复杂的个性化上下文时,往往表现不佳,这凸显了需要更复杂上下文理解机制的必要性。在本研究中,我们提出了引导式个人资料生成 (Guided Profile Generation, GPG),这是一种旨在生成自然语言个人资料的通用方法。观察发现,中间引导式个人资料生成使 LLM 能够总结并提取个性化上下文中的重要、独特特征,将其转化为简洁、描述性的句子,从而更精确地定制生成内容,以更贴近个人的独特习惯和偏好。我们的实验结果表明,GPG 在不同任务中提升了 LLM 的个性化能力,例如,与直接将原始个人上下文输入 LLM 相比,在预测个人偏好时,准确率提高了 37%。

[NLP-49] Embedding Geometries of Contrastive Language-Image Pre-Training ECCV2024
该论文试图解决CLIP模型在对比预训练中使用L2归一化和余弦相似度对数的问题,提出了一种基于欧几里得几何的替代方案,即Euclidean CLIP (EuCLIP)。解决方案的关键在于采用直观的欧几里得几何和softmax对数,这种方法不仅简化了模型设计,还在性能上与原始CLIP相当或更优,同时支持层次关系,至少不逊于更复杂的双曲几何替代方案。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13079
作者: Jason Chuan-Chih Chou,Nahid Alam
关键词-EN: InfoNCE loss, loss for contrastive, widely popular, popular for bridging, CLIP
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024 - Beyond Euclidean Workshop

点击查看摘要

Abstract:Since the publication of CLIP, the approach of using InfoNCE loss for contrastive pre-training has become widely popular for bridging two or more modalities. Despite its wide adoption, CLIP’s original design choices of L2 normalization and cosine similarity logit have rarely been revisited. We have systematically experimented with alternative geometries and softmax logits for language-image pre-training and identified that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as more complicated hyperbolic alternative.
摘要:自 CLIP 发布以来,使用 InfoNCE 损失进行对比预训练的方法在连接两种或多种模态方面变得非常流行。尽管这种方法被广泛采用,但 CLIP 最初设计的 L2 归一化和余弦相似度对数选择却很少被重新审视。我们系统地试验了语言-图像预训练中替代的几何结构和对数形式的 softmax,发现采用直观欧几里得几何结构的变体,即欧几里得 CLIP (EuCLIP),在性能上能够匹配或超越 CLIP,并且在支持层次关系方面至少与更复杂的双曲几何替代方案一样有效。

[NLP-50] LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models
该论文试图解决大语言模型(LLMs)中嵌入的过时或问题知识的问题,提出了一种名为“LLM Surgery”的框架,通过优化一个包含三个组件的目标函数来高效地修改LLM的行为。解决方案的关键在于:(1) 对需要遗忘的数据集(问题和过时信息)进行反向梯度处理;(2) 对更新数据集(新信息)进行梯度下降处理;(3) 通过最小化保留数据集(未变化文本的小子集)上的KL散度,确保预训练模型与修改后模型输出的一致性。该方法在不重新训练模型的前提下,实现了对问题知识的有效遗忘、新知识的整合以及模型性能的维持。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13054
作者: Akshaj Kumar Veldanda,Shi-Xiong Zhang,Anirban Das,Supriyo Chakraborty,Stephen Rawls,Sambit Sahu,Milind Naphade
关键词-EN: Large language models, Large language, problematic knowledge embedded, revolutionized various domains, embedded during pretraining
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized various domains, yet their utility comes with significant challenges related to outdated or problematic knowledge embedded during pretraining. This paper addresses the challenge of modifying LLMs to unlearn problematic and outdated information while efficiently integrating new knowledge without retraining from scratch. Here, we propose LLM Surgery, a framework to efficiently modify LLM behaviour by optimizing a three component objective function that: (1) Performs reverse gradient on unlearning dataset (problematic and outdated information), (2) Performs gradient descent on the update dataset (new and updated information), and (3) Minimizes the KL divergence on the retain dataset (small subset of unchanged text), ensuring alignment between pretrained and modified model outputs. Due to the lack of publicly available datasets specifically tailored for our novel task, we compiled a new dataset and an evaluation benchmark. Using Llama2-7B, we demonstrate that LLM Surgery can achieve significant forgetting on the unlearn set, a 20% increase in accuracy on the update set, and maintain performance on the retain set.
摘要:大语言模型 (LLMs) 已经在多个领域带来了革命性的变化,但其效用伴随着一个重大挑战,即预训练过程中嵌入的过时或问题知识。本文针对这一挑战,提出了一种在不从头开始重新训练的情况下,修改 LLMs 以遗忘问题和过时信息,并高效整合新知识的解决方案。在此,我们提出了 LLM 手术 (LLM Surgery) 框架,通过优化一个包含三个组成部分的目标函数来高效修改 LLM 行为:(1) 对遗忘数据集 (问题和过时信息) 执行反向梯度操作,(2) 对更新数据集 (新信息和更新信息) 执行梯度下降操作,(3) 最小化保留数据集 (不变文本的小子集) 上的 KL 散度,以确保预训练模型与修改后模型输出之间的一致性。由于缺乏专门针对我们这一新任务的公开可用数据集,我们编译了一个新的数据集和一个评估基准。使用 Llama2-7B,我们展示了 LLM 手术能够在遗忘集上实现显著的遗忘效果,在更新集上提高 20% 的准确率,并在保留集上保持性能。

[NLP-51] ACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning COLING2025
该论文试图解决大语言模型(如GPT-4)在应用中由于提示(prompt)规模增大导致的计算效率问题。解决方案的关键在于提出了一种基于强化学习(RL)的任务感知提示压缩方法,通过利用Transformer编码器和轻量级REINFORCE算法,结合任务特定的奖励信号,实现了在不牺牲任务性能的前提下,有效减少输入令牌数量,从而降低推理成本并满足低延迟要求。该方法在文本摘要、问答和代码摘要三个不同且具有挑战性的任务中,相比现有最先进的压缩技术,显著提升了任务性能(提升幅度为8%至260%)。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13035
作者: Shivam Shandilya,Menglin Xia,Supriyo Ghosh,Huiqiang Jiang,Jue Zhang,Qianhui Wu,Victor Rühle
关键词-EN: large language models, leading to challenges, computational efficiency, increasing prevalence, prevalence of large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to COLING 2025

点击查看摘要

Abstract:The increasing prevalence of large language models (LLMs) such as GPT-4 in various applications has led to a surge in the size of prompts required for optimal performance, leading to challenges in computational efficiency. Prompt compression aims to reduce the inference cost by minimizing input tokens without compromising on the task performance. However, existing prompt compression techniques either rely on sub-optimal metrics such as information entropy or model it as a task-agnostic token classification problem that fails to capture task-specific information. To address these issues, we propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. To ensure low latency requirements, we leverage existing Transformer encoder-based token classification model while guiding the learning process with task-specific reward signals using lightweight REINFORCE algorithm. We evaluate the performance of our method on three diverse and challenging tasks including text summarization, question answering and code summarization. We demonstrate that our RL-guided compression method improves the task performance by 8% - 260% across these three scenarios over state-of-the-art compression techniques while satisfying the same compression rate and latency requirements.
摘要:随着诸如 GPT-4 等大语言模型 (LLM) 在各种应用中的普及,为了达到最佳性能所需的提示词规模不断增加,导致计算效率面临挑战。提示词压缩旨在通过最小化输入 Token 来降低推理成本,同时不影响任务表现。然而,现有的提示词压缩技术要么依赖于信息熵等次优指标,要么将其视为与任务无关的 Token 分类问题,未能捕捉到任务特定的信息。为解决这些问题,我们提出了一种新颖且高效的基于强化学习 (RL) 的任务感知提示词压缩方法。为确保低延迟要求,我们利用现有的基于 Transformer 编码器的 Token 分类模型,同时使用轻量级的 REINFORCE 算法,通过任务特定的奖励信号指导学习过程。我们在三个多样且具有挑战性的任务上评估了我们的方法,包括文本摘要、问答和代码摘要。实验结果表明,在满足相同的压缩率和延迟要求下,我们的 RL 引导的压缩方法在这些三个场景中相较于最先进的压缩技术,任务表现提升了 8% 至 260%。

[NLP-52] CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair
该论文旨在解决大型语言模型(LLMs)在生成硬件描述语言(如Verilog)代码时面临的两个主要问题:难以处理非文本表示(如卡诺图、状态转移图和波形图)以及训练过程中模型随机出现“小错误”导致的显著变异性。解决方案的关键在于通过创建正确性构造的数据集来增强数据管理,专门针对非文本表示进行优化,并引入自动化框架生成错误报告,将这些错误注入开源代码以创建有针对性的代码修复数据。这些改进使得经过微调的Starcoder2-15B模型在VerilogEval-Machine、VerilogEval-Human和RTLLM上的pass@1性能分别提升了3.8%、10.9%和6.6%,超越了先前的最先进结果。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.12993
作者: Mingjie Liu,Yun-Da Tsai,Wenfei Zhou,Haoxing Ren
关键词-EN: hardware description languages, challenges persist, significant progress made, large language models, progress made
类目: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the significant progress made in code generation with large language models, challenges persist, especially with hardware description languages such as Verilog. This paper first presents an analysis of fine-tuned LLMs on Verilog coding, with synthetic data from prior methods. We identify two main issues: difficulties in handling non-textual representations (Karnaugh maps, state-transition diagrams and waveforms) and significant variability during training with models randomly making “minor” mistakes. To address these limitations, we enhance data curation by creating correct-by-construction data targeting non-textual representations. Additionally, we introduce an automated framework that generates error reports from various model checkpoints and injects these errors into open-source code to create targeted code repair data. Our fine-tuned Starcoder2-15B outperforms prior state-of-the-art results by 3.8%, 10.9%, 6.6% for pass@1 on VerilogEval-Machine, VerilogEval-Human, and RTLLM.
摘要:尽管大语言模型在代码生成方面取得了显著进展,但在硬件描述语言如 Verilog 方面仍存在挑战。本文首先对基于 Verilog 编码的微调大语言模型进行了分析,使用了先前方法的合成数据。我们识别出两个主要问题:处理非文本表示(如卡诺图、状态转移图和波形)的困难,以及训练过程中模型随机产生“小”错误的显著变异性。为解决这些限制,我们通过创建针对非文本表示的正确构造数据来增强数据管理。此外,我们引入了一个自动化框架,从各种模型检查点生成错误报告,并将这些错误注入开源代码以创建有针对性的代码修复数据。我们微调的 Starcoder2-15B 在 VerilogEval-Machine、VerilogEval-Human 和 RTLLM 上的 pass@1 分别比先前的最先进结果高出 3.8%、10.9% 和 6.6%。

[NLP-53] Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models
该论文试图解决的问题是评估视觉语言模型(VLMs)是否具备视觉视角转换(VPT)能力,即理解他人视角并预测其行为的能力。解决方案的关键在于引入了两个手动构建的数据集(Isle-Bricks和Isle-Dots),用于测试VLMs的VPT技能,并通过评估12种常用VLMs的表现,发现这些模型在需要视角转换的任务中表现显著下降。此外,研究还指出,物体检测任务的表现与VPT任务的表现之间缺乏相关性,表明现有基准可能不足以全面理解这一问题。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.12969
作者: Gracjan Góral,Alicja Ziarko,Michal Nauman,Maciej Wołczyk
关键词-EN: enables individuals, Visual perspective-taking, individuals to anticipate, anticipate the actions, Vision Language Models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots for testing VPT skills, and we use it to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find performance in object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at this https URL
摘要:视觉视角理解 (Visual Perspective-Taking, VPT),即理解他人视角的能力,使个体能够预测他人的行为。例如,驾驶员可以通过评估行人所见来避免事故。人类通常在幼儿时期就发展出这种技能,但目前尚不清楚新兴的视觉语言模型 (Vision Language Models, VLMs) 是否具备这种能力。此外,随着这些模型越来越多地应用于现实世界,理解它们在如 VPT 这类复杂任务中的表现变得至关重要。本文中,我们引入了两个手动策划的数据集,Isle-Bricks 和 Isle-Dots,用于测试 VPT 技能,并使用这些数据集评估了 12 个常用的 VLMs。在所有模型中,我们观察到当需要进行视角理解时,性能显著下降。此外,我们发现物体检测任务的性能与 VPT 任务的性能之间存在较差的关联性,这表明现有的基准可能不足以全面理解这一问题。代码和数据集将在以下链接中提供:https URL

[NLP-54] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment
该论文试图解决大型语言模型(LLMs)在生成有害内容方面的安全性问题,特别是缺乏系统评估其抵抗能力的基准。解决方案的关键在于提出了一种名为h4rm3l的新型动态基准,通过以下三个组件实现:(1) 一种领域特定语言,用于将越狱攻击形式化为参数化提示转换原语的组合;(2) 基于bandit的少样本程序合成算法,生成针对目标黑箱LLM的安全过滤器进行优化的创新攻击;(3) 开源自动化红队测试软件,结合前两个组件生成并执行攻击。该方法生成了2656个成功的创新越狱攻击,针对6个最先进的LLMs,其中一些攻击的成功率超过90%,显著提升了对LLM安全限制的理解,并支持开发更强大的防御机制。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2408.04811
作者: Moussa Koulako Bala Doumbouya,Ananjan Nandi,Gabriel Poesia,Davide Ghilardi,Anna Goldie,Federico Bianchi,Dan Jurafsky,Christopher D. Manning
关键词-EN: critical concern due, Large Language Models, resist generating harmful, Large Language, jailbreak attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The safety of Large Language Models (LLMs) remains a critical concern due to a lack of adequate benchmarks for systematically evaluating their ability to resist generating harmful content. Previous efforts towards automated red teaming involve static or templated sets of illicit requests and adversarial prompts which have limited utility given jailbreak attacks’ evolving and composable nature. We propose a novel dynamic benchmark of composable jailbreak attacks to move beyond static datasets and taxonomies of attacks and harms. Our approach consists of three components collectively called h4rm3l: (1) a domain-specific language that formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives, (2) bandit-based few-shot program synthesis algorithms that generate novel attacks optimized to penetrate the safety filters of a target black box LLM, and (3) open-source automated red-teaming software employing the previous two components. We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs. Several of our synthesized attacks are more effective than previously reported ones, with Attack Success Rates exceeding 90% on SOTA closed language models such as claude-3-haiku and GPT4-o. By generating datasets of jailbreak attacks in a unified formal representation, h4rm3l enables reproducible benchmarking and automated red-teaming, contributes to understanding LLM safety limitations, and supports the development of robust defenses in an increasingly LLM-integrated world. Warning: This paper and related research artifacts contain offensive and potentially disturbing prompts and model-generated content. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG) MSC classes: 68 ACMclasses: I.2; I.2.0; I.2.1; I.2.5; I.2.7; K.6.5; K.4.2 Cite as: arXiv:2408.04811 [cs.CR] (or arXiv:2408.04811v2 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2408.04811 Focus to learn more arXiv-issued DOI via DataCite
摘要:大语言模型 (LLM) 的安全性仍然是一个关键问题,因为缺乏系统评估其抵抗生成有害内容能力的充分基准。先前在自动化红队测试方面的努力涉及静态或模板化的非法请求集和对抗性提示,鉴于越狱攻击的演变和组合性质,这些方法的实用性有限。我们提出了一种新颖的组合越狱攻击动态基准,以超越静态数据集和攻击与危害的分类。我们的方法由三个组件组成,统称为 h4rm3l:(1) 一种领域特定语言,正式地将越狱攻击表达为参数化提示转换原语的组合;(2) 基于 bandit 的少样本程序合成算法,生成针对目标黑箱大语言模型安全过滤器进行优化的创新攻击;(3) 采用前两个组件的开源自动化红队测试软件。我们使用 h4rm3l 生成了针对 6 个最先进 (SOTA) 开源和专有大语言模型的 2656 个成功的新型越狱攻击数据集。我们合成的几种攻击比先前报告的攻击更有效,在 SOTA 封闭语言模型如 claude-3-haiku 和 GPT4-o 上的攻击成功率超过 90%。通过生成统一形式表示的越狱攻击数据集,h4rm3l 实现了可重复的基准测试和自动化红队测试,有助于理解大语言模型的安全局限性,并支持在日益集成大语言模型的世界中开发强大的防御措施。警告:本文及相关研究成果包含具有攻击性和可能令人不安的提示和模型生成内容。

主题:密码学与安全 (cs.CR);人工智能 (cs.AI);计算与语言 (cs.CL);计算机与社会 (cs.CY);机器学习 (cs.LG)
MSC 类别:68
ACM 类别:I.2;I.2.0;I.2.1;I.2.5;I.2.7;K.6.5;K.4.2
引用为:arXiv:2408.04811 [cs.CR]
(或 arXiv:2408.04811v2 [cs.CR] 用于此版本)
https://doi.org/10.48550/arXiv.2408.04811
通过 DataCite 发布的 arXiv 发行 DOI

[NLP-55] Natural Language Processing Methods for the Study of Protein-Ligand Interactions
该论文试图解决蛋白质-配体相互作用(PLIs)的预测问题,这一问题在药物发现和蛋白质工程中具有重要意义。解决方案的关键在于利用自然语言处理(NLP)技术,特别是长短期记忆网络(LSTM)、变压器(Transformers)和注意力机制等先进模型,来处理和分析蛋白质和配体的序列及结构数据。这些NLP方法通过模拟人类语言处理的方式,能够有效提取和理解蛋白质与配体之间的复杂关系,从而提高PLI预测的准确性和效率。然而,论文也指出了当前NLP方法在PLI研究中的局限性,并提出了未来需要解决的关键挑战。【详细内容请查看摘要】

链接: https://arxiv.org/abs/2409.13057
作者: James Michels,Ramya Bandarupalli,Amin Ahangar Akbari,Thai Le,Hong Xiao,Jing Li,Erik F. Y. Hom
关键词-EN: Natural Language Processing, predicting protein-ligand interactions, protein engineering efforts, developing effective methods, Language Processing
类目: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL)
备注: 52 Pages and 3 Figures

点击查看摘要

Abstract:Recent advances in Natural Language Processing (NLP) have ignited interest in developing effective methods for predicting protein-ligand interactions (PLIs) given their relevance to drug discovery and protein engineering efforts and the ever-growing volume of biochemical sequence and structural data available. The parallels between human languages and the “languages” used to represent proteins and ligands have enabled the use of NLP machine learning approaches to advance PLI studies. In this review, we explain where and how such approaches have been applied in the recent literature and discuss useful mechanisms such as long short-term memory, transformers, and attention. We conclude with a discussion of the current limitations of NLP methods for the study of PLIs as well as key challenges that need to be addressed in future work.
摘要:近年来,自然语言处理 (Natural Language Processing, NLP) 的进展激发了开发有效方法来预测蛋白质-配体相互作用 (Protein-Ligand Interactions, PLIs) 的兴趣,这些相互作用与药物发现和蛋白质工程的努力密切相关,并且随着生物化学序列和结构数据的不断增长,其重要性日益凸显。人类语言与用于表示蛋白质和配体的“语言”之间的相似性,使得 NLP 机器学习方法能够应用于 PLI 研究。在本综述中,我们解释了这些方法在近期文献中的应用情况,并讨论了诸如长短期记忆 (Long Short-Term Memory)、Transformer 和注意力机制等有用的机制。最后,我们讨论了当前 NLP 方法在 PLI 研究中的局限性,以及未来工作中需要解决的关键挑战。

人工智能

[AI-0] Morphological Detection and Classification of Microplastics and Nanoplastics Emerged from Consumer Products by Deep Learning

链接: https://arxiv.org/abs/2409.13688
作者: Hadi Rezvani,Navid Zarrabi,Ishaan Mehta,Christopher Kolios,Hussein Ali Jaafar,Cheng-Hao Kao,Sajad Saeedi,Nariman Yousefi
关键词-EN: escalating global issue, Plastic pollution presents, global issue, impacting health, environmental systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Plastic pollution presents an escalating global issue, impacting health and environmental systems, with micro- and nanoplastics found across mediums from potable water to air. Traditional methods for studying these contaminants are labor-intensive and time-consuming, necessitating a shift towards more efficient technologies. In response, this paper introduces micro- and nanoplastics (MiNa), a novel and open-source dataset engineered for the automatic detection and classification of micro and nanoplastics using object detection algorithms. The dataset, comprising scanning electron microscopy images simulated under realistic aquatic conditions, categorizes plastics by polymer type across a broad size spectrum. We demonstrate the application of state-of-the-art detection algorithms on MiNa, assessing their effectiveness and identifying the unique challenges and potential of each method. The dataset not only fills a critical gap in available resources for microplastic research but also provides a robust foundation for future advancements in the field.

[AI-1] he Impact of Large Language Models in Academia: from Writing to Speaking

链接: https://arxiv.org/abs/2409.13686
作者: Mingmeng Geng,Caixi Chen,Yanru Wu,Dongping Chen,Yao Wan,Pan Zhou
关键词-EN: Large language models, increasingly impacting human, Large language, impacting human society, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly impacting human society, particularly in textual information. Based on more than 30,000 papers and 1,000 presentations from machine learning conferences, we examined and compared the words used in writing and speaking, representing the first large-scale investigating study of how LLMs influence the two main modes of verbal communication and expression within the same group of people. Our empirical results show that LLM-style words such as “significant” have been used more frequently in abstracts and oral presentations. The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.

[AI-2] he FIX Benchmark: Extracting Features Interpretable to eXperts

链接: https://arxiv.org/abs/2409.13684
作者: Helen Jin,Shreya Havaldar,Chaehyeon Kim,Anton Xue,Weiqiu You,Helen Qu,Marco Gatti,Daniel A Hashimoto,Bhuvnesh Jain,Amin Madani,Masao Sako,Lyle Ungar,Eric Wong
关键词-EN: explain model predictions, model predictions, explain model, implicitly assume, features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we have developed feature interpretability objectives across diverse real-world settings and unified them into a single framework that is the FIX benchmark. We find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.

[AI-3] ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation

链接: https://arxiv.org/abs/2409.13682
作者: Abrar Anwar,John Welsh,Joydeep Biswas,Soha Pouya,Yan Chang
关键词-EN: understanding complex environments, Navigating and understanding, understanding complex, complex environments, environments over extended
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Navigating and understanding complex environments over extended periods of time is a significant challenge for robots. People interacting with the robot may want to ask questions like where something happened, when it occurred, or how long ago it took place, which would require the robot to reason over a long history of their deployment. To address this problem, we introduce a Retrieval-augmented Memory for Embodied Robots, or ReMEmbR, a system designed for long-horizon video question answering for robot navigation. To evaluate ReMEmbR, we introduce the NaVQA dataset where we annotate spatial, temporal, and descriptive questions to long-horizon robot navigation videos. ReMEmbR employs a structured approach involving a memory building and a querying phase, leveraging temporal information, spatial information, and images to efficiently handle continuously growing robot histories. Our experiments demonstrate that ReMEmbR outperforms LLM and VLM baselines, allowing ReMEmbR to achieve effective long-horizon reasoning with low latency. Additionally, we deploy ReMEmbR on a robot and show that our approach can handle diverse queries. The dataset, code, videos, and other material can be found at the following link: this https URL

[AI-4] A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification

链接: https://arxiv.org/abs/2409.13676
作者: Michel Olvera(S2A, LTCI, IDS),Paraskevas Stamatiadis(S2A, LTCI, IDS),Slim Essid(IDS, S2A, LTCI)
关键词-EN: Audio-text models trained, contrastive learning offer, audio classification, Audio-text models, zero-shot audio classification
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events, Oct 2024, Tokyo, Japan

点击查看摘要

Abstract:Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as “this is a sound of” followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts significantly affects performance so that simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels by audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.

[AI-5] OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

链接: https://arxiv.org/abs/2409.13652
作者: Stephen Zhang,Vardan Papyan
关键词-EN: recent paradigm shift, found great success, prohibitively expensive costs, high memory consumption, large-scale foundation models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while has found great success in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. Without any retraining, OATS achieves state-of-the-art performance when compressing models by up to 60% on large language models such as Llama-3 and Phi-3 and vision transformers such as ViT and DINOv2 while delivering up to 1.37\times the CPU acceleration versus a model that was comparably pruned.

[AI-6] Advancing Event Causality Identification via Heuristic Semantic Dependency Inquiry Network

链接: https://arxiv.org/abs/2409.13621
作者: Haoran Li,Qiang Gao,Hongmei Wu,Li Huang
关键词-EN: Event Causality Identification, Causality Identification, Event Causality, focuses on extracting, Dependency Inquiry Network
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Event Causality Identification (ECI) focuses on extracting causal relations between events in texts. Existing methods for ECI primarily rely on causal features and external knowledge. However, these approaches fall short in two dimensions: (1) causal features between events in a text often lack explicit clues, and (2) external knowledge may introduce bias, while specific problems require tailored analyses. To address these issues, we propose SemDI - a simple and effective Semantic Dependency Inquiry Network for ECI. SemDI captures semantic dependencies within the context using a unified encoder. Then, it utilizes a Cloze Analyzer to generate a fill-in token based on comprehensive context understanding. Finally, this fill-in token is used to inquire about the causal relation between two events. Extensive experiments demonstrate the effectiveness of SemDI, surpassing state-of-the-art methods on three widely used benchmarks. Code is available at this https URL.

[AI-7] MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension EMNLP2024

链接: https://arxiv.org/abs/2409.13609
作者: Ting Liu,Zunnan Xu,Yue Hu,Liangtao Shi,Zhiqiang Wang,Quanjun Yin
关键词-EN: Referring Expression Comprehension, Referring Expression, Expression Comprehension, local visual region, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024

点击查看摘要

Abstract:Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by a aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.

[AI-8] MeLIAD: Interpretable Few-Shot Anomaly Detection with Metric Learning and Entropy-based Scoring

链接: https://arxiv.org/abs/2409.13602
作者: Eirini Cholopoulou,Dimitris K. Iakovidis
关键词-EN: automating quality inspection, detecting defective products, plays a pivotal, quality inspection, pivotal role
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Anomaly detection (AD) plays a pivotal role in multimedia applications for detecting defective products and automating quality inspection. Deep learning (DL) models typically require large-scale annotated data, which are often highly imbalanced since anomalies are usually scarce. The black box nature of these models prohibits them from being trusted by users. To address these challenges, we propose MeLIAD, a novel methodology for interpretable anomaly detection, which unlike the previous methods is based on metric learning and achieves interpretability by design without relying on any prior distribution assumptions of true anomalies. MeLIAD requires only a few samples of anomalies for training, without employing any augmentation techniques, and is inherently interpretable, providing visualizations that offer insights into why an image is identified as anomalous. This is achieved by introducing a novel trainable entropy-based scoring component for the identification and localization of anomalous instances, and a novel loss function that jointly optimizes the anomaly scoring component with a metric learning objective. Experiments on five public benchmark datasets, including quantitative and qualitative evaluation of interpretability, demonstrate that MeLIAD achieves improved anomaly detection and localization performance compared to state-of-the-art methods.

[AI-9] YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.13592
作者: Abhilash Nandy,Yash Agarwal,Ashish Patwa,Millon Madhur Das,Aman Bansal,Ankit Raj,Pawan Goyal,Niloy Ganguly
关键词-EN: Understanding satire, Satirical Image Detection, current Vision-Language models, satire and humor, Image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Main (Long), 18 pages, 14 figures, 12 tables

点击查看摘要

Abstract:Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at this https URL.

[AI-10] ChainBuddy: An AI Agent System for Generating LLM Pipelines

链接: https://arxiv.org/abs/2409.13588
作者: Jingyue Zhang,Ian Arawjo
关键词-EN: large language models, language models, grown significantly, large language, potential applications
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 12 pages, 5 figures, pre-print

点击查看摘要

Abstract:As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LLM behavior on user-specific tasks and craft effective pipelines to do so. Many users struggle with where to start, often referred to as the “blank page” problem. ChainBuddy, an AI assistant for generating evaluative LLM pipelines built into the ChainForge platform, aims to tackle this issue. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior, making the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload and felt more confident setting up evaluation pipelines of LLM behavior. We derive insights for the future of interfaces that assist users in the open-ended evaluation of AI.

[AI-11] Neurosymbolic Conformal Classification

链接: https://arxiv.org/abs/2409.13585
作者: Arthur Ledaguenel,Céline Hudelot,Mostepha Khouadjia
关键词-EN: improvement of Machine, driven by Deep, Machine Learning, Deep Learning, drastic improvement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 0 figures. arXiv admin note: text overlap with arXiv:2404.08404

点击查看摘要

Abstract:The last decades have seen a drastic improvement of Machine Learning (ML), mainly driven by Deep Learning (DL). However, despite the resounding successes of ML in many domains, the impossibility to provide guarantees of conformity and the fragility of ML systems (faced with distribution shifts, adversarial attacks, etc.) have prevented the design of trustworthy AI systems. Several research paths have been investigated to mitigate this fragility and provide some guarantees regarding the behavior of ML systems, among which are neurosymbolic AI and conformal prediction. Neurosymbolic artificial intelligence is a growing field of research aiming to combine neural network learning capabilities with the reasoning abilities of symbolic systems. One of the objective of this hybridization can be to provide theoritical guarantees that the output of the system will comply with some prior knowledge. Conformal prediction is a set of techniques that enable to take into account the uncertainty of ML systems by transforming the unique prediction into a set of predictions, called a confidence set. Interestingly, this comes with statistical guarantees regarding the presence of the true label inside the confidence set. Both approaches are distribution-free and model-agnostic. In this paper, we see how these two approaches can complement one another. We introduce several neurosymbolic conformal prediction techniques and explore their different characteristics (size of confidence sets, computational complexity, etc.).

[AI-12] Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

链接: https://arxiv.org/abs/2409.13576
作者: Xingtao Lin,Heqian Qiu,Lanxiao Wang,RUihang Wang,Linfeng XU,Hongliang Li
关键词-EN: Contrastive Language-Image Pre-trained, scene text detection, successfully adapted large-scale, adapted large-scale models, Recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder’s input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text’'. To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.

[AI-13] Scalable Multi-agent Reinforcement Learning for Factory-wide Dynamic Scheduling

链接: https://arxiv.org/abs/2409.13571
作者: Jaeyeon Jang,Diego Klabjan,Han Liu,Nital S. Patel,Xiuqi Li,Balakrishnan Ananthanarayanan,Husam Dauod,Tzung-Han Juang
关键词-EN: high decision complexity, notoriously challenging task, modern manufacturing processes, decision complexity, crucial but notoriously
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time dynamic scheduling is a crucial but notoriously challenging task in modern manufacturing processes due to its high decision complexity. Recently, reinforcement learning (RL) has been gaining attention as an impactful technique to handle this challenge. However, classical RL methods typically rely on human-made dispatching rules, which are not suitable for large-scale factory-wide scheduling. To bridge this gap, this paper applies a leader-follower multi-agent RL (MARL) concept to obtain desired coordination after decomposing the scheduling problem into a set of sub-problems that are handled by each individual agent for scalability. We further strengthen the procedure by proposing a rule-based conversion algorithm to prevent catastrophic loss of production capacity due to an agent’s error. Our experimental results demonstrate that the proposed model outperforms the state-of-the-art deep RL-based scheduling models in various aspects. Additionally, the proposed model provides the most robust scheduling performance to demand changes. Overall, the proposed MARL-based scheduling model presents a promising solution to the real-time scheduling problem, with potential applications in various manufacturing industries.

[AI-14] Deep Learning and Machine Learning Advancing Big Data Analytics and Management: Tensorflow Pretrained Models

链接: https://arxiv.org/abs/2409.13566
作者: Keyu Chen,Ziqian Bi,Qian Niu,Junyu Liu,Benji Peng,Sen Zhang,Ming Liu,Ming Li,Xuanhe Pan,Jiawei Xu,Jinlang Wang,Pohsun Feng
关键词-EN: providing detailed guidance, providing detailed, object detection, application of TensorFlow, detailed guidance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This book contains 148 pages and 7 figures

点击查看摘要

Abstract:This book focuses on the application of TensorFlow pre-trained models in deep learning, providing detailed guidance on effectively using these models for tasks such as image classification and object detection. It covers practical implementations of modern architectures like ResNet, MobileNet, and EfficientNet, demonstrating the power of transfer learning through real-world examples and experiments. The book compares linear probing and model fine-tuning, offering visualizations using techniques such as PCA, t-SNE, and UMAP to help readers intuitively understand the impact of different approaches. Designed for beginners to advanced users, this book includes complete example code and step-by-step instructions, enabling readers to quickly master how to leverage pre-trained models to improve performance in practical scenarios. By blending theoretical insights with hands-on practice, this book equips readers with the knowledge to confidently tackle various deep learning challenges.

[AI-15] Efficient Visualization of Neural Networks with Generative Models and Adversarial Perturbations

链接: https://arxiv.org/abs/2409.13559
作者: Athanasios Karagounis
关键词-EN: offering an improvement, paper presents, approach for deep, improvement over existing, generative network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 4 pages, 3 figures

点击查看摘要

Abstract:This paper presents a novel approach for deep visualization via a generative network, offering an improvement over existing methods. Our model simplifies the architecture by reducing the number of networks used, requiring only a generator and a discriminator, as opposed to the multiple networks traditionally involved. Additionally, our model requires less prior training knowledge and uses a non-adversarial training process, where the discriminator acts as a guide rather than a competitor to the generator. The core contribution of this work is its ability to generate detailed visualization images that align with specific class labels. Our model incorporates a unique skip-connection-inspired block design, which enhances label-directed image generation by propagating class information across multiple layers. Furthermore, we explore how these generated visualizations can be utilized as adversarial examples, effectively fooling classification networks with minimal perceptible modifications to the original images. Experimental results demonstrate that our method outperforms traditional adversarial example generation techniques in both targeted and non-targeted attacks, achieving up to a 94.5% fooling rate with minimal perturbation. This work bridges the gap between visualization methods and adversarial examples, proposing that fooling rate could serve as a quantitative measure for evaluating visualization quality. The insights from this study provide a new perspective on the interpretability of neural networks and their vulnerabilities to adversarial attacks.

[AI-16] rustworthy Hate Speech Detection Through Visual Augmentation

链接: https://arxiv.org/abs/2409.13557
作者: Ziyuan Yang,Ming Yan,Yingyu Chen,Hui Wang,Zexin Lu,Yi Zhang
关键词-EN: social media platforms, media platforms poses, hate speech, hate speech detection, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The surge of hate speech on social media platforms poses a significant challenge, with hate speech detection~(HSD) becoming increasingly critical. Current HSD methods focus on enriching contextual information to enhance detection performance, but they overlook the inherent uncertainty of hate speech. We propose a novel HSD method, named trustworthy hate speech detection method through visual augmentation (TrusV-HSD), which enhances semantic information through integration with diffused visual images and mitigates uncertainty with trustworthy loss. TrusV-HSD learns semantic representations by effectively extracting trustworthy information through multi-modal connections without paired data. Our experiments on public HSD datasets demonstrate the effectiveness of TrusV-HSD, showing remarkable improvements over conventional methods.

[AI-17] Generating Visual Stories with Grounded and Coreferent Characters

链接: https://arxiv.org/abs/2409.13555
作者: Danyang Liu,Mirella Lapata,Frank Keller
关键词-EN: Characters, stories, create emotional connections, Abstract, character mentions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Characters are important in narratives. They move the plot forward, create emotional connections, and embody the story’s themes. Visual storytelling methods focus more on the plot and events relating to it, without building the narrative around specific characters. As a result, the generated stories feel generic, with character mentions being absent, vague, or incorrect. To mitigate these issues, we introduce the new task of character-centric story generation and present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions. Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark. Specifically, we develop an automated pipeline to enrich VIST with visual and textual character coreference chains. We also propose new evaluation metrics to measure the richness of characters and coreference in stories. Experimental results show that our model generates stories with recurring characters which are consistent and coreferent to larger extent compared to baselines and state-of-the-art systems.

[AI-18] Certified Adversarial Robustness via Partition-based Randomized Smoothing

链接: https://arxiv.org/abs/2409.13546
作者: Hossein Goli,Farzan Farnia
关键词-EN: additive Gaussian noise, Gaussian noise, classifiers requires robustness, requires robustness certificates, network classifiers requires
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A reliable application of deep neural network classifiers requires robustness certificates against adversarial perturbations. Gaussian smoothing is a widely analyzed approach to certifying robustness against norm-bounded perturbations, where the certified prediction radius depends on the variance of the Gaussian noise and the confidence level of the neural net’s prediction under the additive Gaussian noise. However, in application to high-dimensional image datasets, the certified radius of the plain Gaussian smoothing could be relatively small, since Gaussian noise with high variances can significantly harm the visibility of an image. In this work, we propose the Pixel Partitioning-based Randomized Smoothing (PPRS) methodology to boost the neural net’s confidence score and thus the robustness radius of the certified prediction. We demonstrate that the proposed PPRS algorithm improves the visibility of the images under additive Gaussian noise. We discuss the numerical results of applying PPRS to standard computer vision datasets and neural network architectures. Our empirical findings indicate a considerable improvement in the certified accuracy and stability of the prediction model to the additive Gaussian noise in randomized smoothing.

[AI-19] First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

链接: https://arxiv.org/abs/2409.13538
作者: Yingzhe Peng,Yixiao Yuan,Zitian Ao,Huapeng Zhou,Kangqi Wang,Qipeng Zhu,Xu Yang
关键词-EN: Video Question Answering, Multiple-choice Video Question, Perception Test Challenge, Question Answering, Multiple-choice Video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tune it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.

[AI-20] ShizishanGPT: An Agricultural Large Language Model Integrating Tools and Resources

链接: https://arxiv.org/abs/2409.13537
作者: Shuting Yang,Zehui Liu,Wolfgang Mayer
关键词-EN: handle complex inquiries, Recent developments, intelligent dialogue systems’ability, complex inquiries, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 15 pages,3 figures, WISE2024

点击查看摘要

Abstract:Recent developments in large language models (LLMs) have led to significant improvements in intelligent dialogue systems’ability to handle complex inquiries. However, current LLMs still exhibit limitations in specialized domain knowledge, particularly in technical fields such as agriculture. To address this problem, we propose ShizishanGPT, an intelligent question answering system for agriculture based on the Retrieval Augmented Generation (RAG) framework and agent architecture. ShizishanGPT consists of five key modules: including a generic GPT-4 based module for answering general questions; a search engine module that compensates for the problem that the large language model’s own knowledge cannot be updated in a timely manner; an agricultural knowledge graph module for providing domain facts; a retrieval module which uses RAG to supplement domain knowledge; and an agricultural agent module, which invokes specialized models for crop phenotype prediction, gene expression analysis, and so on. We evaluated the ShizishanGPT using a dataset containing 100 agricultural questions specially designed for this study. The experimental results show that the tool significantly outperforms general LLMs as it provides more accurate and detailed answers due to its modular design and integration of different domain knowledge sources. Our source code, dataset, and model weights are publicly available at this https URL.

[AI-21] Contextualized AI for Cyber Defense: An Automated Survey using LLMs

链接: https://arxiv.org/abs/2409.13524
作者: Christoforus Yoga Haryanto,Anne Maria Elvira,Trung Duc Nguyen,Minh Hieu Vu,Yoshiano Hartanto,Emily Lomempow,Arathi Arakala
关键词-EN: cyber defense capabilities, enhancing cyber defense, revealing significant research, significant research growth, defense capabilities
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2 figures, 4 tables, accepted into 17th International Conference on Security of Information and Networks (SINCONF 2024)

点击查看摘要

Abstract:This paper surveys the potential of contextualized AI in enhancing cyber defense capabilities, revealing significant research growth from 2015 to 2024. We identify a focus on robustness, reliability, and integration methods, while noting gaps in organizational trust and governance frameworks. Our study employs two LLM-assisted literature survey methodologies: (A) ChatGPT 4 for exploration, and (B) Gemma 2:9b for filtering with Claude 3.5 Sonnet for full-text analysis. We discuss the effectiveness and challenges of using LLMs in academic research, providing insights for future researchers.

[AI-22] A Survey on Moral Foundation Theory and Pre-Trained Language Models: Current Advances and Challenges

链接: https://arxiv.org/abs/2409.13521
作者: Lorenzo Zangari,Candida M. Greco,Davide Picca,Andrea Tagarelli
关键词-EN: regulated societal order, Moral Foundation Theory, early civilizations, codified within norms, common good
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Moral values have deep roots in early civilizations, codified within norms and laws that regulated societal order and the common good. They play a crucial role in understanding the psychological basis of human behavior and cultural orientation. The Moral Foundation Theory (MFT) is a well-established framework that identifies the core moral foundations underlying the manner in which different cultures shape individual and social lives. Recent advancements in natural language processing, particularly Pre-trained Language Models (PLMs), have enabled the extraction and analysis of moral dimensions from textual data. This survey presents a comprehensive review of MFT-informed PLMs, providing an analysis of moral tendencies in PLMs and their application in the context of the MFT. We also review relevant datasets and lexicons and discuss trends, limitations, and future directions. By providing a structured overview of the intersection between PLMs and MFT, this work bridges moral psychology insights within the realm of PLMs, paving the way for further research and development in creating morally aware AI systems.

[AI-23] SatFed: A Resource-Efficient LEO Satellite-Assisted Heterogeneous Federated Learning Framework

链接: https://arxiv.org/abs/2409.13503
作者: Yuxin Zhang,Zheng Lin,Zhe Chen,Zihan Fang,Wenjun Zhu,Xianhao Chen,Jin Zhao,Yue Gao
关键词-EN: Traditional federated learning, congestion significantly hinder, frameworks rely heavily, hinder model convergence, increasing bandwidth congestion
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 12 figures

点击查看摘要

Abstract:Traditional federated learning (FL) frameworks rely heavily on terrestrial networks, where coverage limitations and increasing bandwidth congestion significantly hinder model convergence. Fortunately, the advancement of low-Earth orbit (LEO) satellite networks offers promising new communication avenues to augment traditional terrestrial FL. Despite this potential, the limited satellite-ground communication bandwidth and the heterogeneous operating environments of ground devices-including variations in data, bandwidth, and computing power-pose substantial challenges for effective and robust satellite-assisted FL. To address these challenges, we propose SatFed, a resource-efficient satellite-assisted heterogeneous FL framework. SatFed implements freshness-based model prioritization queues to optimize the use of highly constrained satellite-ground bandwidth, ensuring the transmission of the most critical models. Additionally, a multigraph is constructed to capture real-time heterogeneous relationships between devices, including data distribution, terrestrial bandwidth, and computing capability. This multigraph enables SatFed to aggregate satellite-transmitted models into peer guidance, enhancing local training in heterogeneous environments. Extensive experiments with real-world LEO satellite networks demonstrate that SatFed achieves superior performance and robustness compared to state-of-the-art benchmarks.

[AI-24] HUT: A More Computation Efficient Fine-Tuning Method With Hadamard Updated Transformation

链接: https://arxiv.org/abs/2409.13501
作者: Geyuan Zhang,Xiaofei Zhou,Chuheng Chen
关键词-EN: Fine-tuning pre-trained language, pre-trained language models, achieved impressive results, Parameter Efficient Fine-Tuning, pre-trained language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning pre-trained language models for downstream tasks has achieved impressive results in NLP. However, fine-tuning all parameters becomes impractical due to the rapidly increasing size of model parameters. To address this, Parameter Efficient Fine-Tuning (PEFT) methods update only a subset of parameters. Most PEFT methods, such as LoRA, use incremental updates, which involve adding learned weight matrix increments to the original parameters. Although effective, these methods face limitations in capturing complex parameter dynamics and do not maintain a strong correlation between the original and updated parameters. To overcome these challenges, we propose the direct Updated Transformation (UT) paradigm, which constructs a transformation directly from the original to the updated parameters. This approach ensures that the correlation between the original and updated parameters is preserved, leveraging the semantic features learned during pre-training. Building on this paradigm, we present the Hadamard Updated Transformation (HUT) method. HUT efficiently updates the original weight matrix using the Hadamard transformation with two low-rank matrices, offering a more expressive and flexible update mechanism. This allows HUT to capture richer parameter features through functional transformations, reducing computational complexity while maintaining or improving model quality. Theoretical analysis and extensive experiments on RoBERTa and GPT-2 validate the effectiveness of HUT. Results show that HUT performs on par with or better than other PEFT methods in terms of model quality, while significantly reducing computational complexity.

[AI-25] DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-light Enhancement and Deblurring

链接: https://arxiv.org/abs/2409.13496
作者: Ling Wang,Chen Wu,Lin Wang
关键词-EN: long exposure time, motion blur caused, RGB cameras, Autonomous vehicles, time of RGB
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous vehicles and robots often struggle with reliable visual perception at night due to the low illumination and motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting the off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (\eg, color distortions) in the over-exposed regions or make it hardly possible to learn the motion cues of the dark regions. In this paper, we interestingly find vision-language models, \eg, Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks, such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels from images at night. This subtly enables learning rich semantic information and visual representation for optimization of the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For demo and more results, please check the project page: \urlthis https URL.

[AI-26] Since Lawyers are Males…: Examining Implicit Gender Bias in Hindi Language Generation by LLMs

链接: https://arxiv.org/abs/2409.13484
作者: Ishika Joshi,Ishita Gupta,Adrita Dey,Tapan Parikh
关键词-EN: Large Language Models, Large Language, customer support, Hindi, gender biases
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being used to generate text across various languages, for tasks such as translation, customer support, and education. Despite these advancements, LLMs show notable gender biases in English, which become even more pronounced when generating content in relatively underrepresented languages like Hindi. This study explores implicit gender biases in Hindi text generation and compares them to those in English. We developed Hindi datasets inspired by WinoBias to examine stereotypical patterns in responses from models like GPT-4o and Claude-3 sonnet. Our results reveal a significant gender bias of 87.8% in Hindi, compared to 33.4% in English GPT-4o generation, with Hindi responses frequently relying on gender stereotypes related to occupations, power hierarchies, and social class. This research underscores the variation in gender biases across languages and provides considerations for navigating these biases in generative AI systems.

[AI-27] Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study

链接: https://arxiv.org/abs/2409.13476
作者: Tirtha Chanda,Sarah Haggenmueller,Tabea-Clara Bucher,Tim Holland-Letz,Harald Kittler,Philipp Tschandl,Markus V. Heppt,Carola Berking,Jochen S. Utikal,Bastian Schilling,Claudia Buerger,Cristian Navarrete-Dechent,Matthias Goebeler,Jakob Nikolas Kather,Carolin V. Schneider,Benjamin Durani,Hendrike Durani,Martin Jansen,Juliane Wacker,Joerg Wacker,Reader Study Consortium,Titus J. Brinker
关键词-EN: enhancing clinicians’ confidence, Artificial intelligence, substantially improved dermatologists’, AI-driven decisions, enhancing clinicians’
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems have substantially improved dermatologists’ diagnostic accuracy for melanoma, with explainable AI (XAI) systems further enhancing clinicians’ confidence and trust in AI-driven decisions. Despite these advancements, there remains a critical need for objective evaluation of how dermatologists engage with both AI and XAI tools. In this study, 76 dermatologists participated in a reader study, diagnosing 16 dermoscopic images of melanomas and nevi using an XAI system that provides detailed, domain-specific explanations. Eye-tracking technology was employed to assess their interactions. Diagnostic performance was compared with that of a standard AI system lacking explanatory features. Our findings reveal that XAI systems improved balanced diagnostic accuracy by 2.8 percentage points relative to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and complex lesions were associated with elevated cognitive load, as evidenced by increased ocular fixations. These insights have significant implications for clinical practice, the design of AI tools for visual tasks, and the broader development of XAI in medical diagnostics.

[AI-28] Deterministic versus stochastic dynamical classifiers: opposing random adversarial attacks with noise

链接: https://arxiv.org/abs/2409.13470
作者: Lorenzo Chicchi,Duccio Fanelli,Diego Febbe,Lorenzo Buffoni,Francesca Di Patti,Lorenzo Giambagli,Raffele Marino
关键词-EN: Continuous-Variable Firing Rate, excitatory biological neurons, dynamically assisted classifier, veritable dynamically assisted, Firing Rate
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Continuous-Variable Firing Rate (CVFR) model, widely used in neuroscience to describe the intertangled dynamics of excitatory biological neurons, is here trained and tested as a veritable dynamically assisted classifier. To this end the model is supplied with a set of planted attractors which are self-consistently embedded in the inter-nodes coupling matrix, via its spectral decomposition. Learning to classify amounts to sculp the basin of attraction of the imposed equilibria, directing different items towards the corresponding destination target, which reflects the class of respective pertinence. A stochastic variant of the CVFR model is also studied and found to be robust to aversarial random attacks, which corrupt the items to be classified. This remarkable finding is one of the very many surprising effects which arise when noise and dynamical attributes are made to mutually resonate.

[AI-29] Global Outlier Detection in a Federated Learning Setting with Isolation Forest

链接: https://arxiv.org/abs/2409.13466
作者: Daniele Malpetti,Laura Azzimonti
关键词-EN: detecting global outliers, federated learning setting, learning setting, cross-silo scenarios, strategy for detecting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at FLTA 2024: The 2nd IEEE International Conference on Federated Learning Technologies and Applications

点击查看摘要

Abstract:We present a novel strategy for detecting global outliers in a federated learning setting, targeting in particular cross-silo scenarios. Our approach involves the use of two servers and the transmission of masked local data from clients to one of the servers. The masking of the data prevents the disclosure of sensitive information while still permitting the identification of outliers. Moreover, to further safeguard privacy, a permutation mechanism is implemented so that the server does not know which client owns any masked data point. The server performs outlier detection on the masked data, using either Isolation Forest or its extended version, and then communicates outlier information back to the clients, allowing them to identify and remove outliers in their local datasets before starting any subsequent federated model training. This approach provides comparable results to a centralized execution of Isolation Forest algorithms on plain data.

[AI-30] CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

链接: https://arxiv.org/abs/2409.13430
作者: Zhangchen Ye,Tao Jiang,Chenfeng Xu,Yiming Li,Hang Zhao
关键词-EN: occupancy prediction, depth estimation, significantly challenged, inherent limitations, limitations of monocular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \urlthis https URL.

[AI-31] A User Study on Contrastive Explanations for Multi-Effector Temporal Planning with Non-Stationary Costs

链接: https://arxiv.org/abs/2409.13427
作者: Xiaowei Liu,Kevin McAreavey,Weiru Liu
关键词-EN: adopt constrastive explanations, smart homes, adopt constrastive, end-user application, temporal planning
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we adopt constrastive explanations within an end-user application for temporal planning of smart homes. In this application, users have requirements on the execution of appliance tasks, pay for energy according to dynamic energy tariffs, have access to high-capacity battery storage, and are able to sell energy to the grid. The concurrent scheduling of devices makes this a multi-effector planning problem, while the dynamic tariffs yield costs that are non-stationary (alternatively, costs that are stationary but depend on exogenous events). These characteristics are such that the planning problems are generally not supported by existing PDDL-based planners, so we instead design a custom domain-dependent planner that scales to reasonable appliance numbers and time horizons. We conduct a controlled user study with 128 participants using an online crowd-sourcing platform based on two user stories. Our results indicate that users provided with contrastive questions and explanations have higher levels of satisfaction, tend to gain improved understanding, and rate the helpfulness more favourably with the recommended AI schedule compared to those without access to these features.

[AI-32] Sine Wave Normalization for Deep Learning-Based Tumor Segmentation in CT/PET Imaging

链接: https://arxiv.org/abs/2409.13410
作者: Jintao Ren,Muheng Li,Stine Sofia Korreman
关键词-EN: autoPET III Challenge, III Challenge, automated tumor segmentation, autoPET III, PET scans
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: Report for Team WukongRT in the AutoPET III Challenge

点击查看摘要

Abstract:This report presents a normalization block for automated tumor segmentation in CT/PET scans, developed for the autoPET III Challenge. The key innovation is the introduction of the SineNormal, which applies periodic sine transformations to PET data to enhance lesion detection. By highlighting intensity variations and producing concentric ring patterns in PET highlighted regions, the model aims to improve segmentation accuracy, particularly for challenging multitracer PET datasets. The code for this project is available on GitHub (this https URL).

[AI-33] Validation Exploration of Multimodal Deep-Learning Camera-Lidar Calibration models

链接: https://arxiv.org/abs/2409.13402
作者: Venkat Karramreddy,Liam Mitchell
关键词-EN: multi-modal sensor systems, implementing deep learning, deep learning architectures, article presents, presents an innovative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:This article presents an innovative study in exploring, evaluating, and implementing deep learning architectures for the calibration of multi-modal sensor systems. The focus behind this is to leverage the use of sensor fusion to achieve dynamic, real-time alignment between 3D LiDAR and 2D Camera sensors. static calibration methods are tedious and time-consuming, which is why we propose utilizing Conventional Neural Networks (CNN) coupled with geometrically informed learning to solve this issue. We leverage the foundational principles of Extrinsic LiDAR-Camera Calibration tools such as RegNet, CalibNet, and LCCNet by exploring open-source models that are available online and comparing our results with their corresponding research papers. Requirements for extracting these visual and measurable outputs involved tweaking source code, fine-tuning, training, validation, and testing for each of these frameworks for equal comparisons. This approach aims to investigate which of these advanced networks produces the most accurate and consistent predictions. Through a series of experiments, we reveal some of their shortcomings and areas for potential improvements along the way. We find that LCCNet yields the best results out of all the models that we validated.

[AI-34] Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis ICASSP2025

链接: https://arxiv.org/abs/2409.13382
作者: Lauri Juvela,Xin Wang
关键词-EN: Automatic detection, current synthesis methods, Audio, increasingly important, important as current
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Automatic detection of synthetic speech is becoming increasingly important as current synthesis methods are both near indistinguishable from human speech and widely accessible to the public. Audio watermarking and other active disclosure methods of are attracting research activity, as they can complement traditional deepfake defenses based on passive detection. In both active and passive detection, robustness is of major interest. Traditional audio watermarks are particularly susceptible to removal attacks by audio codec application. Most generated speech and audio content released into the wild passes through an audio codec purely as a distribution method. We recently proposed collaborative watermarking as method for making generated speech more easily detectable over a noisy but differentiable transmission channel. This paper extends the channel augmentation to work with non-differentiable traditional audio codecs and neural audio codecs and evaluates transferability and effect of codec bitrate over various configurations. The results show that collaborative watermarking can be reliably augmented by black-box audio codecs using a waveform-domain straight-through-estimator for gradient approximation. Furthermore, that results show that channel augmentation with a neural audio codec transfers well to traditional codecs. Listening tests demonstrate collaborative watermarking incurs negligible perceptual degradation with high bitrate codecs or DAC at 8kbps.

[AI-35] LLMs Still Cant Plan; Can LRMs? A Preliminary Evaluation of OpenAIs o1 on PlanBench

链接: https://arxiv.org/abs/2409.13373
作者: Karthik Valmeekam,Kaya Stechly,Subbarao Kambhampati
关键词-EN: ability to plan, action that achieves, achieves a desired, desired state, state of affairs
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open source LLMs since GPT3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees which must be considered before deploying such systems.

[AI-36] RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning

链接: https://arxiv.org/abs/2409.13366
作者: Wenhui Diao,Haichen Yu,Kaiyue Kang,Tong Ling,Di Liu,Yingchao Feng,Hanbo Bi,Libo Ren,Xuexue Li,Yongqiang Mao,Xian Sun
关键词-EN: Aerial Remote Sensing, Aerial Remote, Remote Sensing, pose significant challenges, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model’s detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model’s adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and effectiveness of RingMo-Aerial in enhancing the performance of ARS vision tasks.

[AI-37] FPBoost: Fully Parametric Gradient Boosting for Survival Analysis

链接: https://arxiv.org/abs/2409.13363
作者: Alberto Archetti,Eugenio Lomurno,Diego Piccinotti,Matteo Matteucci
关键词-EN: valuable clinical insights, extracting valuable clinical, tool for analyzing, data and extracting, clinical insights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Survival analysis is a critical tool for analyzing time-to-event data and extracting valuable clinical insights. Recently, numerous machine learning techniques leveraging neural networks and decision trees have been developed for this task. Among these, the most successful approaches often rely on specific assumptions about the shape of the modeled hazard function. These assumptions include proportional hazard, accelerated failure time, or discrete estimation at a predefined set of time points. In this study, we propose a novel paradigm for survival model design based on the weighted sum of individual fully parametric hazard contributions. We build upon well-known ensemble techniques to deliver a novel contribution to the field by applying additive hazard functions, improving over approaches based on survival or cumulative hazard functions. Furthermore, the proposed model, which we call FPBoost, is the first algorithm to directly optimize the survival likelihood via gradient boosting. We evaluated our approach across a diverse set of datasets, comparing it against a variety of state-of-the-art models. The results demonstrate that FPBoost improves risk estimation, according to both concordance and calibration metrics.

[AI-38] EmotionQueen: A Benchmark for Evaluating Empathy of Large Language Models ACL2024

链接: https://arxiv.org/abs/2409.13359
作者: Yuyan Chen,Hao Wang,Songzhou Yan,Sijia Liu,Yueze Li,Yi Zhao,Yanghua Xiao
关键词-EN: Natural Language Processing, large language models, Language Processing, Natural Language, importance in Natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024 (Findings)

点击查看摘要

Abstract:Emotional intelligence in large language models (LLMs) is of great importance in Natural Language Processing. However, the previous research mainly focus on basic sentiment analysis tasks, such as emotion recognition, which is not enough to evaluate LLMs’ overall emotional intelligence. Therefore, this paper presents a novel framework named EmotionQueen for evaluating the emotional intelligence of LLMs. The framework includes four distinctive tasks: Key Event Recognition, Mixed Event Recognition, Implicit Emotional Recognition, and Intention Recognition. LLMs are requested to recognize important event or implicit emotions and generate empathetic response. We also design two metrics to evaluate LLMs’ capabilities in recognition and response for emotion-related statements. Experiments yield significant conclusions about LLMs’ capabilities and limitations in emotion intelligence.

[AI-39] Recent Advancement of Emotion Cognition in Large Language Models

链接: https://arxiv.org/abs/2409.13354
作者: Yuyan Chen,Yanghua Xiao
关键词-EN: large language models, mental health assessment, human-computer interaction, language models, social media
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Emotion cognition in large language models (LLMs) is crucial for enhancing performance across various applications, such as social media, human-computer interaction, and mental health assessment. We explore the current landscape of research, which primarily revolves around emotion classification, emotionally rich response generation, and Theory of Mind assessments, while acknowledge the challenges like dependency on annotated data and complexity in emotion processing. In this paper, we present a detailed survey of recent progress in LLMs for emotion cognition. We explore key research studies, methodologies, outcomes, and resources, aligning them with Ulric Neisser’s cognitive stages. Additionally, we outline potential future directions for research in this evolving field, including unsupervised learning approaches and the development of more complex and interpretable emotion cognition LLMs. We also discuss advanced methods such as contrastive learning used to improve LLMs’ emotion cognition capabilities.

[AI-40] ID-Guard: A Universal Framework for Combating Facial Manipulation via Breaking Identification

链接: https://arxiv.org/abs/2409.13349
作者: Zuomin Qu,Wei Lu,Xiangyang Luo,Qian Wang,Xiaochun Cao
关键词-EN: deep learning-based facial, learning-based facial manipulation, facial manipulation poses, misuse of deep, deep learning-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The misuse of deep learning-based facial manipulation poses a potential threat to civil rights. To prevent this fraud at its source, proactive defense technology was proposed to disrupt the manipulation process by adding invisible adversarial perturbations into images, making the forged output unconvincing to the observer. However, their non-directional disruption of the output may result in the retention of identity information of the person in the image, leading to stigmatization of the individual. In this paper, we propose a novel universal framework for combating facial manipulation, called ID-Guard. Specifically, this framework requires only a single forward pass of an encoder-decoder network to generate a cross-model universal adversarial perturbation corresponding to a specific facial image. To ensure anonymity in manipulated facial images, a novel Identity Destruction Module (IDM) is introduced to destroy the identifiable information in forged faces targetedly. Additionally, we optimize the perturbations produced by considering the disruption towards different facial manipulations as a multi-task learning problem and design a dynamic weights strategy to improve cross-model performance. The proposed framework reports impressive results in defending against multiple widely used facial manipulations, effectively distorting the identifiable regions in the manipulated facial images. In addition, our experiments reveal the ID-Guard’s ability to enable disrupted images to avoid face inpaintings and open-source image recognition systems.

[AI-41] Imagine yourself: Tuning-Free Personalized Image Generation

链接: https://arxiv.org/abs/2409.13346
作者: Zecheng He,Bo Sun,Felix Juefei-Xu,Haoyu Ma,Ankit Ramchandani,Vincent Cheung,Siddharth Shah,Anmol Kalia,Harihar Subramanyam,Alireza Zareian,Li Chen,Ankit Jain,Ning Zhang,Peizhao Zhang,Roshan Sumbaly,Peter Vajda,Animesh Sinha
关键词-EN: demonstrated remarkable efficacy, Diffusion models, demonstrated remarkable, remarkable efficacy, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work met challenges balancing identity preservation, following complex prompts and preserving good visual quality, resulting in models having strong copy-paste effect of the reference images. Thus, they can hardly generate images following prompts that require significant changes to the reference image, \eg, changing facial expression, head and body poses, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve the text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model’s SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to the previous personalization models.

[AI-42] A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing

链接: https://arxiv.org/abs/2409.13345
作者: Yi Ren,Tianyi Zhang,Zhixiong Han,Weibin Li,Zhiyang Wang,Wenbo Ji,Chenhao Qin,Chenbin Liang,Licheng Jiao
关键词-EN: adaptive fine-tuning algorithm, propose an adaptive, adaptive fine-tuning, data, algorithm
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose an adaptive fine-tuning algorithm for multimodal large models. The core steps of this algorithm involve two stages of truncation. First, the vast amount of data is projected into a semantic vector space, and the MiniBatchKMeans algorithm is used for automated clustering. This classification ensures that the data within each cluster exhibit high semantic similarity. Next, we process the data in each cluster, calculating the translational difference between the original and perturbed data in the multimodal large model’s vector space. This difference serves as a generalization metric for the data. Based on this metric, we select the data with high generalization potential for training. We applied this algorithm to train the InternLM-XComposer2-VL-7B model on two 3090 GPUs using one-third of the GeoChat multimodal remote sensing dataset. The results demonstrate that our algorithm outperforms the state-of-the-art baselines. various baselines. The model trained on our optimally chosen one-third dataset, based on experimental validation, exhibited only 1% reduction in performance across various remote sensing metrics compared to the model trained on the full dataset. This approach significantly preserved general-purpose capabilities while reducing training time by 68.2%. Furthermore, the model achieved scores of 89.86 and 77.19 on the UCMerced and AID evaluation datasets, respectively, surpassing the GeoChat dataset by 5.43 and 5.16 points. It only showed a 0.91-point average decrease on the LRBEN evaluation dataset.

[AI-43] me Awareness in Large Language Models : Benchmarking Fact Recall Across Time

链接: https://arxiv.org/abs/2409.13338
作者: David Herel,Vojtech Bartek,Tomas Mikolov
关键词-EN: President, Abstract, question is asked, models, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. In this paper, we introduce a novel dataset designed to rigorously test LLMs’ ability to handle time-sensitive facts. Our benchmark offers a systematic way to measure how well LLMs align their knowledge with the correct time context, filling a key gap in current evaluation methods and offering a valuable tool for improving real-world applicability in future models.

[AI-44] SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

链接: https://arxiv.org/abs/2409.13321
作者: Jinge Wu,Yunsoo Kim,Daqian Shi,David Cliffton,Fenglin Liu,Honghan Wu
关键词-EN: growing research interest, assist clinicians, large language models, success of large, growing research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re ^3 Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.

[AI-45] GAProtoNet: A Multi-head Graph Attention-based Prototypical Network for Interpretable Text Classification COLING2025

链接: https://arxiv.org/abs/2409.13312
作者: Ximing Wen,Wenjuan Tan,Rosina O. Weber
关键词-EN: Pretrained transformer-based Language, transformer-based Language Models, powerful word embeddings, text classification tasks, Pretrained transformer-based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figues, submitted to COLING 2025

点击查看摘要

Abstract:Pretrained transformer-based Language Models (LMs) are well-known for their ability to achieve significant improvement on text classification tasks with their powerful word embeddings, but their black-box nature, which leads to a lack of interpretability, has been a major concern. In this work, we introduce GAProtoNet, a novel white-box Multi-head Graph Attention-based Prototypical Network designed to explain the decisions of text classification models built with LM encoders. In our approach, the input vector and prototypes are regarded as nodes within a graph, and we utilize multi-head graph attention to selectively construct edges between the input node and prototype nodes to learn an interpretable prototypical representation. During inference, the model makes decisions based on a linear combination of activated prototypes weighted by the attention score assigned for each prototype, allowing its choices to be transparently explained by the attention weights and the prototypes projected into the closest matching training examples. Experiments on multiple public datasets show our approach achieves superior results without sacrificing the accuracy of the original black-box LMs. We also compare with four alternative prototypical network variations and our approach achieves the best accuracy and F1 among all. Our case study and visualization of prototype clusters also demonstrate the efficiency in explaining the decisions of black-box models built with LMs.

[AI-46] OMG-RL:Offline Model-based Guided Reward Learning for Heparin Treatment

链接: https://arxiv.org/abs/2409.13299
作者: Yooseok Lim,Sujee Lee
关键词-EN: medical decision-making processes, personalized medical decision-making, individual patient conditions, Accurate diagnosis, medication dosing strategies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate diagnosis of individual patient conditions and appropriate medication dosing strategies are core elements of personalized medical decision-making processes. This therapeutic procedure, which entails recursively assessing the patient’s condition and administering suitable medications, can effectively be modeled as a reinforcement learning (RL) problem. Crucially, the success of RL in this context depends on the establishment of a well-defined reward function that accurately represents the optimal treatment strategy. However, defining the learning direction in RL with only a limited set of explicit indicators complicates the task due to the inherent complexity of the required domain knowledge. This approach may also increase the likelihood that the RL policy does not adequately reflect the clinician’s treatment intentions, which are determined by considering various situations and indicators. In this study, we focus on developing a reward function that reflects the clinician’s intentions and introduce Offline Model-based Guided Reward Learning (OMG-RL), which performs offline inverse reinforcement learning (IRL) aligned with the offline RL environment. Through OMG-RL, we learn a parameterized reward function that includes the expert’s intentions from limited data, thereby enhancing the agent’s policy. We validate the proposed approach on the heparin dosing task. The results demonstrate that policy learning through OMG-RL is meaningful and confirm that the learned policy is positively reinforced in terms of activated partial thromboplastin time (aPTT), a key indicator for monitoring the effects of heparin. This approach can be broadly utilized not only for the heparin dosing problem but also for RL-based medication dosing tasks in general.

[AI-47] me Distributed Deep Learning models for Purely Exogenous Forecasting. Application to Water Table Depth Prediction using Weather Image Time Series

链接: https://arxiv.org/abs/2409.13284
作者: Matteo Salis,Abdourrahmane M. Atto,Stefano Ferraris,Rosa Meo
关键词-EN: resources management framework, sustainable resources management, Groundwater resources, Time Distributed Convolutional, management framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Groundwater resources are one of the most relevant elements in the water cycle, therefore developing models to accurately predict them is a pivotal task in the sustainable resources management framework. Deep Learning (DL) models have been revealed very effective in hydrology, especially by feeding spatially distributed data (e.g. raster data). In many regions, hydrological measurements are difficult to obtain regularly or periodically in time, and in some cases, last available data are not up to date. Reversely, weather data, which significantly impacts water resources, are usually more available and with higher quality. More specifically, we have proposed two different DL models to predict the water table depth in the Grana-Maira catchment (Piemonte, IT) using only exogenous weather image time series. To deal with the image time series, both models are made of a first Time Distributed Convolutional Neural Network (TDC) which encodes the image available at each time step into a vectorial representation. The first model, TDC-LSTM uses then a Sequential Module based on an LSTM layer to learn temporal relations and output the predictions. The second model, TDC-UnPWaveNet uses instead a new version of the WaveNet architecture, adapted here to output a sequence shorter and completely shifted in the future with respect to the input one. To this aim, and to deal with the different sequence lengths in the UnPWaveNet, we have designed a new Channel Distributed layer, that acts like a Time Distributed one but on the channel dimension, i.e. applying the same set of operations to each channel of the input. TDC-LSTM and TDC-UnPWaveNet have shown both remarkable results. However, the two models have focused on different learnable information: TDC-LSTM has focused more on lowering the bias, while the TDC-UnPWaveNet has focused more on the temporal dynamics maximising correlation and KGE.

[AI-48] Leveraging Knowledge Graphs and LLMs to Support and Monitor Legislative Systems

链接: https://arxiv.org/abs/2409.13252
作者: Andrea Colombo
关键词-EN: enhancing data analytics, organize large datasets, Knowledge Graphs, Legislative Knowledge Graphs, datasets into structured
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Knowledge Graphs (KGs) have been used to organize large datasets into structured, interconnected information, enhancing data analytics across various fields. In the legislative context, one potential natural application of KGs is modeling the intricate set of interconnections that link laws and their articles with each other and the broader legislative context. At the same time, the rise of large language models (LLMs) such as GPT has opened new opportunities in legal applications, such as text generation and document drafting. Despite their potential, the use of LLMs in legislative contexts is critical since it requires the absence of hallucinations and reliance on up-to-date information, as new laws are published on a daily basis. This work investigates how Legislative Knowledge Graphs and LLMs can synergize and support legislative processes. We address three key questions: the benefits of using KGs for legislative systems, how LLM can support legislative activities by ensuring an accurate output, and how we can allow non-technical users to use such technologies in their activities. To this aim, we develop Legis AI Platform, an interactive platform focused on Italian legislation that enhances the possibility of conducting legislative analysis and that aims to support lawmaking activities. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2409.13252 [cs.DB] (or arXiv:2409.13252v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2409.13252 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3627673.3680268 Focus to learn more DOI(s) linking to related resources

[AI-49] From Cognition to Precognition: A Future-Aware Framework for Social Navigation

链接: https://arxiv.org/abs/2409.13244
作者: Zeying Gong,Tianshuai Hu,Ronghe Qiu,Junwei Liang
关键词-EN: anticipate future human, navigate safely, safely and efficiently, efficiently in crowded, perceive the current
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Social Navigation; Trajectory Prediction; Auxiliary Tasks

点击查看摘要

Abstract:To navigate safely and efficiently in crowded spaces, robots should not only perceive the current state of the environment but also anticipate future human movements. In this paper, we propose a reinforcement learning architecture, namely Falcon, to tackle socially-aware navigation by explicitly predicting human trajectories and penalizing actions that block future human paths. To facilitate realistic evaluation, we introduce a novel SocialNav benchmark containing two new datasets, Social-HM3D and Social-MP3D. This benchmark offers large-scale photo-realistic indoor scenes populated with a reasonable amount of human agents based on scene area size, incorporating natural human movements and trajectory patterns. We conduct a detailed experimental analysis with the state-of-the-art learning-based method and two classic rule-based path-planning algorithms on the new benchmark. The results demonstrate the importance of future prediction and our method achieves the best task success rate of 55% while maintaining about 90% personal space compliance. We will release our code and datasets. Videos of demonstrations can be viewed at this https URL .

[AI-50] Relationship between Uncertainty in DNNs and Adversarial Attacks

链接: https://arxiv.org/abs/2409.13232
作者: Abigail Adeniran,Adewale Adeyemo
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, natural language processing, outperformed human accuracy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: review

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have achieved state of the art results and even outperformed human accuracy in many challenging tasks, leading to DNNs adoption in a variety of fields including natural language processing, pattern recognition, prediction, and control optimization. However, DNNs are accompanied by uncertainty about their results, causing them to predict an outcome that is either incorrect or outside of a certain level of confidence. These uncertainties stem from model or data constraints, which could be exacerbated by adversarial attacks. Adversarial attacks aim to provide perturbed input to DNNs, causing the DNN to make incorrect predictions or increase model uncertainty. In this review, we explore the relationship between DNN uncertainty and adversarial attacks, emphasizing how adversarial attacks might raise DNN uncertainty.

[AI-51] Redefining Data Pairing for Motion Retargeting Leveraging a Human Body Prior IROS2024

链接: https://arxiv.org/abs/2409.13208
作者: Xiyana Figuera,Soogeun Park,Hyemin Ahn
关键词-EN: HUman, robot poses, poses, data, random robot poses
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 5 Figures, Accepted at IROS 2024

点击查看摘要

Abstract:We propose MR.HuBo (Motion Retargeting leveraging a HUman BOdy prior), a cost-effective and convenient method to collect high-quality upper body paired \langle \textrobot, human \rangle pose data, which is essential for data-driven motion retargeting methods. Unlike existing approaches which collect \langle \textrobot, human \rangle pose data by converting human MoCap poses into robot poses, our method goes in reverse. We first sample diverse random robot poses, and then convert them into human poses. However, since random robot poses can result in extreme and infeasible human poses, we propose an additional technique to sort out extreme poses by exploiting a human body prior trained from a large amount of human pose data. Our data collection method can be used for any humanoid robots, if one designs or optimizes the system’s hyperparameters which include a size scale factor and the joint angle ranges for sampling. In addition to this data collection method, we also present a two-stage motion retargeting neural network that can be trained via supervised learning on a large amount of paired data. Compared to other learning-based methods trained via unsupervised learning, we found that our deep neural network trained with ample high-quality paired data achieved notable performance. Our experiments also show that our data filtering method yields better retargeting results than training the model with raw and noisy data. Our code and video results are available on this https URL

[AI-52] An adapted large language model facilitates multiple medical tasks in diabetes care

链接: https://arxiv.org/abs/2409.13191
作者: Lai Wei,Zhen Ying,Muyang He,Yutong Chen,Qian Yang,Yanzhe Hong,Jiaping Lu,Xiaoying Li,Weiran Huang,Ying Chen
关键词-EN: global health burden, requires multi-stakeholder collaboration, significant global health, management requires multi-stakeholder, optimizing diabetes management
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diabetes is a chronic disease that poses a significant global health burden, and optimizing diabetes management requires multi-stakeholder collaboration. Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across a diverse range of diabetes tasks remains unproven. In this study, we introduced a framework to train and validate diabetes-specific LLMs. We first developed a comprehensive data processing pipeline that includes data collection, filtering, augmentation and refinement. This approach contributes to creating a high-quality, diabetes-specific dataset, and several evaluation benchmarks entirely from scratch. Utilizing the collected training dataset, we fine-tuned a diabetes-specific LLM family that demonstrated state-of-the-art proficiency in understanding and processing various diabetes tasks compared to other LLMs. Furthermore, clinical studies showed the potential applications of our models in diabetes care, including providing personalized healthcare, assisting medical education, and streamlining clinical tasks. In conclusion, our study introduced a framework to develop and evaluate a diabetes-specific LLM family, and highlighted its potential to enhance clinical practice and provide personalized, data-driven support for diabetes support when facing different end users. The code is provided via GitHub at this https URL.

[AI-53] Cooperative Resilience in Artificial Intelligence Multiagent Systems

链接: https://arxiv.org/abs/2409.13187
作者: Manuela Chacon-Chamorro,Luis Felipe Giraldo,Nicanor Quijano,Vicente Vargas-Panesso,César González,Juan Sebastián Pinzón,Rubén Manrrique,Manuel Ríos,Yesid Fonseca,Daniel Gómez-Barrera,Mónica Perdomo-Pérez
关键词-EN: cooperative resilience, Resilience, Resilience refers, recover from disruptive, disruptive events
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: Supplementary material in this https URL

点击查看摘要

Abstract:Resilience refers to the ability of systems to withstand, adapt to, and recover from disruptive events. While studies on resilience have attracted significant attention across various research domains, the precise definition of this concept within the field of cooperative artificial intelligence remains unclear. This paper addresses this gap by proposing a clear definition of `cooperative resilience’ and outlining a methodology for its quantitative measurement. The methodology is validated in an environment with RL-based and LLM-augmented autonomous agents, subjected to environmental changes and the introduction of agents with unsustainable behaviors. These events are parameterized to create various scenarios for measuring cooperative resilience. The results highlight the crucial role of resilience metrics in analyzing how the collective system prepares for, resists, recovers from, sustains well-being, and transforms in the face of disruptions. These findings provide foundational insights into the definition, measurement, and preliminary analysis of cooperative resilience, offering significant implications for the broader field of AI. Moreover, the methodology and metrics developed here can be adapted to a wide range of AI applications, enhancing the reliability and effectiveness of AI in dynamic and unpredictable environments.

[AI-54] FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model

链接: https://arxiv.org/abs/2409.13180
作者: Feng Qiu,Wei Zhang,Chen Liu,Rudong An,Lincheng Li,Yu Ding,Changjie Fan,Zhipeng Hu,Xin Yu
关键词-EN: facial animation transfer, animation transfer aims, facial animation, animation transfer, facial
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注: 11 pages, 11 figures

点击查看摘要

Abstract:Video-driven 3D facial animation transfer aims to drive avatars to reproduce the expressions of actors. Existing methods have achieved remarkable results by constraining both geometric and perceptual consistency. However, geometric constraints (like those designed on facial landmarks) are insufficient to capture subtle emotions, while expression features trained on classification tasks lack fine granularity for complex emotions. To address this, we propose \textbfFreeAvatar, a robust facial animation transfer method that relies solely on our learned expression representation. Specifically, FreeAvatar consists of two main components: the expression foundation model and the facial animation transfer model. In the first component, we initially construct a facial feature space through a face reconstruction task and then optimize the expression feature space by exploring the similarities among different expressions. Benefiting from training on the amounts of unlabeled facial images and re-collected expression comparison dataset, our model adapts freely and effectively to any in-the-wild input facial images. In the facial animation transfer component, we propose a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency. To make the entire process differentiable, we employ a trained neural renderer to translate rig parameters into corresponding images. Furthermore, unlike previous methods that require separate decoders for each avatar, we propose a dynamic identity injection module that allows for the joint training of multiple avatars within a single network.

[AI-55] Morphology and Behavior Co-Optimization of Modular Satellites for Attitude Control

链接: https://arxiv.org/abs/2409.13166
作者: Yuxing Wang,Jie Li,Cong Yu,Xinyang Li,Simeng Huang,Yongzhe Chang,Xueqian Wang,Bin Liang
关键词-EN: space exploration endeavors, spacecraft engineering, paradigm of flexibility, exploration endeavors, modular satellites marks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: The paper was accepted as an oral presentation by the 75th International Astronautical Congress, Milan, Italy

点击查看摘要

Abstract:The emergence of modular satellites marks a significant transformation in spacecraft engineering, introducing a new paradigm of flexibility, resilience, and scalability in space exploration endeavors. In addressing complex challenges such as attitude control, both the satellite’s morphological architecture and the controller are crucial for optimizing performance. Despite substantial research on optimal control, there remains a significant gap in developing optimized and practical assembly strategies for modular satellites tailored to specific mission constraints. This research gap primarily arises from the inherently complex nature of co-optimizing design and control, a process known for its notorious bi-level optimization loop. Conventionally tackled through artificial evolution, this issue involves optimizing the morphology based on the fitness of individual controllers, which is sample-inefficient and computationally expensive. In this paper, we introduce a novel gradient-based approach to simultaneously optimize both morphology and control for modular satellites, enhancing their performance and efficiency in attitude control missions. Our Monte Carlo simulations demonstrate that this co-optimization approach results in modular satellites with better mission performance compared to those designed by evolution-based approaches. Furthermore, this study discusses potential avenues for future research.

[AI-56] owards Efficient Neuro-Symbolic AI: From Workload Characterization to Hardware Architecture

链接: https://arxiv.org/abs/2409.13153
作者: Zishen Wan,Che-Kai Liu,Hanchen Yang,Ritik Raj,Chaojian Li,Haoran You,Yonggan Fu,Cheng Wan,Sixu Li,Youbin Kim,Ananda Samajdar,Yingyan(Celine)Lin,Mohamed Ibrahim,Jan M. Rabaey,Tushar Krishna,Arijit Raychowdhury
关键词-EN: deep neural networks, unsustainable computational trajectories, surrounding unsustainable computational, facing challenges surrounding, challenges surrounding unsustainable
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 14 pages, 11 figures, 7 tables; IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI), 2024

点击查看摘要

Abstract:The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, are facing challenges surrounding unsustainable computational trajectories, limited robustness, and a lack of explainability. To develop next-generation cognitive AI systems, neuro-symbolic AI emerges as a promising paradigm, fusing neural and symbolic approaches to enhance interpretability, robustness, and trustworthiness, while facilitating learning from much less data. Recent neuro-symbolic systems have demonstrated great potential in collaborative human-AI scenarios with reasoning and cognitive capabilities. In this paper, we aim to understand the workload characteristics and potential architectures for neuro-symbolic AI. We first systematically categorize neuro-symbolic AI algorithms, and then experimentally evaluate and analyze them in terms of runtime, memory, computational operators, sparsity, and system characteristics on CPUs, GPUs, and edge SoCs. Our studies reveal that neuro-symbolic models suffer from inefficiencies on off-the-shelf hardware, due to the memory-bound nature of vector-symbolic and logical operations, complex flow control, data dependencies, sparsity variations, and limited scalability. Based on profiling insights, we suggest cross-layer optimization solutions and present a hardware acceleration case study for vector-symbolic architecture to improve the performance, efficiency, and scalability of neuro-symbolic computing. Finally, we discuss the challenges and potential future directions of neuro-symbolic AI from both system and architectural perspectives.

[AI-57] Learning to Compare Hardware Designs for High-Level Synthesis

链接: https://arxiv.org/abs/2409.13138
作者: Yunsheng Bai,Atefeh Sohrabizadeh,Zijian Ding,Rongjian Liang,Weikai Li,Ding Wang,Haoxing Ren,Yizhou Sun,Jason Cong
关键词-EN: transforms high-level code, transforms high-level, High-level synthesis, high-level code, automated design process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Published in MLCAD 2024

点击查看摘要

Abstract:High-level synthesis (HLS) is an automated design process that transforms high-level code into hardware designs, enabling the rapid development of hardware accelerators. HLS relies on pragmas, which are directives inserted into the source code to guide the synthesis process, and pragmas have various settings and values that significantly impact the resulting hardware design. State-of-the-art ML-based HLS methods, such as HARP, first train a deep learning model, typically based on graph neural networks (GNNs) applied to graph-based representations of the source code and pragmas. They then perform design space exploration (DSE) to explore the pragma design space, rank candidate designs using the model, and return the top designs. However, traditional DSE methods face challenges due to the highly nonlinear relationship between pragma settings and performance metrics, along with complex interactions between pragmas that affect performance in non-obvious ways. To address these challenges, we propose compareXplore, a novel approach that learns to compare hardware designs for effective HLS optimization. CompareXplore introduces a hybrid loss function that combines pairwise preference learning with pointwise performance prediction, enabling the model to capture both relative preferences and absolute performance. Moreover, we introduce a novel node difference attention module that focuses on the most informative differences between designs, enabling the model to identify critical pragmas impacting performance. CompareXplore adopts a two-stage DSE, where a pointwise prediction model is used for the initial design pruning, followed by a pairwise comparison stage for precise performance verification. In extensive experiments, compareXplore achieves significant improvements in ranking metrics and generates high-quality HLS results for the selected designs, outperforming the existing SOTA method. Comments: Published in MLCAD 2024 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2409.13138 [cs.LG] (or arXiv:2409.13138v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13138 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD (MLCAD '24), ACM, 2024, Article 2, 1-7 Related DOI: https://doi.org/10.1145/3670474.3685940 Focus to learn more DOI(s) linking to related resources

[AI-58] Interpret the Predictions of Deep Networks via Re-Label Distillation ICME2021

链接: https://arxiv.org/abs/2409.13137
作者: Yingying Hua,Shiming Ge,Daichi Zhang
关键词-EN: Interpreting the predictions, synthetic images, black-box deep network, facilitate the reliability, deep network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Published by IEEE ICME 2021

点击查看摘要

Abstract:Interpreting the predictions of a black-box deep network can facilitate the reliability of its deployment. In this work, we propose a re-label distillation approach to learn a direct map from the input to the prediction in a self-supervision manner. The image is projected into a VAE subspace to generate some synthetic images by randomly perturbing its latent vector. Then, these synthetic images can be annotated into one of two classes by identifying whether their labels shift. After that, using the labels annotated by the deep network as teacher, a linear student model is trained to approximate the annotations by mapping these synthetic images to the classes. In this manner, these re-labeled synthetic images can well describe the local classification mechanism of the deep network, and the learned student can provide a more intuitive explanation towards the predictions. Extensive experiments verify the effectiveness of our approach qualitatively and quantitatively.

[AI-59] Are Large Language Models Good Essay Graders?

链接: https://arxiv.org/abs/2409.13120
作者: Anindita Kundu,Denilson Barbosa
关键词-EN: Large Language Models, Language Models, Large Language, effectiveness of Large, assessing essay quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:We evaluate the effectiveness of Large Language Models (LLMs) in assessing essay quality, focusing on their alignment with human grading. More precisely, we evaluate ChatGPT and Llama in the Automated Essay Scoring (AES) task, a crucial natural language processing (NLP) application in Education. We consider both zero-shot and few-shot learning and different prompting approaches. We compare the numeric grade provided by the LLMs to human rater-provided scores utilizing the ASAP dataset, a well-known benchmark for the AES task. Our research reveals that both LLMs generally assign lower scores compared to those provided by the human raters; moreover, those scores do not correlate well with those provided by the humans. In particular, ChatGPT tends to be harsher and further misaligned with human evaluations than Llama. We also experiment with a number of essay features commonly used by previous AES methods, related to length, usage of connectives and transition words, and readability metrics, including the number of spelling and grammar mistakes. We find that, generally, none of these features correlates strongly with human or LLM scores. Finally, we report results on Llama 3, which are generally better across the board, as expected. Overall, while LLMs do not seem an adequate replacement for human grading, our results are somewhat encouraging for their use as a tool to assist humans in the grading of written essays in the future.

[AI-60] Evolution and challenges of computer vision and deep learning technologies for analysing mixed construction and demolition waste

链接: https://arxiv.org/abs/2409.13112
作者: Adrian Langley,Matthew Lonergan,Tao Huang,Mostafa Rahimi Azghadi
关键词-EN: enhancing business returns, Improving the automatic, CDW, composition is crucial, business returns
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Improving the automatic and timely recognition of construction and demolition waste (CDW) composition is crucial for enhancing business returns, economic outcomes, and sustainability. Technologies like computer vision, artificial intelligence (AI), robotics, and internet of things (IoT) are increasingly integrated into waste processing to achieve these goals. While deep learning (DL) models show promise in recognising homogeneous CDW piles, few studies assess their performance with mixed, highly contaminated material in commercial settings. Drawing on extensive experience at a CDW materials recovery facility (MRF) in Sydney, Australia, we explore the challenges and opportunities in developing an advanced automated mixed CDW management system. We begin with an overview of the evolution of waste management in the construction industry, highlighting its environmental, economic, and societal impacts. We review various CDW analysis techniques, concluding that DL-based visual methods are the optimal solution. Additionally, we examine the progression of sensor and camera technologies for CDW analysis as well as the evolution of DL algorithms focused on object detection and material segmentation. We also discuss CDW datasets, their curation, and innovative methods for their creation. Finally, we share insights on CDW visual analysis, addressing technical and commercial challenges, research trends, and future directions for mixed CDW analysis. This paper aims to improve the efficiency of CDW management by providing valuable insights for ongoing and future research and development efforts in this critical sector.

[AI-61] ERIC: Estimating Rainfall with Commodity Doorbell Camera for Precision Residential Irrigation

链接: https://arxiv.org/abs/2409.13104
作者: Tian Liu,Liuyi Jin,Radu Stoleru,Amran Haroon,Charles Swanson,Kexin Feng
关键词-EN: nearby weather stations, adjust irrigation amounts, nearby weather, weather stations, stations to adjust
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: BuildSys 2024

点击查看摘要

Abstract:Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed a cost-effective irrigation system, dubbed ERIC, which employs machine learning models to estimate rainfall from commodity doorbell camera footage and optimizes irrigation schedules without human intervention. Specifically, we: a) designed novel visual and audio features with lightweight neural network models to infer rainfall from the camera at the edge, preserving user privacy; b) built a complete end-to-end irrigation system on Raspberry Pi 4, costing only 75. We deployed the system across five locations (collecting over 750 hours of video) with varying backgrounds and light conditions. Comprehensive evaluation validates that ERIC achieves state-of-the-art rainfall estimation performance (~ 5mm/day), saving 9,112 gallons/month of water, translating to 28.56/month in utility savings.

[AI-62] Guided Profile Generation Improves Personalization with LLMs EMNLP2024

链接: https://arxiv.org/abs/2409.13093
作者: Jiarui Zhang
关键词-EN: Large Language Models, modern commercial systems, improving customer experiences, including Recommendation, input into Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: EMNLP 2024 Findings

点击查看摘要

Abstract:In modern commercial systems, including Recommendation, Ranking, and E-Commerce platforms, there is a trend towards improving customer experiences by incorporating Personalization context as input into Large Language Models (LLMs). However, LLMs often struggle to effectively parse and utilize sparse and complex personal context without additional processing or contextual enrichment, underscoring the need for more sophisticated context understanding mechanisms. In this work, we propose Guided Profile Generation (GPG), a general method designed to generate personal profiles in natural language. As is observed, intermediate guided profile generation enables LLMs to summarize, and extract the important, distinctive features from the personal context into concise, descriptive sentences, precisely tailoring their generation more closely to an individual’s unique habits and preferences. Our experimental results show that GPG improves LLM’s personalization ability across different tasks, for example, it increases 37% accuracy in predicting personal preference compared to directly feeding the LLMs with raw personal context.

[AI-63] Interpretable Action Recognition on Hard to Classify Actions ECCV2024

链接: https://arxiv.org/abs/2409.13091
作者: Anastasia Anichenko,Frank Guerin,Andrew Gilbert
关键词-EN: investigate a human-like, model, human-like interpretable model, video understanding, human-like interpretable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, This manuscript has been accepted at the Human-inspired Computer Vision (HCV) ECCV 2024 Workshop. arXiv admin note: text overlap with arXiv:2107.05319

点击查看摘要

Abstract:We investigate a human-like interpretable model of video understanding. Humans recognise complex activities in video by recognising critical spatio-temporal relations among explicitly recognised objects and parts, for example, an object entering the aperture of a container. To mimic this we build on a model which uses positions of objects and hands, and their motions, to recognise the activity taking place. To improve this model we focussed on three of the most confused classes (for this model) and identified that the lack of 3D information was the major problem. To address this we extended our basic model by adding 3D awareness in two ways: (1) A state-of-the-art object detection model was fine-tuned to determine the difference between “Container” and “NotContainer” in order to integrate object shape information into the existing object features. (2) A state-of-the-art depth estimation model was used to extract depth values for individual objects and calculate depth relations to expand the existing relations used our interpretable model. These 3D extensions to our basic model were evaluated on a subset of three superficially similar “Putting” actions from the Something-Something-v2 dataset. The results showed that the container detector did not improve performance, but the addition of depth relations made a significant improvement to performance.

[AI-64] FedAT: Federated Adversarial Training for Distributed Insider Threat Detection

链接: https://arxiv.org/abs/2409.13083
作者: R G Gayathri,Atul Sajjanhar,Md Palash Uddin,Yong Xiang
关键词-EN: Insider Threat Detection, Insider threats, ITD, entity closely, detect insider threats
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Insider threats usually occur from within the workplace, where the attacker is an entity closely associated with the organization. The sequence of actions the entities take on the resources to which they have access rights allows us to identify the insiders. Insider Threat Detection (ITD) using Machine Learning (ML)-based approaches gained attention in the last few years. However, most techniques employed centralized ML methods to perform such an ITD. Organizations operating from multiple locations cannot contribute to the centralized models as the data is generated from various locations. In particular, the user behavior data, which is the primary source of ITD, cannot be shared among the locations due to privacy concerns. Additionally, the data distributed across various locations result in extreme class imbalance due to the rarity of attacks. Federated Learning (FL), a distributed data modeling paradigm, gained much interest recently. However, FL-enabled ITD is not yet explored, and it still needs research to study the significant issues of its implementation in practical settings. As such, our work investigates an FL-enabled multiclass ITD paradigm that considers non-Independent and Identically Distributed (non-IID) data distribution to detect insider threats from different locations (clients) of an organization. Specifically, we propose a Federated Adversarial Training (FedAT) approach using a generative model to alleviate the extreme data skewness arising from the non-IID data distribution among the clients. Besides, we propose to utilize a Self-normalized Neural Network-based Multi-Layer Perceptron (SNN-MLP) model to improve ITD. We perform comprehensive experiments and compare the results with the benchmarks to manifest the enhanced performance of the proposed FedATdriven ITD scheme.

[AI-65] AutoVerus: Automated Proof Generation for Rust Code

链接: https://arxiv.org/abs/2409.13082
作者: Chenyuan Yang,Xuheng Li,Md Rakib Hossain Misu,Jianan Yao,Weidong Cui,Yeyun Gong,Chris Hawblitzel,Shuvendu Lahiri,Jacob R. Lorch,Shuai Lu,Fan Yang,Ziqiao Zhou,Shan Lu
关键词-EN: software engineering tasks, software engineering, Generative, proof, LLM
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:Generative AI has shown its values for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLM to automatically generate correctness proof for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts’ three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proof for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.

[AI-66] Fear and Loathing on the Frontline: Decoding the Language of Othering by Russia-Ukraine War Bloggers

链接: https://arxiv.org/abs/2409.13064
作者: Patrick Gerard,William Theisen,Tim Weninger,Kristina Lerman
关键词-EN: fueling intergroup conflict, existential threats, fueling intergroup, act of portraying, portraying outgroups
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 15 pages

点击查看摘要

Abstract:Othering, the act of portraying outgroups as fundamentally different from the ingroup, often escalates into framing them as existential threats–fueling intergroup conflict and justifying exclusion and violence. These dynamics are alarmingly pervasive, spanning from the extreme historical examples of genocides against minorities in Germany and Rwanda to the ongoing violence and rhetoric targeting migrants in the US and Europe. While concepts like hate speech and fear speech have been explored in existing literature, they capture only part of this broader and more nuanced dynamic which can often be harder to detect, particularly in online speech and propaganda. To address this challenge, we introduce a novel computational framework that leverages large language models (LLMs) to quantify othering across diverse contexts, extending beyond traditional linguistic indicators of hostility. Applying the model to real-world data from Telegram war bloggers and political discussions on Gab reveals how othering escalates during conflicts, interacts with moral language, and garners significant attention, particularly during periods of crisis. Our framework, designed to offer deeper insights into othering dynamics, combines with a rapid adaptation process to provide essential tools for mitigating othering’s adverse impacts on social cohesion.

[AI-67] Comprehensive Overview of Artificial Intelligence Applications in Modern Industries

链接: https://arxiv.org/abs/2409.13059
作者: Yijie Weng,Jianhao Wu,Tara Kelly,William Johnson
关键词-EN: enhancing decision-making processes, Artificial Intelligence, optimizing operations, decision-making processes, opportunities for innovation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) is fundamentally reshaping various industries by enhancing decision-making processes, optimizing operations, and unlocking new opportunities for innovation. This paper explores the applications of AI across four key sectors: healthcare, finance, manufacturing, and retail. Each section delves into the specific challenges faced by these industries, the AI technologies employed to address them, and the measurable impact on business outcomes and societal welfare. We also discuss the implications of AI integration, including ethical considerations, the future trajectory of AI development, and its potential to drive economic growth while posing challenges that need to be managed responsibly.

[AI-68] LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models

链接: https://arxiv.org/abs/2409.13054
作者: Akshaj Kumar Veldanda,Shi-Xiong Zhang,Anirban Das,Supriyo Chakraborty,Stephen Rawls,Sambit Sahu,Milind Naphade
关键词-EN: Large language models, Large language, problematic knowledge embedded, revolutionized various domains, embedded during pretraining
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized various domains, yet their utility comes with significant challenges related to outdated or problematic knowledge embedded during pretraining. This paper addresses the challenge of modifying LLMs to unlearn problematic and outdated information while efficiently integrating new knowledge without retraining from scratch. Here, we propose LLM Surgery, a framework to efficiently modify LLM behaviour by optimizing a three component objective function that: (1) Performs reverse gradient on unlearning dataset (problematic and outdated information), (2) Performs gradient descent on the update dataset (new and updated information), and (3) Minimizes the KL divergence on the retain dataset (small subset of unchanged text), ensuring alignment between pretrained and modified model outputs. Due to the lack of publicly available datasets specifically tailored for our novel task, we compiled a new dataset and an evaluation benchmark. Using Llama2-7B, we demonstrate that LLM Surgery can achieve significant forgetting on the unlearn set, a 20% increase in accuracy on the update set, and maintain performance on the retain set.

[AI-69] HeadCT-ONE: Enabling Granular and Controllable Automated Evaluation of Head CT Radiology Report Generation

链接: https://arxiv.org/abs/2409.13038
作者: Julián N. Acosta,Xiaoman Zhang,Siddhant Dogra,Hong-Yu Zhou,Seyedmehdi Payabvash,Guido J. Falcone,Eric K. Oermann,Pranav Rajpurkar
关键词-EN: Ontology Normalized Evaluation, Ontology Normalized, Head CT Ontology, generation through ontology-normalized, extraction derived metrics
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present Head CT Ontology Normalized Evaluation (HeadCT-ONE), a metric for evaluating head CT report generation through ontology-normalized entity and relation extraction. HeadCT-ONE enhances current information extraction derived metrics (such as RadGraph F1) by implementing entity normalization through domain-specific ontologies, addressing radiological language variability. HeadCT-ONE compares normalized entities and relations, allowing for controllable weighting of different entity types or specific entities. Through experiments on head CT reports from three health systems, we show that HeadCT-ONE’s normalization and weighting approach improves the capture of semantically equivalent reports, better distinguishes between normal and abnormal reports, and aligns with radiologists’ assessment of clinically significant errors, while offering flexibility to prioritize specific aspects of report content. Our results demonstrate how HeadCT-ONE enables more flexible, controllable, and granular automated evaluation of head CT reports.

[AI-70] Cost: A Novel Instance Complexity Based Cost-Sensitive Learning Framework for Imbalanced Classification

链接: https://arxiv.org/abs/2409.13007
作者: Asif Newaz,Asif Ur Rahman Adib,Taskeed Jabid
关键词-EN: data presents significant, presents significant challenges, imbalance in data, data presents, Class imbalance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Class imbalance in data presents significant challenges for classification tasks. It is fairly common and requires careful handling to obtain desirable performance. Traditional classification algorithms become biased toward the majority class. One way to alleviate the scenario is to make the classifiers cost-sensitive. This is achieved by assigning a higher misclassification cost to minority-class instances. One issue with this implementation is that all the minority-class instances are treated equally, and assigned with the same penalty value. However, the learning difficulties of all the instances are not the same. Instances that are located near the decision boundary are harder to classify, whereas those further away are easier. Without taking into consideration the instance complexity and naively weighting all the minority-class samples uniformly, results in an unwarranted bias and consequently, a higher number of misclassifications of the majority-class instances. This is undesirable and to overcome the situation, we propose a novel instance complexity-based cost-sensitive approach in this study. We first categorize all the minority-class instances based on their difficulty level and then the instances are penalized accordingly. This ensures a more equitable instance weighting and prevents excessive penalization. The performance of the proposed approach is tested on 66 imbalanced datasets against the traditional cost-sensitive learning frameworks and a significant improvement in performance is noticeable, demonstrating the effectiveness of our method.

[AI-71] Introducing the Large Medical Model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences

链接: https://arxiv.org/abs/2409.13000
作者: Ricky Sahu,Eric Marriott,Ethan Siegel,David Wagner,Flore Uzan,Troy Yang,Asim Javed
关键词-EN: NHE Fact Sheet, NHE Fact, Fact Sheet, healthcare spending approaching, optimal patient care
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 10 pages, 10 figures

点击查看摘要

Abstract:With U.S. healthcare spending approaching 5T (NHE Fact Sheet 2024), and 25% of it estimated to be wasteful (Waste in the US the health care system: estimated costs and potential for savings, n.d.), the need to better predict risk and optimal patient care is evermore important. This paper introduces the Large Medical Model (LMM), a generative pre-trained transformer (GPT) designed to guide and predict the broad facets of patient care and healthcare administration. The model is trained on medical event sequences from over 140M longitudinal patient claims records with a specialized vocabulary built from medical terminology systems and demonstrates a superior capability to forecast healthcare costs and identify potential risk factors. Through experimentation and validation, we showcase the LMM’s proficiency in not only in cost and risk predictions, but also in discerning intricate patterns within complex medical conditions and an ability to identify novel relationships in patient care. The LMM is able to improve both cost prediction by 14.1% over the best commercial models and chronic conditions prediction by 1.9% over the best transformer models in research predicting a broad set of conditions. The LMM is a substantial advancement in healthcare analytics, offering the potential to significantly enhance risk assessment, cost management, and personalized medicine.

[AI-72] VCAT: Vulnerability-aware and Curiosity-driven Adversarial Training for Enhancing Autonomous Vehicle Robustness

链接: https://arxiv.org/abs/2409.12997
作者: Xuan Cai,Zhiyong Cui,Xuesong Bai,Ruimin Ke,Zhenshu Ma,Haiyang Yu,Yilong Ren
关键词-EN: face significant threats, Autonomous vehicles, face significant, significant threats, safe operation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures, conference

点击查看摘要

Abstract:Autonomous vehicles (AVs) face significant threats to their safe operation in complex traffic environments. Adversarial training has emerged as an effective method of enabling AVs to preemptively fortify their robustness against malicious attacks. Train an attacker using an adversarial policy, allowing the AV to learn robust driving through interaction with this attacker. However, adversarial policies in existing methodologies often get stuck in a loop of overexploiting established vulnerabilities, resulting in poor improvement for AVs. To overcome the limitations, we introduce a pioneering framework termed Vulnerability-aware and Curiosity-driven Adversarial Training (VCAT). Specifically, during the traffic vehicle attacker training phase, a surrogate network is employed to fit the value function of the AV victim, providing dense information about the victim’s inherent vulnerabilities. Subsequently, random network distillation is used to characterize the novelty of the environment, constructing an intrinsic reward to guide the attacker in exploring unexplored territories. In the victim defense training phase, the AV is trained in critical scenarios in which the pretrained attacker is positioned around the victim to generate attack behaviors. Experimental results revealed that the training methodology provided by VCAT significantly improved the robust control capabilities of learning-based AVs, outperforming both conventional training modalities and alternative reinforcement learning counterparts, with a marked reduction in crash rates. The code is available at this https URL.

[AI-73] pyrtklib: An open-source package for tightly coupled deep learning and GNSS integration for positioning in urban canyons

链接: https://arxiv.org/abs/2409.12996
作者: Runzhi Hu,Penghui Xu,Yihan Zhong,Weisong Wen
关键词-EN: Global Navigation Satellite, Navigation Satellite Systems, intelligent transportation systems, Global Navigation, Navigation Satellite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is revolutionizing numerous fields, with increasing applications in Global Navigation Satellite Systems (GNSS) positioning algorithms in intelligent transportation systems (ITS) via deep learning. However, a significant technological disparity exists as traditional GNSS algorithms are often developed in Fortran or C, contrasting with the Python-based implementation prevalent in deep learning tools. To address this discrepancy, this paper introduces pyrtklib, a Python binding for the widely utilized open-source GNSS tool, RTKLIB. This binding makes all RTKLIB functionalities accessible in Python, facilitating seamless integration. Moreover, we present a deep learning subsystem under pyrtklib, which is a novel deep learning framework that leverages pyrtklib to accurately predict weights and biases within the GNSS positioning process. The use of pyrtklib enables developers to easily and quickly prototype and implement deep learning-aided GNSS algorithms, showcasing its potential to enhance positioning accuracy significantly.

[AI-74] Improving generalisability of 3D binding affinity models in low data regimes

链接: https://arxiv.org/abs/2409.12995
作者: Julia Buhmann,Ward Haddadin,Lukáš Pravda,Alan Bilsland,Hagen Triendl
关键词-EN: Predicting protein-ligand binding, computer-aided drug design, Predicting protein-ligand, binding affinity, protein-ligand binding affinity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 10 figues

点击查看摘要

Abstract:Predicting protein-ligand binding affinity is an essential part of computer-aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low data regimes. Despite the evolution of model architectures, current benchmarks are not well-suited to probe the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset, minimizing similarity leakage between train and test sets and allowing for a fair and direct comparison between various model architectures. On this low similarity split, we demonstrate that, in general, 3D global models are superior to protein-specific local models in low data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre-training via quantum mechanical data, unsupervised pre-training via small molecule diffusion, and explicitly modeling hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.

[AI-75] Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML

链接: https://arxiv.org/abs/2409.12994
作者: Chelsea Maria John,Stepan Nassyr,Carolin Penke,Andreas Herten
关键词-EN: specialized hardware accelerators, hardware accelerators designed, efficient model training, machine learning, technologies has driven
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: To be published in Workshop Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC-W '24) (2024)

点击查看摘要

Abstract:The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.

[AI-76] DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

链接: https://arxiv.org/abs/2409.12992
作者: Yang Chen,Yuhang Jia,Shiwan Zhao,Ziyue Jiang,Haoran Li,Jiarong Kang,Yong Qin
关键词-EN: free-text editing continues, unrestricted free-text editing, OOD text scenarios, increasingly prevalent, continues to grow
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce, DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state-of-the-art performance in both in-domain and OOD text scenarios.

[AI-77] Can we only use guideline instead of shot in prompt?

链接: https://arxiv.org/abs/2409.12979
作者: Jiaxiang Chen,Song Wang,Zhucong Li,Wayne Xiong,Lizhen Qu,Zenglin Xu,Yuan Qi
关键词-EN: method implicitly inspires, shot method implicitly, shot method, prompting techniques, few-shot CoT
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Currently, prompting techniques can be mainly divided into two categories:1)shot method implicitly inspires the model to answer the question by mimicing the steps in the given example, e.g., the few-shot CoT. 2) Guideline method explicitly instructs the model to reason by following guidelines, which contains succinct and concise task-specific knowledge. Shot method is prone to difficulties in terms of selection of shots type, the number of shots, and the design of the reasoning steps, so a question arises: can we only use guideline instead of shot in the prompt? To this end, we propose the FGT framework to automatically learn task-specific guidelines from dataset consisting of Feedback, Guideline, and Tree-gather agents. First, the feedback agent is designed to evaluate the outcomes, both right and wrong, of each QA to gather insights guiding more effective optimization strategies. Next, the guideline agent is tasked with deriving guidelines from each piece of feedback and storing them in local memory. Lastly, the tree-gather agent aggregates all guidelines hierarchically through a tree structure, ultimately obtaining all unduplicated guidelines from a global perspective. In addition, we induce the model to generate intermediate processes to ensure the reasoning consistent with the guidelines. Experimental results demonstrate that our approach achieves superior performance across multiple tasks, thereby highlighting the effectiveness of using the guidelines in prompt.

[AI-78] he Era of Foundation Models in Medical Imaging is Approaching : A Scoping Review of the Clinical Value of Large-Scale Generative AI Applications in Radiology

链接: https://arxiv.org/abs/2409.12973
作者: Inwoo Seo,Eunkyoung Bae,Joo-Young Jeon,Young-Sang Yoon,Jiho Cha
关键词-EN: Social problems stemming, Social problems, problems stemming, artificial intelligence, potential solution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 25 pages,3 figures, 4 tables, submitted to NPJ imaging

点击查看摘要

Abstract:Social problems stemming from the shortage of radiologists are intensifying, and artificial intelligence is being highlighted as a potential solution. Recently emerging large-scale generative AI has expanded from large language models (LLMs) to multi-modal models, showing potential to revolutionize the entire process of medical imaging. However, comprehensive reviews on their development status and future challenges are currently lacking. This scoping review systematically organizes existing literature on the clinical value of large-scale generative AI applications by following PCC guidelines. A systematic search was conducted across four databases: PubMed, EMbase, IEEE-Xplore, and Google Scholar, and 15 studies meeting the inclusion/exclusion criteria set by the researchers were reviewed. Most of these studies focused on improving the efficiency of report generation in specific parts of the interpretation process or on translating reports to aid patient understanding, with the latest studies extending to AI applications performing direct interpretations. All studies were quantitatively evaluated by clinicians, with most utilizing LLMs and only three employing multi-modal models. Both LLMs and multi-modal models showed excellent results in specific areas, but none yet outperformed radiologists in diagnostic performance. Most studies utilized GPT, with few using models specialized for the medical imaging domain. This study provides insights into the current state and limitations of large-scale generative AI-based applications in the medical imaging field, offering foundational data and suggesting that the era of medical imaging foundation models is on the horizon, which may fundamentally transform clinical practice in the near future.

[AI-79] RACE: Transformer-based user Representations from Attributed Clickstream Event sequences RECSYS

链接: https://arxiv.org/abs/2409.12972
作者: William Black,Alexander Manlove,Jack Pennington,Andrea Marchini,Ercument Ilhan,Vilda Markeviciute
关键词-EN: intricate browsing patterns, span numerous sessions, period of time, process of researching, making a purchase
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: RecSys Workshop on Recommenders in Tourism (RecTour 2024), October 14th-18th, 2024, co-located with the 18th ACM Conference on Recommender Systems, Bari, Italy

点击查看摘要

Abstract:For users navigating travel e-commerce websites, the process of researching products and making a purchase often results in intricate browsing patterns that span numerous sessions over an extended period of time. The resulting clickstream data chronicle these user journeys and present valuable opportunities to derive insights that can significantly enhance personalized recommendations. We introduce TRACE, a novel transformer-based approach tailored to generate rich user embeddings from live multi-session clickstreams for real-time recommendation applications. Prior works largely focus on single-session product sequences, whereas TRACE leverages site-wide page view sequences spanning multiple user sessions to model long-term engagement. Employing a multi-task learning framework, TRACE captures comprehensive user preferences and intents distilled into low-dimensional representations. We demonstrate TRACE’s superior performance over vanilla transformer and LLM-style architectures through extensive experiments on a large-scale travel e-commerce dataset of real user journeys, where the challenges of long page-histories and sparse targets are particularly prevalent. Visualizations of the learned embeddings reveal meaningful clusters corresponding to latent user states and behaviors, highlighting TRACE’s potential to enhance recommendation systems by capturing nuanced user interactions and preferences

[AI-80] MITHOS: Interactive Mixed Reality Training to Support Professional Socio-Emotional Interactions at Schools

链接: https://arxiv.org/abs/2409.12968
作者: Lara Chehayeb,Chirag Bhuvaneshwara,Manuel Anglet,Bernhard Hilpert,Ann-Kristin Meyer,Dimitra Tsovaltzi,Patrick Gebhard,Antje Biermann,Sinah Auchtor,Nils Lauinger,Julia Knopf,Andreas Kaiser,Fabian Kersting,Gregor Mehlmann,Florian Lingenfelser,Elisabeth André
关键词-EN: Teachers in challenging, challenging conflict situations, shame and self-blame, externalise as anger, feeling of incompetence
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Teachers in challenging conflict situations often experience shame and self-blame, which relate to the feeling of incompetence but may externalise as anger. Sensing mixed signals fails the contingency rule for developing affect regulation and may result in confusion for students about their own emotions and hinder their emotion regulation. Therefore, being able to constructively regulate emotions not only benefits individual experience of emotions but also fosters effective interpersonal emotion regulation and influences how a situation is managed. MITHOS is a system aimed at training teachers’ conflict resolution skills through realistic situative learning opportunities during classroom conflicts. In four stages, MITHOS supports teachers’ socio-emotional self-awareness, perspective-taking and positive regard. It provides: a) a safe virtual environment to train free social interaction and receive natural social feedback from reciprocal student-agent reactions, b) spatial situational perspective taking through an avatar, c) individual virtual reflection guidance on emotional experiences through co-regulation processes, and d) expert feedback on professional behavioural strategies. This chapter presents the four stages and their implementation in a semi-automatic Wizard-of-Oz (WoZ) System. The WoZ system affords collecting data that are used for developing the fully automated hybrid (machine learning and model-based) system, and to validate the underlying psychological and conflict resolution models. We present results validating the approach in terms of scenario realism, as well as a systematic testing of the effects of external avatar similarity on antecedents of self-awareness with behavior similarity. The chapter contributes to a common methodology of conducting interdisciplinary research for human-centered and generalisable XR and presents a system designed to support it.

[AI-81] OpenRANet: Neuralized Spectrum Access by Joint Subcarrier and Power Allocation with Optimization-based Deep Learning

链接: https://arxiv.org/abs/2409.12964
作者: Siya Chen,Chee Wei Tan,Xiangping Zhai,H. Vincent Poor
关键词-EN: next-generation radio access, Open RAN, radio access network, including emerging satellite-terrestrial, future Open RAN
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The next-generation radio access network (RAN), known as Open RAN, is poised to feature an AI-native interface for wireless cellular networks, including emerging satellite-terrestrial systems, making deep learning integral to its operation. In this paper, we address the nonconvex optimization challenge of joint subcarrier and power allocation in Open RAN, with the objective of minimizing the total power consumption while ensuring users meet their transmission data rate requirements. We propose OpenRANet, an optimization-based deep learning model that integrates machine-learning techniques with iterative optimization algorithms. We start by transforming the original nonconvex problem into convex subproblems through decoupling, variable transformation, and relaxation techniques. These subproblems are then efficiently solved using iterative methods within the standard interference function framework, enabling the derivation of primal-dual solutions. These solutions integrate seamlessly as a convex optimization layer within OpenRANet, enhancing constraint adherence, solution accuracy, and computational efficiency by combining machine learning with convex analysis, as shown in numerical experiments. OpenRANet also serves as a foundation for designing resource-constrained AI-native wireless optimization strategies for broader scenarios like multi-cell systems, satellite-terrestrial networks, and future Open RAN deployments with complex power consumption requirements.

[AI-82] DARDA: Domain-Aware Real-Time Dynamic Neural Network Adaptation

链接: https://arxiv.org/abs/2409.09753
作者: Shahriar Rifat,Jonathan Ashdown,Francesco Restuccia
关键词-EN: Deep Neural Networks, Test Time Adaptation, Test Time, Neural Networks, Deep Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test Time Adaptation (TTA) has emerged as a practical solution to mitigate the performance degradation of Deep Neural Networks (DNNs) in the presence of corruption/ noise affecting inputs. Existing approaches in TTA continuously adapt the DNN, leading to excessive resource consumption and performance degradation due to accumulation of error stemming from lack of supervision. In this work, we propose Domain-Aware Real-Time Dynamic Adaptation (DARDA) to address such issues. Our key approach is to proactively learn latent representations of some corruption types, each one associated with a sub-network state tailored to correctly classify inputs affected by that corruption. After deployment, DARDA adapts the DNN to previously unseen corruptions in an unsupervised fashion by (i) estimating the latent representation of the ongoing corruption; (ii) selecting the sub-network whose associated corruption is the closest in the latent space to the ongoing corruption; and (iii) adapting DNN state, so that its representation matches the ongoing corruption. This way, DARDA is more resource efficient and can swiftly adapt to new distributions caused by different corruptions without requiring a large variety of input data. Through experiments with two popular mobile edge devices - Raspberry Pi and NVIDIA Jetson Nano - we show that DARDA reduces energy consumption and average cache memory footprint respectively by 1.74x and 2.64x with respect to the state of the art, while increasing the performance by 10.4%, 5.7% and 4.4% on CIFAR-10, CIFAR-100 and TinyImagenet.

[AI-83] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

链接: https://arxiv.org/abs/2408.04811
作者: Moussa Koulako Bala Doumbouya,Ananjan Nandi,Gabriel Poesia,Davide Ghilardi,Anna Goldie,Federico Bianchi,Dan Jurafsky,Christopher D. Manning
关键词-EN: critical concern due, Large Language Models, resist generating harmful, Large Language, jailbreak attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The safety of Large Language Models (LLMs) remains a critical concern due to a lack of adequate benchmarks for systematically evaluating their ability to resist generating harmful content. Previous efforts towards automated red teaming involve static or templated sets of illicit requests and adversarial prompts which have limited utility given jailbreak attacks’ evolving and composable nature. We propose a novel dynamic benchmark of composable jailbreak attacks to move beyond static datasets and taxonomies of attacks and harms. Our approach consists of three components collectively called h4rm3l: (1) a domain-specific language that formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives, (2) bandit-based few-shot program synthesis algorithms that generate novel attacks optimized to penetrate the safety filters of a target black box LLM, and (3) open-source automated red-teaming software employing the previous two components. We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs. Several of our synthesized attacks are more effective than previously reported ones, with Attack Success Rates exceeding 90% on SOTA closed language models such as claude-3-haiku and GPT4-o. By generating datasets of jailbreak attacks in a unified formal representation, h4rm3l enables reproducible benchmarking and automated red-teaming, contributes to understanding LLM safety limitations, and supports the development of robust defenses in an increasingly LLM-integrated world. Warning: This paper and related research artifacts contain offensive and potentially disturbing prompts and model-generated content. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG) MSC classes: 68 ACMclasses: I.2; I.2.0; I.2.1; I.2.5; I.2.7; K.6.5; K.4.2 Cite as: arXiv:2408.04811 [cs.CR] (or arXiv:2408.04811v2 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2408.04811 Focus to learn more arXiv-issued DOI via DataCite

[AI-84] me and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

链接: https://arxiv.org/abs/2409.13582
作者: Xuanru Zhou,Jiachen Lian,Cheol Jun Cho,Jingwen Liu,Zongli Ye,Jinming Zhang,Brittany Morin,David Baquirin,Jet Vonk,Zoe Ezzes,Zachary Miller,Maria Luisa Gorno Tempini,Gopala Anumanchipalli
关键词-EN: task to detect, problem, detection problem, detect dysfluencies, object detection problem
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Speech dysfluency modeling is a task to detect dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent advancements treat this problem as a time-based object detection problem. In this work, we revisit this problem from a new perspective: tokenizing dysfluencies and modeling the detection problem as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare our proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research endeavors. We open-source these resources for the broader scientific community. The project page is available at this https URL

[AI-85] A Deep Learning Approach for Pixel-level Material Classification via Hyperspectral Imaging

链接: https://arxiv.org/abs/2409.13498
作者: Savvas Sifnaios,George Arvanitakis,Fotios K. Konstantinidis,Georgios Tsimiklis,Angelos Amditis,Panayiotis Frangos
关键词-EN: Recent advancements, impacted various domains, significantly impacted, computer vision, Recent
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 15 figures, 6 equations

点击查看摘要

Abstract:Recent advancements in computer vision, particularly in detection, segmentation, and classification, have significantly impacted various domains. However, these advancements are tied to RGB-based systems, which are insufficient for applications in industries like waste sorting, pharmaceuticals, and defense, where advanced object characterization beyond shape or color is necessary. Hyperspectral (HS) imaging, capturing both spectral and spatial information, addresses these limitations and offers advantages over conventional technologies such as X-ray fluorescence and Raman spectroscopy, particularly in terms of speed, cost, and safety. This study evaluates the potential of combining HS imaging with deep learning for material characterization. The research involves: i) designing an experimental setup with HS camera, conveyor, and controlled lighting; ii) generating a multi-object dataset of various plastics (HDPE, PET, PP, PS) with semi-automated mask generation and Raman spectroscopy-based labeling; and iii) developing a deep learning model trained on HS images for pixel-level material classification. The model achieved 99.94% classification accuracy, demonstrating robustness in color, size, and shape invariance, and effectively handling material overlap. Limitations, such as challenges with black objects, are also discussed. Extending computer vision beyond RGB to HS imaging proves feasible, overcoming major limitations of traditional methods and showing strong potential for future applications. Comments: 13 pages, 15 figures, 6 equations Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) ACMclasses: I.5; I.2.10 Cite as: arXiv:2409.13498 [eess.IV] (or arXiv:2409.13498v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2409.13498 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-86] Differentially Private Multimodal Laplacian Dropout (DP-MLD) for EEG Representative Learning

链接: https://arxiv.org/abs/2409.13440
作者: Xiaowen Fu,Bingxin Wang,Xinzhou Guo,Guoqing Liu,Yang Xiang
关键词-EN: shown great promise, multimodal EEG, EEG, shown great, great promise
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, multimodal electroencephalogram (EEG) learning has shown great promise in disease detection. At the same time, ensuring privacy in clinical studies has become increasingly crucial due to legal and ethical concerns. One widely adopted scheme for privacy protection is differential privacy (DP) because of its clear interpretation and ease of implementation. Although numerous methods have been proposed under DP, it has not been extensively studied for multimodal EEG data due to the complexities of models and signal data considered there. In this paper, we propose a novel Differentially Private Multimodal Laplacian Dropout (DP-MLD) scheme for multimodal EEG learning. Our approach proposes a novel multimodal representative learning model that processes EEG data by language models as text and other modal data by vision transformers as images, incorporating well-designed cross-attention mechanisms to effectively extract and integrate cross-modal features. To achieve DP, we design a novel adaptive feature-level Laplacian dropout scheme, where randomness allocation and performance are dynamically optimized within given privacy budgets. In the experiment on an open-source multimodal dataset of Freezing of Gait (FoG) in Parkinson’s Disease (PD), our proposed method demonstrates an approximate 4% improvement in classification accuracy, and achieves state-of-the-art performance in multimodal EEG learning under DP.

[AI-87] A generalizable framework for unlocking missing reactions in genome-scale metabolic networks using deep learning

链接: https://arxiv.org/abs/2409.13259
作者: Xiaoyi Liu,Hongpeng Yang,Chengwei Ai,Ruihan Dong,Yijie Ding,Qianqian Yuan,Jijun Tang,Fei Guo
关键词-EN: turn impedes advancements, Incomplete knowledge, metabolic processes hinders, processes hinders, hinders the accuracy
类目: Molecular Networks (q-bio.MN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Incomplete knowledge of metabolic processes hinders the accuracy of GEnome-scale Metabolic models (GEMs), which in turn impedes advancements in systems biology and metabolic engineering. Existing gap-filling methods typically rely on phenotypic data to minimize the disparity between computational predictions and experimental results. However, there is still a lack of an automatic and precise gap-filling method for initial state GEMs before experimental data and annotated genomes become available. In this study, we introduce CLOSEgaps, a deep learning-driven tool that addresses the gap-filling issue by modeling it as a hyperedge prediction problem within GEMs. Specifically, CLOSEgaps maps metabolic networks as hypergraphs and learns their hyper-topology features to identify missing reactions and gaps by leveraging hypothetical reactions. This innovative approach allows for the characterization and curation of both known and hypothetical reactions within metabolic networks. Extensive results demonstrate that CLOSEgaps accurately gap-filling over 96% of artificially introduced gaps for various GEMs. Furthermore, CLOSEgaps enhances phenotypic predictions for 24 GEMs and also finds a notable improvement in producing four crucial metabolites (Lactate, Ethanol, Propionate, and Succinate) in two organisms. As a broadly applicable solution for any GEM, CLOSEgaps represents a promising model to automate the gap-filling process and uncover missing connections between reactions and observed metabolic phenotypes.

[AI-88] Emergent Collective Reproduction via Evolving Neuronal Flocks

链接: https://arxiv.org/abs/2409.13254
作者: Nam H. Le,Richard Watson,Mike Levin,Chrys Buckley
关键词-EN: intricately merges self-organization, artificial life framework, reproductive groups, emergence of complex, study facilitates
类目: Populations and Evolution (q-bio.PE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 9 pages, 10 figures, conference

点击查看摘要

Abstract:This study facilitates the understanding of evolutionary transitions in individuality (ETIs) through a novel artificial life framework, named VitaNova, that intricately merges self-organization and natural selection to simulate the emergence of complex, reproductive groups. By dynamically modelling individual agents within an environment that challenges them with predators and spatial constraints, VitaNova elucidates the mechanisms by which simple agents evolve into cohesive units exhibiting collective reproduction. The findings underscore the synergy between self-organized behaviours and adaptive evolutionary strategies as fundamental drivers of ETIs. This approach not only contributes to a deeper understanding of higher-order biological individuality but also sets a new precedent in the empirical investigation of ETIs, challenging and extending current theoretical frameworks.

[AI-89] Unsupervised Attention-Based Multi-Source Domain Adaptation Framework for Drift Compensation in Electronic Nose Systems

链接: https://arxiv.org/abs/2409.13167
作者: Wenwen Zhang,Shuhao Hu,Zhengyuan Zhang,Yuanjin Zheng,Qi Jie Wang,Zhiping Lin
关键词-EN: self-developed E-nose system, reduced gas identification, identification accuracy due, E-nose system, gas identification
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continuous, long-term monitoring of hazardous, noxious, explosive, and flammable gases in industrial environments using electronic nose (E-nose) systems faces the significant challenge of reduced gas identification accuracy due to time-varying drift in gas sensors. To address this issue, we propose a novel unsupervised attention-based multi-source domain shared-private feature fusion adaptation (AMDS-PFFA) framework for gas identification with drift compensation in E-nose systems. The AMDS-PFFA model effectively leverages labeled data from multiple source domains collected during the initial stage to accurately identify gases in unlabeled gas sensor array drift signals from the target domain. To validate the model’s effectiveness, extensive experimental evaluations were conducted using both the University of California, Irvine (UCI) standard drift gas dataset, collected over 36 months, and drift signal data from our self-developed E-nose system, spanning 30 months. Compared to recent drift compensation methods, the AMDS-PFFA model achieves the highest average gas recognition accuracy with strong convergence, attaining 83.20% on the UCI dataset and 93.96% on data from our self-developed E-nose system across all target domain batches. These results demonstrate the superior performance of the AMDS-PFFA model in gas identification with drift compensation, significantly outperforming existing methods.

[AI-90] he Impact of Feature Embedding Placement in the Ansatz of a Quantum Kernel in QSVMs

链接: https://arxiv.org/abs/2409.13147
作者: Ilmo Salmenperä,Ilmars Kuhtarskis,Arianne Meijer van de Griend,Jukka K. Nurminen
关键词-EN: classical machine learning, machine learning models, Quantum Embedding Kernels, quantum kernels called, kernels called Quantum
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: 9 pages including references and appendix, 7 figures

点击查看摘要

Abstract:Designing a useful feature map for a quantum kernel is a critical task when attempting to achieve an advantage over classical machine learning models. The choice of circuit architecture, i.e. how feature-dependent gates should be interwoven with other gates is a relatively unexplored problem and becomes very important when using a model of quantum kernels called Quantum Embedding Kernels (QEK). We study and categorize various architectural patterns in QEKs and show that existing architectural styles do not behave as the literature supposes. We also produce a novel alternative architecture based on the old ones and show that it performs equally well while containing fewer gates than its older counterparts.

[AI-91] Personalized 2D Binary Patient Codes of Tissue Images and Immunogenomic Data Through Multimodal Self-Supervised Fusion

链接: https://arxiv.org/abs/2409.13115
作者: Areej Alsaafin,Abubakr Shafique,Saghir Alfasly,H.R.Tizhoosh
关键词-EN: offering promising avenues, enhancing patient care, artificial intelligence, offering promising, disease comprehension
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The field of medical diagnostics has witnessed a transformative convergence of artificial intelligence (AI) and healthcare data, offering promising avenues for enhancing patient care and disease comprehension. However, this integration of multimodal data, specifically histopathology whole slide images (WSIs) and genetic sequencing data, presents unique challenges due to modality disparities and the need for scalable computational solutions. This paper addresses the scarcity of multimodal solutions, primarily centered around unimodal data solutions, thus limiting the realization of the rich insights that can be derived from integrating images and genomic data. Here, we introduce MarbliX Multimodal Association and Retrieval with Binary Latent Indexed matriX,'' an innovative multimodal framework that integrates histopathology images with immunogenomic sequencing data, encapsulating them into a concise binary patient code, referred to as monogram.‘’ This binary representation facilitates the establishment of a comprehensive archive, enabling clinicians to match similar cases. The experimental results demonstrate the potential of MarbliX to empower healthcare professionals with in-depth insights, leading to more precise diagnoses, reduced variability, and expanded personalized treatment options, particularly in the context of cancer.

[AI-92] DenoMamba: A fused state-space model for low-dose CT denoising

链接: https://arxiv.org/abs/2409.13094
作者: Şaban Öztürk,Oğuz Can Duran,Tolga Çukur
关键词-EN: Low-dose computed tomography, lower potential risks, potential risks linked, Low-dose computed, advanced denoising algorithms
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) lower potential risks linked to radiation exposure while relying on advanced denoising algorithms to maintain diagnostic quality in reconstructed images. The reigning paradigm in LDCT denoising is based on neural network models that learn data-driven image priors to separate noise evoked by dose reduction from underlying tissue signals. Naturally, the fidelity of these priors depend on the model’s ability to capture the broad range of contextual features evident in CT images. Earlier convolutional neural networks (CNN) are highly adept at efficiently capturing short-range spatial context, but their limited receptive fields reduce sensitivity to interactions over longer distances. Although transformers based on self-attention mechanisms have recently been posed to increase sensitivity to long-range context, they can suffer from suboptimal performance and efficiency due to elevated model complexity, particularly for high-resolution CT images. For high-quality restoration of LDCT images, here we introduce DenoMamba, a novel denoising method based on state-space modeling (SSM), that efficiently captures short- and long-range context in medical images. Following an hourglass architecture with encoder-decoder stages, DenoMamba employs a spatial SSM module to encode spatial context and a novel channel SSM module equipped with a secondary gated convolution network to encode latent features of channel context at each stage. Feature maps from the two modules are then consolidated with low-level input features via a convolution fusion module (CFM). Comprehensive experiments on LDCT datasets with 25% and 10% dose reduction demonstrate that DenoMamba outperforms state-of-the-art denoisers with average improvements of 1.4dB PSNR, 1.1% SSIM, and 1.6% RMSE in recovered image quality.

[AI-93] Hyperbolic Brain Representations

链接: https://arxiv.org/abs/2409.12990
作者: Alexander Joseph,Nathan Francis,Meijke Balay
关键词-EN: Artificial neural networks, artificial intelligence, hyperbolic geometry, Artificial neural, Artificial
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Artificial neural networks (ANN) were inspired by the architecture and functions of the human brain and have revolutionised the field of artificial intelligence (AI). Inspired by studies on the latent geometry of the brain we posit that an increase in the research and application of hyperbolic geometry in machine learning will lead to increased accuracy, improved feature space representations and more efficient models across a range of tasks. We look at the structure and functions of the human brain, highlighting the alignment between the brain’s hierarchical nature and hyperbolic geometry. By examining the brain’s complex network of neuron connections and its cognitive processes, we illustrate how hyperbolic geometry plays a pivotal role in human intelligence. Empirical evidence indicates that hyperbolic neural networks outperform Euclidean models for tasks including natural language processing, computer vision and complex network analysis, requiring fewer parameters and exhibiting better generalisation. Despite its nascent adoption, hyperbolic geometry holds promise for improving machine learning models and advancing the field toward AGI.

计算机视觉

[CV-0] Colorful Diffuse Intrinsic Image Decomposition in the Wild SIGGRAPH

链接: https://arxiv.org/abs/2409.13690
作者: Chris Careaga,Yağız Aksoy
关键词-EN: image decomposition aims, Intrinsic image decomposition, decomposition aims, surface reflectance, image editing applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 12 figures. Accepted to SIGGRAPH Asia 2024 (TOG). Webpage: this https URL

点击查看摘要

Abstract:Intrinsic image decomposition aims to separate the surface reflectance and the effects from the illumination given a single photograph. Due to the complexity of the problem, most prior works assume a single-color illumination and a Lambertian world, which limits their use in illumination-aware image editing applications. In this work, we separate an input image into its diffuse albedo, colorful diffuse shading, and specular residual components. We arrive at our result by gradually removing first the single-color illumination and then the Lambertian-world assumptions. We show that by dividing the problem into easier sub-problems, in-the-wild colorful diffuse shading estimation can be achieved despite the limited ground-truth datasets. Our extended intrinsic model enables illumination-aware analysis of photographs and can be used for image editing applications such as specularity removal and per-pixel white balancing.

[CV-1] mporally Aligned Audio for Video with Autoregression ICASSP2025

链接: https://arxiv.org/abs/2409.13689
作者: Ilpo Viertola,Vladimir Iashin,Esa Rahtu
关键词-EN: temporal alignment, achieve high temporal, high temporal alignment, precise temporal alignment, alignment
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP 2025. Project page https://v-aura.notion.site

点击查看摘要

Abstract:We introduce V-AURA, the first autoregressive model to achieve high temporal alignment and relevance in video-to-audio generation. V-AURA uses a high-framerate visual feature extractor and a cross-modal audio-visual feature fusion strategy to capture fine-grained visual motion events and ensure precise temporal alignment. Additionally, we propose VisualSound, a benchmark dataset with high audio-visual relevance. VisualSound is based on VGGSound, a video dataset consisting of in-the-wild samples extracted from YouTube. During the curation, we remove samples where auditory events are not aligned with the visual ones. V-AURA outperforms current state-of-the-art models in temporal alignment and semantic relevance while maintaining comparable audio quality. Code, samples, VisualSound and models are available at https://v-aura.notion.site

[CV-2] Morphological Detection and Classification of Microplastics and Nanoplastics Emerged from Consumer Products by Deep Learning

链接: https://arxiv.org/abs/2409.13688
作者: Hadi Rezvani,Navid Zarrabi,Ishaan Mehta,Christopher Kolios,Hussein Ali Jaafar,Cheng-Hao Kao,Sajad Saeedi,Nariman Yousefi
关键词-EN: escalating global issue, Plastic pollution presents, global issue, impacting health, environmental systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Plastic pollution presents an escalating global issue, impacting health and environmental systems, with micro- and nanoplastics found across mediums from potable water to air. Traditional methods for studying these contaminants are labor-intensive and time-consuming, necessitating a shift towards more efficient technologies. In response, this paper introduces micro- and nanoplastics (MiNa), a novel and open-source dataset engineered for the automatic detection and classification of micro and nanoplastics using object detection algorithms. The dataset, comprising scanning electron microscopy images simulated under realistic aquatic conditions, categorizes plastics by polymer type across a broad size spectrum. We demonstrate the application of state-of-the-art detection algorithms on MiNa, assessing their effectiveness and identifying the unique challenges and potential of each method. The dataset not only fills a critical gap in available resources for microplastic research but also provides a robust foundation for future advancements in the field.

[CV-3] A Bottom-Up Approach to Class-Agnostic Image Segmentation

链接: https://arxiv.org/abs/2409.13687
作者: Sebastian Dille,Ari Blondal,Sylvain Paris,Yağız Aksoy
关键词-EN: image editing workflows, automating image editing, involves interactive tools, selection traditionally involves, traditionally involves interactive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Class-agnostic image segmentation is a crucial component in automating image editing workflows, especially in contexts where object selection traditionally involves interactive tools. Existing methods in the literature often adhere to top-down formulations, following the paradigm of class-based approaches, where object detection precedes per-object segmentation. In this work, we present a novel bottom-up formulation for addressing the class-agnostic segmentation problem. We supervise our network directly on the projective sphere of its feature space, employing losses inspired by metric learning literature as well as losses defined in a novel segmentation-space representation. The segmentation results are obtained through a straightforward mean-shift clustering of the estimated features. Our bottom-up formulation exhibits exceptional generalization capability, even when trained on datasets designed for class-based segmentation. We further showcase the effectiveness of our generic approach by addressing the challenging task of cell and nucleus segmentation. We believe that our bottom-up formulation will offer valuable insights into diverse segmentation challenges in the literature.

[CV-4] SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience

链接: https://arxiv.org/abs/2409.13678
作者: Elliot Chane-Sane,Joseph Amigo,Thomas Flayols,Ludovic Righetti,Nicolas Mansard
关键词-EN: limited sensory inputs, requiring navigation, sensory inputs, poses a significant, significant challenge
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CoRL 2024. Project website: this https URL

点击查看摘要

Abstract:Parkour poses a significant challenge for legged robots, requiring navigation through complex environments with agility and precision based on limited sensory inputs. In this work, we introduce a novel method for training end-to-end visual policies, from depth pixels to robot control commands, to achieve agile and safe quadruped locomotion. We formulate robot parkour as a constrained reinforcement learning (RL) problem designed to maximize the emergence of agile skills within the robot’s physical limits while ensuring safety. We first train a policy without vision using privileged information about the robot’s surroundings. We then generate experience from this privileged policy to warm-start a sample efficient off-policy RL algorithm from depth images. This allows the robot to adapt behaviors from this privileged experience to visual locomotion while circumventing the high computational costs of RL directly from pixels. We demonstrate the effectiveness of our method on a real Solo-12 robot, showcasing its capability to perform a variety of parkour skills such as walking, climbing, leaping, and crawling.

[CV-5] V3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians

链接: https://arxiv.org/abs/2409.13648
作者: Penghao Wang,Zhirui Zhang,Liao Wang,Kaixin Yao,Siyuan Xie,Jingyi Yu,Minye Wu,Lan Xu
关键词-EN: Experiencing high-fidelity volumetric, Experiencing high-fidelity, high-fidelity volumetric video, long-held dream, Viewing Volumetric Videos
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Experiencing high-fidelity volumetric video as seamlessly as 2D videos is a long-held dream. However, current dynamic 3DGS methods, despite their high rendering quality, face challenges in streaming on mobile devices due to computational and bandwidth constraints. In this paper, we introduce V\textsuperscript3(Viewing Volumetric Videos), a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. Additionally, we propose a two-stage training strategy to reduce storage requirements with rapid training speed. The first stage employs hash encoding and shallow MLP to learn motion, then reduces the number of Gaussians through pruning to meet the streaming requirements, while the second stage fine tunes other Gaussian attributes using residual entropy loss and temporal loss to improve temporal continuity. This strategy, which disentangles motion and appearance, maintains high rendering quality with compact storage requirements. Meanwhile, we designed a multi-platform player to decode and render 2D Gaussian videos. Extensive experiments demonstrate the effectiveness of V\textsuperscript3, outperforming other methods by enabling high-quality rendering and streaming on common devices, which is unseen before. As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience, including smooth scrolling and instant sharing. Our project page with source code is available at this https URL.

[CV-6] Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning EMNLP2024

链接: https://arxiv.org/abs/2409.13641
作者: Daniele Rege Cambrin,Giuseppe Gallipoli,Irene Benedetto,Luca Cagliero,Paolo Garza
关键词-EN: Large Language Models, demonstrated impressive performance, Large Language, demonstrated impressive, Large
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in EMNLP 2024 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance across various tasks. However, current training approaches combine standard cross-entropy loss with extensive data, human feedback, or ad hoc methods to enhance performance. These solutions are often not scalable or feasible due to their associated costs, complexity, or resource requirements. This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine-tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross-Entropy loss represents a sub-optimal choice, while models trained to minimize alternative (task-dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.

[CV-7] Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation

链接: https://arxiv.org/abs/2409.13637
作者: Sen Lei,Xinyu Xiao,Heng-Chao Li,Zhenwei Shi,Qing Zhu
关键词-EN: assign pixel-wise labels, referring remote sensing, aims to identify, remote sensing image, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Given a language expression, referring remote sensing image segmentation (RRSIS) aims to identify the ground objects and assign pixel-wise labels within the imagery. The one of key challenges for this task is to capture discriminative multi-modal features via text-image alignment. However, the existing RRSIS methods use one vanilla and coarse alignment, where the language expression is directly extracted to be fused with the visual features. In this paper, we argue that a “fine-grained image-text alignment” can improve the extraction of multi-modal information. To this point, we here proposed a new referring remote sensing image segmentation method, termed FIANet, that fully exploits the visual and linguistic representations. Specifically, the original referring expression is regarded as context text, which is further decoupled into ground object text and spatial position text. The proposed fine-grained image-text alignment module (FIAM) would simultaneously leverage the features of the input image and the corresponding texts and learn better discriminative multi-modal representation. Meanwhile, to handle the various scales of ground objects in remote sensing, we introduce a Text-aware Multi-scale Enhancement Module (TMEM) to adaptively perform cross-scale fusion and intersections. We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D, and our method obtains superior performance over several state-of-the-art methods. The code will be publicly available.

[CV-8] FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs

链接: https://arxiv.org/abs/2409.13612
作者: Bowen Yan,Zhengsong Zhang,Liqiang Jing,Eftekhar Hossain,Xinya Du
关键词-EN: Large Vision-Language Models, widespread hallucination issues, development of Large, Large Vision-Language, assessments increasingly vital
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid development of Large Vision-Language Models (LVLMs) often comes with widespread hallucination issues, making cost-effective and comprehensive assessments increasingly vital. Current approaches mainly rely on costly annotations and are not comprehensive – in terms of evaluating all aspects such as relations, attributes, and dependencies between aspects. Therefore, we introduce the FIHA (autonomous Fine-graIned Hallucination evAluation evaluation in LVLMs), which could access hallucination LVLMs in the LLM-free and annotation-free way and model the dependency between different types of hallucinations. FIHA can generate QA pairs on any image dataset at minimal cost, enabling hallucination assessment from both image and caption. Based on this approach, we introduce a benchmark called FIHA-v1, which consists of diverse questions on various images from MSCOCO and Foggy. Furthermore, we use the Davidson Scene Graph (DSG) to organize the structure among QA pairs, in which we can increase the reliability of the evaluation. We evaluate representative models using FIHA-v1, highlighting their limitations and challenges. We released our code and data.

[CV-9] MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension EMNLP2024

链接: https://arxiv.org/abs/2409.13609
作者: Ting Liu,Zunnan Xu,Yue Hu,Liangtao Shi,Zhiqiang Wang,Quanjun Yin
关键词-EN: Referring Expression Comprehension, Referring Expression, Expression Comprehension, local visual region, natural language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024

点击查看摘要

Abstract:Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by a aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.

[CV-10] owards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder

链接: https://arxiv.org/abs/2409.13606
作者: Aditya Kommineni,Digbalay Bose,Tiantian Feng,So Hyun Kim,Helen Tager-Flusberg,Somer Bishop,Catherine Lord,Sudarsana Kadiri,Shrikanth Narayanan
关键词-EN: Autism Spectrum Disorder, encompassing complex verbal, Autism Spectrum, Spectrum Disorder, encompassing complex
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Clinical videos in the context of Autism Spectrum Disorder are often long-form interactions between children and caregivers/clinical professionals, encompassing complex verbal and non-verbal behaviors. Objective analyses of these videos could provide clinicians and researchers with nuanced insights into the behavior of children with Autism Spectrum Disorder. Manually coding these videos is a time-consuming task and requires a high level of domain expertise. Hence, the ability to capture these interactions computationally can augment the manual effort and enable supporting the diagnostic procedure. In this work, we investigate the use of foundation models across three modalities: speech, video, and text, to analyse child-focused interaction sessions. We propose a unified methodology to combine multiple modalities by using large language models as reasoning agents. We evaluate their performance on two tasks with different information granularity: activity recognition and abnormal behavior detection. We find that the proposed multimodal pipeline provides robustness to modality-specific limitations and improves performance on the clinical video analysis compared to unimodal settings.

[CV-11] MeLIAD: Interpretable Few-Shot Anomaly Detection with Metric Learning and Entropy-based Scoring

链接: https://arxiv.org/abs/2409.13602
作者: Eirini Cholopoulou,Dimitris K. Iakovidis
关键词-EN: automating quality inspection, detecting defective products, plays a pivotal, quality inspection, pivotal role
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Anomaly detection (AD) plays a pivotal role in multimedia applications for detecting defective products and automating quality inspection. Deep learning (DL) models typically require large-scale annotated data, which are often highly imbalanced since anomalies are usually scarce. The black box nature of these models prohibits them from being trusted by users. To address these challenges, we propose MeLIAD, a novel methodology for interpretable anomaly detection, which unlike the previous methods is based on metric learning and achieves interpretability by design without relying on any prior distribution assumptions of true anomalies. MeLIAD requires only a few samples of anomalies for training, without employing any augmentation techniques, and is inherently interpretable, providing visualizations that offer insights into why an image is identified as anomalous. This is achieved by introducing a novel trainable entropy-based scoring component for the identification and localization of anomalous instances, and a novel loss function that jointly optimizes the anomaly scoring component with a metric learning objective. Experiments on five public benchmark datasets, including quantitative and qualitative evaluation of interpretability, demonstrate that MeLIAD achieves improved anomaly detection and localization performance compared to state-of-the-art methods.

[CV-12] YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models EMNLP2024

链接: https://arxiv.org/abs/2409.13592
作者: Abhilash Nandy,Yash Agarwal,Ashish Patwa,Millon Madhur Das,Aman Bansal,Ankit Raj,Pawan Goyal,Niloy Ganguly
关键词-EN: Understanding satire, Satirical Image Detection, current Vision-Language models, satire and humor, Image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: EMNLP 2024 Main (Long), 18 pages, 14 figures, 12 tables

点击查看摘要

Abstract:Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot Settings w.r.t both automated as well as human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at this https URL.

[CV-13] Portrait Video Editing Empowered by Multimodal Generative Priors SIGGRAPH

链接: https://arxiv.org/abs/2409.13591
作者: Xuan Gao,Haiyao Xiao,Chenglai Zhong,Shimin Hu,Yudong Guo,Juyong Zhang
关键词-EN: powerful portrait video, introduce PortraitGen, portrait video editing, Neural Gaussian Texture, portrait video
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Accepted by SIGGRAPH Asia 2024. Project Page: this https URL

点击查看摘要

Abstract:We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts. Traditional portrait video editing methods often struggle with 3D and temporal consistency, and typically lack in rendering quality and efficiency. To address these issues, we lift the portrait video frames to a unified dynamic 3D Gaussian field, which ensures structural and temporal coherence across frames. Furthermore, we design a novel Neural Gaussian Texture mechanism that not only enables sophisticated style editing but also achieves rendering speed over 100FPS. Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models. Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates. Extensive experiments demonstrate the temporal consistency, editing efficiency, and superior rendering quality of our method. The broad applicability of the proposed approach is demonstrated through various applications, including text-driven editing, image-driven editing, and relighting, highlighting its great potential to advance the field of video editing. Demo videos and released code are provided in our project page: this https URL

[CV-14] Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt

链接: https://arxiv.org/abs/2409.13576
作者: Xingtao Lin,Heqian Qiu,Lanxiao Wang,RUihang Wang,Linfeng XU,Hongliang Li
关键词-EN: Contrastive Language-Image Pre-trained, scene text detection, successfully adapted large-scale, adapted large-scale models, Recent advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in prompt tuning have successfully adapted large-scale models like Contrastive Language-Image Pre-trained (CLIP) for downstream tasks such as scene text detection. Typically, text prompt complements the text encoder’s input, focusing on global features while neglecting fine-grained details, leading to fine-grained text being ignored in task of scene text detection. In this paper, we propose the region prompt tuning (RPT) method for fine-grained scene text detection, where region text prompt proposed would help focus on fine-grained features. Region prompt tuning method decomposes region text prompt into individual characters and splits visual feature map into region visual tokens, creating a one-to-one correspondence between characters and tokens. This allows a character matches the local features of a token, thereby avoiding the omission of detailed features and fine-grained text. To achieve this, we introduce a sharing position embedding to link each character with its corresponding token and employ a bidirectional distance loss to align each region text prompt character with the target ``text’'. To refine the information at fine-grained level, we implement character-token level interactions before and after encoding. Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching, producing a final score map that could balance the global and local features and be fed into DBNet to detect the text. Experiments on benchmarks like ICDAR2015, TotalText, and CTW1500 demonstrate RPT impressive performance, underscoring its effectiveness for scene text detection.

[CV-15] ackling fluffy clouds: field boundaries detection using time series of S2 and/or S1 imagery

链接: https://arxiv.org/abs/2409.13568
作者: Foivos I. Diakogiannis,Zheng-Shu Zhou,Jeff Wang,Gonzalo Mata,Dave Henry,Roger Lawes,Amy Parker,Peter Caccetta,Rodrigo Ibata,Ondrej Hlinka,Jonathan Richetti,Kathryn Batchelor,Chris Herrmann,Andrew Toovey,John Taylor
关键词-EN: Accurate field boundary, field boundary delineation, digital agriculture, resource management, Accurate field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: under review

点击查看摘要

Abstract:Accurate field boundary delineation is a critical challenge in digital agriculture, impacting everything from crop monitoring to resource management. Existing methods often struggle with noise and fail to generalize across varied landscapes, particularly when dealing with cloud cover in optical remote sensing. In response, this study presents a new approach that leverages time series data from Sentinel-2 (S2) and Sentinel-1 (S1) imagery to improve performance under diverse cloud conditions, without the need for manual cloud filtering. We introduce a 3D Vision Transformer architecture specifically designed for satellite image time series, incorporating a memory-efficient attention mechanism. Two models are proposed: PTAViT3D, which handles either S2 or S1 data independently, and PTAViT3D-CA, which fuses both datasets to enhance accuracy. Both models are evaluated under sparse and dense cloud coverage by exploiting spatio-temporal correlations. Our results demonstrate that the models can effectively delineate field boundaries, even with partial (S2 or S2 and S1 data fusion) or dense cloud cover (S1), with the S1-based model providing performance comparable to S2 imagery in terms of spatial resolution. A key strength of this approach lies in its capacity to directly process cloud-contaminated imagery by leveraging spatio-temporal correlations in a memory-efficient manner. This methodology, used in the ePaddocks product to map Australia’s national field boundaries, offers a robust, scalable solution adaptable to varying agricultural environments, delivering precision and reliability where existing methods falter. Our code is available at this https URL.

[CV-16] Efficient Visualization of Neural Networks with Generative Models and Adversarial Perturbations

链接: https://arxiv.org/abs/2409.13559
作者: Athanasios Karagounis
关键词-EN: offering an improvement, paper presents, approach for deep, improvement over existing, generative network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 4 pages, 3 figures

点击查看摘要

Abstract:This paper presents a novel approach for deep visualization via a generative network, offering an improvement over existing methods. Our model simplifies the architecture by reducing the number of networks used, requiring only a generator and a discriminator, as opposed to the multiple networks traditionally involved. Additionally, our model requires less prior training knowledge and uses a non-adversarial training process, where the discriminator acts as a guide rather than a competitor to the generator. The core contribution of this work is its ability to generate detailed visualization images that align with specific class labels. Our model incorporates a unique skip-connection-inspired block design, which enhances label-directed image generation by propagating class information across multiple layers. Furthermore, we explore how these generated visualizations can be utilized as adversarial examples, effectively fooling classification networks with minimal perceptible modifications to the original images. Experimental results demonstrate that our method outperforms traditional adversarial example generation techniques in both targeted and non-targeted attacks, achieving up to a 94.5% fooling rate with minimal perturbation. This work bridges the gap between visualization methods and adversarial examples, proposing that fooling rate could serve as a quantitative measure for evaluating visualization quality. The insights from this study provide a new perspective on the interpretability of neural networks and their vulnerabilities to adversarial attacks.

[CV-17] rustworthy Hate Speech Detection Through Visual Augmentation

链接: https://arxiv.org/abs/2409.13557
作者: Ziyuan Yang,Ming Yan,Yingyu Chen,Hui Wang,Zexin Lu,Yi Zhang
关键词-EN: social media platforms, media platforms poses, hate speech, hate speech detection, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The surge of hate speech on social media platforms poses a significant challenge, with hate speech detection~(HSD) becoming increasingly critical. Current HSD methods focus on enriching contextual information to enhance detection performance, but they overlook the inherent uncertainty of hate speech. We propose a novel HSD method, named trustworthy hate speech detection method through visual augmentation (TrusV-HSD), which enhances semantic information through integration with diffused visual images and mitigates uncertainty with trustworthy loss. TrusV-HSD learns semantic representations by effectively extracting trustworthy information through multi-modal connections without paired data. Our experiments on public HSD datasets demonstrate the effectiveness of TrusV-HSD, showing remarkable improvements over conventional methods.

[CV-18] A preliminary study on continual learning in computer vision using Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2409.13550
作者: Alessandro Cacciatore,Valerio Morelli,Federica Paganica,Emanuele Frontoni,Lucia Migliorelli,Daniele Berardini
关键词-EN: Deep learning, multi-layer perceptrons, long been dominated, dominated by multi-layer, demonstrated superiority
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has long been dominated by multi-layer perceptrons (MLPs), which have demonstrated superiority over other optimizable models in various domains. Recently, a new alternative to MLPs has emerged - Kolmogorov-Arnold Networks (KAN)- which are based on a fundamentally different mathematical framework. According to their authors, KANs address several major issues in MLPs, such as catastrophic forgetting in continual learning scenarios. However, this claim has only been supported by results from a regression task on a toy 1D dataset. In this paper, we extend the investigation by evaluating the performance of KANs in continual learning tasks within computer vision, specifically using the MNIST datasets. To this end, we conduct a structured analysis of the behavior of MLPs and two KAN-based models in a class-incremental learning scenario, ensuring that the architectures involved have the same number of trainable parameters. Our results demonstrate that an efficient version of KAN outperforms both traditional MLPs and the original KAN implementation. We further analyze the influence of hyperparameters in MLPs and KANs, as well as the impact of certain trainable parameters in KANs, such as bias and scale weights. Additionally, we provide a preliminary investigation of recent KAN-based convolutional networks and compare their performance with that of traditional convolutional neural networks. Our codes can be found at this https URL.

[CV-19] FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

链接: https://arxiv.org/abs/2409.13540
作者: Jing Hao,Yuxiang Zhao,Song Chen,Yanpeng Sun,Qiang Chen,Gang Zhang,Kun Yao,Errui Ding,Jingdong Wang
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: this https URL

[CV-20] First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

链接: https://arxiv.org/abs/2409.13538
作者: Yingzhe Peng,Yixiao Yuan,Zitian Ao,Huapeng Zhou,Kangqi Wang,Qipeng Zhu,Xu Yang
关键词-EN: Video Question Answering, Multiple-choice Video Question, Perception Test Challenge, Question Answering, Multiple-choice Video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tune it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.

[CV-21] Formula-Supervised Visual-Geometric Pre-training ECCV2024

链接: https://arxiv.org/abs/2409.13535
作者: Ryosuke Yamada,Kensho Hara,Hirokatsu Kataoka,Koshi Makihara,Nakamasa Inoue,Rio Yokota,Yutaka Satoh
关键词-EN: point clouds, images and point, unified transformer model, computer vision, object recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV2024

点击查看摘要

Abstract:Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP’s superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at this https URL.

[CV-22] Boosting Federated Domain Generalization: The Role of Advanced Pre-Trained Architectures

链接: https://arxiv.org/abs/2409.13527
作者: Avi Deb Raha,Apurba Adhikary,Mrityunjoy Gain,Yu Qiao,Choong Seon Hong
关键词-EN: Federated Domain Generalization, Vision Transformers, Swin Transformers, enhancing Federated Domain, Domain Generalization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this study, we explore the efficacy of advanced pre-trained architectures, such as Vision Transformers (ViT), ConvNeXt, and Swin Transformers in enhancing Federated Domain Generalization. These architectures capture global contextual features and model long-range dependencies, making them promising candidates for improving cross-domain generalization. We conduct a broad study with in-depth analysis and systematically evaluate different variants of these architectures, using extensive pre-training datasets such as ImageNet-1K, ImageNet-21K, JFT-300M, and ImageNet-22K. Additionally, we compare self-supervised and supervised pre-training strategies to assess their impact on FDG performance. Our findings suggest that self-supervised techniques, which focus on reconstructing masked image patches, can better capture the intrinsic structure of images, thereby outperforming their supervised counterparts. Comprehensive evaluations on the Office-Home and PACS datasets demonstrate that adopting advanced architectures pre-trained on larger datasets establishes new benchmarks, achieving average accuracies of 84.46% and 92.55%, respectively. Additionally, we observe that certain variants of these advanced models, despite having fewer parameters, outperform larger ResNet models. This highlights the critical role of utilizing sophisticated architectures and diverse pre-training strategies to enhance FDG performance, especially in scenarios with limited computational resources where model efficiency is crucial. Our results indicate that federated learning systems can become more adaptable and efficient by leveraging these advanced methods, offering valuable insights for future research in FDG.

[CV-23] Efficient and Discriminative Image Feature Extraction for Universal Image Retrieval

链接: https://arxiv.org/abs/2409.13513
作者: Morris Florek,David Tschirschwitz,Björn Barz,Volker Rodehorst
关键词-EN: Current image retrieval, image retrieval systems, face domain specificity, Current image, generalization issues
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current image retrieval systems often face domain specificity and generalization issues. This study aims to overcome these limitations by developing a computationally efficient training framework for a universal feature extractor that provides strong semantic image representations across various domains. To this end, we curated a multi-domain training dataset, called M4D-35k, which allows for resource-efficient training. Additionally, we conduct an extensive evaluation and comparison of various state-of-the-art visual-semantic foundation models and margin-based metric learning loss functions regarding their suitability for efficient universal feature extraction. Despite constrained computational resources, we achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with a mMP@5 of 0.721. This places our method at the second rank on the leaderboard, just 0.7 percentage points behind the best performing method. However, our model has 32% fewer overall parameters and 289 times fewer trainable parameters. Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points. We release our code and M4D-35k training set annotations at this https URL.

[CV-24] DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-light Enhancement and Deblurring

链接: https://arxiv.org/abs/2409.13496
作者: Ling Wang,Chen Wu,Lin Wang
关键词-EN: long exposure time, motion blur caused, RGB cameras, Autonomous vehicles, time of RGB
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Autonomous vehicles and robots often struggle with reliable visual perception at night due to the low illumination and motion blur caused by the long exposure time of RGB cameras. Existing methods address this challenge by sequentially connecting the off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these methods often lead to noticeable artifacts (\eg, color distortions) in the over-exposed regions or make it hardly possible to learn the motion cues of the dark regions. In this paper, we interestingly find vision-language models, \eg, Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which can jointly achieve low-light enhancement and deblurring, benefiting downstream tasks, such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels from images at night. This subtly enables learning rich semantic information and visual representation for optimization of the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module to obtain multi-scale patch-wise degradation heatmaps from the image embeddings. Then, the heatmaps are fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, our DAP-LED achieves state-of-the-art performance in the dark. Meanwhile, the enhanced results are demonstrated to be effective for three downstream tasks. For demo and more results, please check the project page: \urlthis https URL.

[CV-25] Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study

链接: https://arxiv.org/abs/2409.13476
作者: Tirtha Chanda,Sarah Haggenmueller,Tabea-Clara Bucher,Tim Holland-Letz,Harald Kittler,Philipp Tschandl,Markus V. Heppt,Carola Berking,Jochen S. Utikal,Bastian Schilling,Claudia Buerger,Cristian Navarrete-Dechent,Matthias Goebeler,Jakob Nikolas Kather,Carolin V. Schneider,Benjamin Durani,Hendrike Durani,Martin Jansen,Juliane Wacker,Joerg Wacker,Reader Study Consortium,Titus J. Brinker
关键词-EN: enhancing clinicians’ confidence, Artificial intelligence, substantially improved dermatologists’, AI-driven decisions, enhancing clinicians’
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems have substantially improved dermatologists’ diagnostic accuracy for melanoma, with explainable AI (XAI) systems further enhancing clinicians’ confidence and trust in AI-driven decisions. Despite these advancements, there remains a critical need for objective evaluation of how dermatologists engage with both AI and XAI tools. In this study, 76 dermatologists participated in a reader study, diagnosing 16 dermoscopic images of melanomas and nevi using an XAI system that provides detailed, domain-specific explanations. Eye-tracking technology was employed to assess their interactions. Diagnostic performance was compared with that of a standard AI system lacking explanatory features. Our findings reveal that XAI systems improved balanced diagnostic accuracy by 2.8 percentage points relative to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and complex lesions were associated with elevated cognitive load, as evidenced by increased ocular fixations. These insights have significant implications for clinical practice, the design of AI tools for visual tasks, and the broader development of XAI in medical diagnostics.

[CV-26] PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery

链接: https://arxiv.org/abs/2409.13475
作者: Jicheol Park,Dongwon Kim,Boseung Jeong,Suha Kwak
关键词-EN: employing free-form text, vast image collection, free-form text queries, Text-based person search, human part level
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-based person search, employing free-form text queries to identify individuals within a vast image collection, presents a unique challenge in aligning visual and textual representations, particularly at the human part level. Existing methods often struggle with part feature extraction and alignment due to the lack of direct part-level supervision and reliance on heuristic features. We propose a novel framework that leverages a part discovery module based on slot attention to autonomously identify and align distinctive parts across modalities, enhancing interpretability and retrieval accuracy without explicit part-level correspondence supervision. Additionally, text-based dynamic part attention adjusts the importance of each part, further improving retrieval outcomes. Our method is evaluated on three public benchmarks, significantly outperforming existing methods.

[CV-27] Robust Salient Object Detection on Compressed Images Using Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.13464
作者: Guibiao Liao,Wei Gao
关键词-EN: achieved substantial progress, SOD, Salient object detection, compressed images, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Salient object detection (SOD) has achieved substantial progress in recent years. In practical scenarios, compressed images (CI) serve as the primary medium for data transmission and storage. However, scant attention has been directed towards SOD for compressed images using convolutional neural networks (CNNs). In this paper, we are dedicated to strictly benchmarking and analyzing CNN-based salient object detection on compressed images. To comprehensively study this issue, we meticulously establish various CI SOD datasets from existing public SOD datasets. Subsequently, we investigate representative CNN-based SOD methods, assessing their robustness on compressed images (approximately 2.64 million images). Importantly, our evaluation results reveal two key findings: 1) current state-of-the-art CNN-based SOD models, while excelling on clean images, exhibit significant performance bottlenecks when applied to compressed images. 2) The principal factors influencing the robustness of CI SOD are rooted in the characteristics of compressed images and the limitations in saliency feature learning. Based on these observations, we propose a simple yet promising baseline framework that focuses on robust feature representation learning to achieve robust CNN-based CI SOD. Extensive experiments demonstrate the effectiveness of our approach, showcasing markedly improved robustness across various levels of image degradation, while maintaining competitive accuracy on clean data. We hope that our benchmarking efforts, analytical insights, and proposed techniques will contribute to a more comprehensive understanding of the robustness of CNN-based SOD algorithms, inspiring future research in the community.

[CV-28] Concept-Based Explanations in Computer Vision: Where Are We and Where Could We Go?

链接: https://arxiv.org/abs/2409.13456
作者: Jae Hee Lee,Georgii Mikriukov,Gesina Schwalbe,Stefan Wermter,Diedrich Wolter
关键词-EN: Concept-based XAI, reveal relevant regions, semantically meaningful parts, explaining neural vision, neural vision models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Concept-based XAI (C-XAI) approaches to explaining neural vision models are a promising field of research, since explanations that refer to concepts (i.e., semantically meaningful parts in an image) are intuitive to understand and go beyond saliency-based techniques that only reveal relevant regions. Given the remarkable progress in this field in recent years, it is time for the community to take a critical look at the advances and trends. Consequently, this paper reviews C-XAI methods to identify interesting and underexplored areas and proposes future research directions. To this end, we consider three main directions: the choice of concepts to explain, the choice of concept representation, and how we can control concepts. For the latter, we propose techniques and draw inspiration from the field of knowledge representation and learning, showing how this could enrich future C-XAI research.

[CV-29] owards the Discovery of Down Syndrome Brain Biomarkers Using Generative Models

链接: https://arxiv.org/abs/2409.13437
作者: Jordi Malé,Juan Fortea,Mateus Rozalem Aranha,Yann Heuzé,Neus Martínez-Abadías,Xavier Sevillano
关键词-EN: analyze brain morphology, Alzheimer disease, neurodevelopmental disorders, pinpointing regions, memory deficits
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Brain imaging has allowed neuroscientists to analyze brain morphology in genetic and neurodevelopmental disorders, such as Down syndrome, pinpointing regions of interest to unravel the neuroanatomical underpinnings of cognitive impairment and memory deficits. However, the connections between brain anatomy, cognitive performance and comorbidities like Alzheimer’s disease are still poorly understood in the Down syndrome population. The latest advances in artificial intelligence constitute an opportunity for developing automatic tools to analyze large volumes of brain magnetic resonance imaging scans, overcoming the bottleneck of manual analysis. In this study, we propose the use of generative models for detecting brain alterations in people with Down syndrome affected by various degrees of neurodegeneration caused by Alzheimer’s disease. To that end, we evaluate state-of-the-art brain anomaly detection models based on Variational Autoencoders and Diffusion Models, leveraging a proprietary dataset of brain magnetic resonance imaging scans. Following a comprehensive evaluation process, our study includes several key analyses. First, we conducted a qualitative evaluation by expert neuroradiologists. Second, we performed both quantitative and qualitative reconstruction fidelity studies for the generative models. Third, we carried out an ablation study to examine how the incorporation of histogram post-processing can enhance model performance. Finally, we executed a quantitative volumetric analysis of subcortical structures. Our findings indicate that some models effectively detect the primary alterations characterizing Down syndrome’s brain anatomy, including a smaller cerebellum, enlarged ventricles, and cerebral cortex reduction, as well as the parietal lobe alterations caused by Alzheimer’s disease.

[CV-30] Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling ECCV2024

链接: https://arxiv.org/abs/2409.13431
作者: Zixiao Wang,Hongtao Xie,YuXin Wang,Yadong Qu,Fengjun Guo,Pengwei Liu
关键词-EN: Existing scene text, expensive pixel-level labeling, insufficient training data, training data due, Existing scene
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretrain methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at this https URL.

[CV-31] CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

链接: https://arxiv.org/abs/2409.13430
作者: Zhangchen Ye,Tao Jiang,Chenfeng Xu,Yiming Li,Hang Zhao
关键词-EN: occupancy prediction, depth estimation, significantly challenged, inherent limitations, limitations of monocular
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \urlthis https URL.

[CV-32] HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device

链接: https://arxiv.org/abs/2409.13426
作者: Vladimir Guzov,Yifeng Jiang,Fangzhou Hong,Gerard Pons-Moll,Richard Newcombe,C. Karen Liu,Yuting Ye,Lingni Ma
关键词-EN: realistic full-body human, single head-mounted device, perform visual SLAM, outward-facing color camera, full-body human motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper investigates the online generation of realistic full-body human motion using a single head-mounted device with an outward-facing color camera and the ability to perform visual SLAM. Given the inherent ambiguity of this setup, we introduce a novel system, HMD ^2 , designed to balance between motion reconstruction and generation. From a reconstruction standpoint, our system aims to maximally utilize the camera streams to produce both analytical and learned features, including head motion, SLAM point cloud, and image embeddings. On the generative front, HMD ^2 employs a multi-modal conditional motion Diffusion model, incorporating a time-series backbone to maintain temporal coherence in generated motions, and utilizes autoregressive in-painting to facilitate online motion inference with minimal latency (0.17 seconds). Collectively, we demonstrate that our system offers a highly effective and robust solution capable of scaling to an extensive dataset of over 200 hours collected in a wide range of complex indoor and outdoor environments using publicly available smart glasses.

[CV-33] Occupancy-Based Dual Contouring SIGGRAPH

链接: https://arxiv.org/abs/2409.13418
作者: Jisung Hwang,Minhyuk Sung
关键词-EN: dual contouring, achieving computation times, Manifold Dual Contouring, dual contouring method, points
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted to SIGGRAPH Asia (conference) 2024. Code: this https URL

点击查看摘要

Abstract:We introduce a dual contouring method that provides state-of-the-art performance for occupancy functions while achieving computation times of a few seconds. Our method is learning-free and carefully designed to maximize the use of GPU parallelization. The recent surge of implicit neural representations has led to significant attention to occupancy fields, resulting in a wide range of 3D reconstruction and generation methods based on them. However, the outputs of such methods have been underestimated due to the bottleneck in converting the resulting occupancy function to a mesh. Marching Cubes tends to produce staircase-like artifacts, and most subsequent works focusing on exploiting signed distance functions as input also yield suboptimal results for occupancy functions. Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information. We introduce auxiliary 2D points that are used to compute local surface normals along with the 1D points, helping identify 3D points via the quadric error function. To search the 1D, 2D, and 3D points, we develop fast algorithms that are parallelizable across all grid edges, faces, and cells. Our experiments with several 3D neural generative models and a 3D mesh dataset demonstrate that our method achieves the best fidelity compared to prior works.

[CV-34] Sine Wave Normalization for Deep Learning-Based Tumor Segmentation in CT/PET Imaging

链接: https://arxiv.org/abs/2409.13410
作者: Jintao Ren,Muheng Li,Stine Sofia Korreman
关键词-EN: autoPET III Challenge, III Challenge, automated tumor segmentation, autoPET III, PET scans
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: Report for Team WukongRT in the AutoPET III Challenge

点击查看摘要

Abstract:This report presents a normalization block for automated tumor segmentation in CT/PET scans, developed for the autoPET III Challenge. The key innovation is the introduction of the SineNormal, which applies periodic sine transformations to PET data to enhance lesion detection. By highlighting intensity variations and producing concentric ring patterns in PET highlighted regions, the model aims to improve segmentation accuracy, particularly for challenging multitracer PET datasets. The code for this project is available on GitHub (this https URL).

[CV-35] Evaluating the plausibility of synthetic images for improving automated endoscopic stone recognition

链接: https://arxiv.org/abs/2409.13409
作者: Ruben Gonzalez-Perez,Francisco Lopez-Tiro,Ivan Reyes-Amezcua,Eduardo Falcon-Morales,Rosa-Maria Rodriguez-Gueant,Jacques Hubert,Michel Daudon,Gilberto Ochoa-Ruiz,Christian Daul
关键词-EN: establishing personalized treatment, Morpho-Constitutional Analysis, Endoscopic Stone Recognition, kidney stone formation, avoid relapses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures, 1 table, conference paper

点击查看摘要

Abstract:Currently, the Morpho-Constitutional Analysis (MCA) is the de facto approach for the etiological diagnosis of kidney stone formation, and it is an important step for establishing personalized treatment to avoid relapses. More recently, research has focused on performing such tasks intra-operatively, an approach known as Endoscopic Stone Recognition (ESR). Both methods rely on features observed in the surface and the section of kidney stones to separate the analyzed samples into several sub-groups. However, given the high intra-observer variability and the complex operating conditions found in ESR, there is a lot of interest in using AI for computer-aided diagnosis. However, current AI models require large datasets to attain a good performance and for generalizing to unseen distributions. This is a major problem as large labeled datasets are very difficult to acquire, and some classes of kidney stones are very rare. Thus, in this paper, we present a method based on diffusion as a way of augmenting pre-existing ex-vivo kidney stone datasets. Our aim is to create plausible diverse kidney stone images that can be used for pre-training models using ex-vivo data. We show that by mixing natural and synthetic images of CCD images, it is possible to train models capable of performing very well on unseen intra-operative data. Our results show that is possible to attain an improvement of 10% in terms of accuracy compared to a baseline model pre-trained only on ImageNet. Moreover, our results show an improvement of 6% for surface images and 10% for section images compared to a model train on CCD images only, which demonstrates the effectiveness of using synthetic images.

[CV-36] Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

链接: https://arxiv.org/abs/2409.13407
作者: Li Zhou,Xu Yuan,Zenghui Sun,Zikun Zhou,Jingsong Lan
关键词-EN: Large Multimodal Models, extending large language, large language models, Multi-Granularity Large Multimodal, Large Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code and dataset will be released at this https URL . 7 pages, 4 figures with Supplementary Material

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation through the integration of segmentation models.Despite the innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even provided with detailed textual this http URL overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name such a new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research.Besides, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The great performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

[CV-37] Validation Exploration of Multimodal Deep-Learning Camera-Lidar Calibration models

链接: https://arxiv.org/abs/2409.13402
作者: Venkat Karramreddy,Liam Mitchell
关键词-EN: multi-modal sensor systems, implementing deep learning, deep learning architectures, article presents, presents an innovative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:This article presents an innovative study in exploring, evaluating, and implementing deep learning architectures for the calibration of multi-modal sensor systems. The focus behind this is to leverage the use of sensor fusion to achieve dynamic, real-time alignment between 3D LiDAR and 2D Camera sensors. static calibration methods are tedious and time-consuming, which is why we propose utilizing Conventional Neural Networks (CNN) coupled with geometrically informed learning to solve this issue. We leverage the foundational principles of Extrinsic LiDAR-Camera Calibration tools such as RegNet, CalibNet, and LCCNet by exploring open-source models that are available online and comparing our results with their corresponding research papers. Requirements for extracting these visual and measurable outputs involved tweaking source code, fine-tuning, training, validation, and testing for each of these frameworks for equal comparisons. This approach aims to investigate which of these advanced networks produces the most accurate and consistent predictions. Through a series of experiments, we reveal some of their shortcomings and areas for potential improvements along the way. We find that LCCNet yields the best results out of all the models that we validated.

[CV-38] PointSAM: Pointly-Supervised Segment Anything Model for Remote Sensing Images

链接: https://arxiv.org/abs/2409.13401
作者: Nanqing Liu,Xun Xu,Yongyi Su,Haojie Zhang,Heng-Chao Li
关键词-EN: remote sensing images, advanced foundational model, widely applied, advanced foundational, applied to remote
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages

点击查看摘要

Abstract:Segment Anything Model (SAM) is an advanced foundational model for image segmentation, widely applied to remote sensing images (RSIs). Due to the domain gap between RSIs and natural images, traditional methods typically use SAM as a source pre-trained model and fine-tune it with fully supervised masks. Unlike these methods, our work focuses on fine-tuning SAM using more convenient and challenging point annotations. Leveraging SAM’s zero-shot capabilities, we adopt a self-training framework that iteratively generates pseudo-labels for training. However, if the pseudo-labels contain noisy labels, there is a risk of error accumulation. To address this issue, we extract target prototypes from the target dataset and use the Hungarian algorithm to match them with prediction prototypes, preventing the model from learning in the wrong direction. Additionally, due to the complex backgrounds and dense distribution of objects in RSI, using point prompts may result in multiple objects being recognized as one. To solve this problem, we propose a negative prompt calibration method based on the non-overlapping nature of instance masks. In brief, we use the prompts of overlapping masks as corresponding negative signals, resulting in refined masks. Combining the above methods, we propose a novel Pointly-supervised Segment Anything Model named PointSAM. We conduct experiments on RSI datasets, including WHU, HRSID, and NWPU VHR-10, and the results show that our method significantly outperforms direct testing with SAM, SAM2, and other comparison methods. Furthermore, we introduce PointSAM as a point-to-box converter and achieve encouraging results, suggesting that this method can be extended to other point-supervised tasks. The code is available at this https URL.

[CV-39] Elite-EvGS: Learning Event-based 3D Gaussian Splatting by Distilling Event-to-Video Priors

链接: https://arxiv.org/abs/2409.13392
作者: Zixin Zhang,Kanghao Chen,Lin Wang
关键词-EN: bio-inspired sensors, sensors that output, output asynchronous, sparse event streams, Event cameras
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event cameras are bio-inspired sensors that output asynchronous and sparse event streams, instead of fixed frames. Benefiting from their distinct advantages, such as high dynamic range and high temporal resolution, event cameras have been applied to address 3D reconstruction, important for robotic mapping. Recently, neural rendering techniques, such as 3D Gaussian splatting (3DGS), have been shown successful in 3D reconstruction. However, it still remains under-explored how to develop an effective event-based 3DGS pipeline. In particular, as 3DGS typically depends on high-quality initialization and dense multiview constraints, a potential problem appears for the 3DGS optimization with events given its inherent sparse property. To this end, we propose a novel event-based 3DGS framework, named Elite-EvGS. Our key idea is to distill the prior knowledge from the off-the-shelf event-to-video (E2V) models to effectively reconstruct 3D scenes from events in a coarse-to-fine optimization manner. Specifically, to address the complexity of 3DGS initialization from events, we introduce a novel warm-up initialization strategy that optimizes a coarse 3DGS from the frames generated by E2V models and then incorporates events to refine the details. Then, we propose a progressive event supervision strategy that employs the window-slicing operation to progressively reduce the number of events used for supervision. This subtly relives the temporal randomness of the event frames, benefiting the optimization of local textural and global structural details. Experiments on the benchmark datasets demonstrate that Elite-EvGS can reconstruct 3D scenes with better textural and structural details. Meanwhile, our method yields plausible performance on the captured real-world data, including diverse challenging conditions, such as fast motion and low light scenes.

[CV-40] Feature-Centered First Order Structure Tensor Scale-Space in 2D and 3D

链接: https://arxiv.org/abs/2409.13389
作者: Pawel Tomasz Pieta,Anders Bjorholm Dahl,Jeppe Revall Frisvad,Siavash Arjomand Bigdeli,Anders Nymark Christensen
关键词-EN: order structure tensor, structure tensor scale-space, structure tensor, structure tensor method, analysis of imaged
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The structure tensor method is often used for 2D and 3D analysis of imaged structures, but its results are in many cases very dependent on the user’s choice of method parameters. We simplify this parameter choice in first order structure tensor scale-space by directly connecting the width of the derivative filter to the size of image features. By introducing a ring-filter step, we substitute the Gaussian integration/smoothing with a method that more accurately shifts the derivative filter response from feature edges to their center. We further demonstrate how extracted structural measures can be used to correct known inaccuracies in the scale map, resulting in a reliable representation of the feature sizes both in 2D and 3D. Compared to the traditional first order structure tensor, or previous structure tensor scale-space approaches, our solution is much more accurate and can serve as an out-of-the-box method for extracting a wide range of structural parameters with minimal user input.

[CV-41] RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning

链接: https://arxiv.org/abs/2409.13366
作者: Wenhui Diao,Haichen Yu,Kaiyue Kang,Tong Ling,Di Liu,Yingchao Feng,Hanbo Bi,Libo Ren,Xuexue Li,Yongqiang Mao,Xian Sun
关键词-EN: Aerial Remote Sensing, Aerial Remote, Remote Sensing, pose significant challenges, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model’s detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model’s adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and effectiveness of RingMo-Aerial in enhancing the performance of ARS vision tasks.

[CV-42] ID-Guard: A Universal Framework for Combating Facial Manipulation via Breaking Identification

链接: https://arxiv.org/abs/2409.13349
作者: Zuomin Qu,Wei Lu,Xiangyang Luo,Qian Wang,Xiaochun Cao
关键词-EN: deep learning-based facial, learning-based facial manipulation, facial manipulation poses, misuse of deep, deep learning-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The misuse of deep learning-based facial manipulation poses a potential threat to civil rights. To prevent this fraud at its source, proactive defense technology was proposed to disrupt the manipulation process by adding invisible adversarial perturbations into images, making the forged output unconvincing to the observer. However, their non-directional disruption of the output may result in the retention of identity information of the person in the image, leading to stigmatization of the individual. In this paper, we propose a novel universal framework for combating facial manipulation, called ID-Guard. Specifically, this framework requires only a single forward pass of an encoder-decoder network to generate a cross-model universal adversarial perturbation corresponding to a specific facial image. To ensure anonymity in manipulated facial images, a novel Identity Destruction Module (IDM) is introduced to destroy the identifiable information in forged faces targetedly. Additionally, we optimize the perturbations produced by considering the disruption towards different facial manipulations as a multi-task learning problem and design a dynamic weights strategy to improve cross-model performance. The proposed framework reports impressive results in defending against multiple widely used facial manipulations, effectively distorting the identifiable regions in the manipulated facial images. In addition, our experiments reveal the ID-Guard’s ability to enable disrupted images to avoid face inpaintings and open-source image recognition systems.

[CV-43] V-Hands: Touchscreen-based Hand Tracking for Remote Whiteboard Interaction

链接: https://arxiv.org/abs/2409.13347
作者: Xinshuang Liu,Yizhong Zhang,Xin Tong
关键词-EN: immersive user experience, user experience, seamless integration, integration of drawn, drawn content
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:In whiteboard-based remote communication, the seamless integration of drawn content and hand-screen interactions is essential for an immersive user experience. Previous methods either require bulky device setups for capturing hand gestures or fail to accurately track the hand poses from capacitive images. In this paper, we present a real-time method for precise tracking 3D poses of both hands from capacitive video frames. To this end, we develop a deep neural network to identify hands and infer hand joint positions from capacitive frames, and then recover 3D hand poses from the hand-joint positions via a constrained inverse kinematic solver. Additionally, we design a device setup for capturing high-quality hand-screen interaction data and obtained a more accurate synchronized capacitive video and hand pose dataset. Our method improves the accuracy and stability of 3D hand tracking for capacitive frames while maintaining a compact device setup for remote communication. We validate our scheme design and its superior performance on 3D hand pose tracking and demonstrate the effectiveness of our method in whiteboard-based remote communication. Our code, model, and dataset are available at this https URL.

[CV-44] Imagine yourself: Tuning-Free Personalized Image Generation

链接: https://arxiv.org/abs/2409.13346
作者: Zecheng He,Bo Sun,Felix Juefei-Xu,Haoyu Ma,Ankit Ramchandani,Vincent Cheung,Siddharth Shah,Anmol Kalia,Harihar Subramanyam,Alireza Zareian,Li Chen,Ankit Jain,Ning Zhang,Peizhao Zhang,Roshan Sumbaly,Peter Vajda,Animesh Sinha
关键词-EN: demonstrated remarkable efficacy, Diffusion models, demonstrated remarkable, remarkable efficacy, Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work met challenges balancing identity preservation, following complex prompts and preserving good visual quality, resulting in models having strong copy-paste effect of the reference images. Thus, they can hardly generate images following prompts that require significant changes to the reference image, \eg, changing facial expression, head and body poses, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve the text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model’s SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to the previous personalization models.

[CV-45] A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing

链接: https://arxiv.org/abs/2409.13345
作者: Yi Ren,Tianyi Zhang,Zhixiong Han,Weibin Li,Zhiyang Wang,Wenbo Ji,Chenhao Qin,Chenbin Liang,Licheng Jiao
关键词-EN: adaptive fine-tuning algorithm, propose an adaptive, adaptive fine-tuning, data, algorithm
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose an adaptive fine-tuning algorithm for multimodal large models. The core steps of this algorithm involve two stages of truncation. First, the vast amount of data is projected into a semantic vector space, and the MiniBatchKMeans algorithm is used for automated clustering. This classification ensures that the data within each cluster exhibit high semantic similarity. Next, we process the data in each cluster, calculating the translational difference between the original and perturbed data in the multimodal large model’s vector space. This difference serves as a generalization metric for the data. Based on this metric, we select the data with high generalization potential for training. We applied this algorithm to train the InternLM-XComposer2-VL-7B model on two 3090 GPUs using one-third of the GeoChat multimodal remote sensing dataset. The results demonstrate that our algorithm outperforms the state-of-the-art baselines. various baselines. The model trained on our optimally chosen one-third dataset, based on experimental validation, exhibited only 1% reduction in performance across various remote sensing metrics compared to the model trained on the full dataset. This approach significantly preserved general-purpose capabilities while reducing training time by 68.2%. Furthermore, the model achieved scores of 89.86 and 77.19 on the UCMerced and AID evaluation datasets, respectively, surpassing the GeoChat dataset by 5.43 and 5.16 points. It only showed a 0.91-point average decrease on the LRBEN evaluation dataset.

[CV-46] Enhancing Fruit and Vegetable Detection in Unconstrained Environment with a Novel Dataset

链接: https://arxiv.org/abs/2409.13330
作者: Sandeep Khanna,Chiranjoy Chattopadhyay,Suman Kundu
关键词-EN: ensuring food quality, sustainable farming practices, fruits and vegetables, improving efficiency, modernizing agriculture
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, 8 figures, 6 tables, Scientia Horticulturae

点击查看摘要

Abstract:Automating the detection of fruits and vegetables using computer vision is essential for modernizing agriculture, improving efficiency, ensuring food quality, and contributing to technologically advanced and sustainable farming practices. This paper presents an end-to-end pipeline for detecting and localizing fruits and vegetables in real-world scenarios. To achieve this, we have curated a dataset named FRUVEG67 that includes images of 67 classes of fruits and vegetables captured in unconstrained scenarios, with only a few manually annotated samples per class. We have developed a semi-supervised data annotation algorithm (SSDA) that generates bounding boxes for objects to label the remaining non-annotated images. For detection, we introduce the Fruit and Vegetable Detection Network (FVDNet), an ensemble version of YOLOv7 featuring three distinct grid configurations. We employ an averaging approach for bounding-box prediction and a voting mechanism for class prediction. We have integrated Jensen-Shannon divergence (JSD) in conjunction with focal loss to better detect smaller objects. Our experimental results highlight the superiority of FVDNet compared to previous versions of YOLO, showcasing remarkable improvements in detection and localization performance. We achieved an impressive mean average precision (mAP) score of 0.78 across all classes. Furthermore, we evaluated the efficacy of FVDNet using open-category refrigerator images, where it demonstrates promising results.

[CV-47] owards Semi-supervised Dual-modal Semantic Segmentation

链接: https://arxiv.org/abs/2409.13325
作者: Qiulei Dong,Jianan Li,Shuang Deng
关键词-EN: point clouds, obtain point clouds, unlabeled point clouds, data acquisition techniques, acquisition techniques
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual-modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, massive point-wise and pixel-wise labeling procedures are time-consuming and labor-intensive. To address this issue, we propose a parallel dual-stream network to handle the semi-supervised dual-modal semantic segmentation task, called PD-Net, by jointly utilizing a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. The proposed PD-Net consists of two parallel streams (called original stream and pseudo-label prediction stream). The pseudo-label prediction stream predicts the pseudo labels of unlabeled point clouds and their corresponding images. Then, the unlabeled data is sent to the original stream for self-training. Each stream contains two encoder-decoder branches for 3D and 2D data respectively. In each stream, multiple dual-modal fusion modules are explored for fusing the dual-modal features. In addition, a pseudo-label optimization module is explored to optimize the pseudo labels output by the pseudo-label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD-Net not only outperforms the comparative semi-supervised methods but also achieves competitive performances with some fully-supervised methods in most cases.

[CV-48] SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

链接: https://arxiv.org/abs/2409.13321
作者: Jinge Wu,Yunsoo Kim,Daqian Shi,David Cliffton,Fenglin Liu,Honghan Wu
关键词-EN: growing research interest, assist clinicians, large language models, success of large, growing research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re ^3 Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.

[CV-49] Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence

链接: https://arxiv.org/abs/2409.13291
作者: Alessandro Riva,Alessandro Raganato,Simone Melzi
关键词-EN: Current data-driven methodologies, point cloud matching, presenting significant challenges, matching demand extensive, cloud matching demand
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Current data-driven methodologies for point cloud matching demand extensive training time and computational resources, presenting significant challenges for model deployment and application. In the point cloud matching task, recent advancements with an encoder-only Transformer architecture have revealed the emergence of semantically meaningful patterns in the attention heads, particularly resembling Gaussian functions centered on each point of the input shape. In this work, we further investigate this phenomenon by integrating these patterns as fixed attention weights within the attention heads of the Transformer architecture. We evaluate two variants: one utilizing predetermined variance values for the Gaussians, and another where the variance values are treated as learnable parameters. Additionally we analyze the performances on noisy data and explore a possible way to improve robustness to noise. Our findings demonstrate that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization. Furthermore, we conducted an ablation study to identify the specific layers where the infused information is most impactful and to understand the reliance of the network on this information.

[CV-50] me Distributed Deep Learning models for Purely Exogenous Forecasting. Application to Water Table Depth Prediction using Weather Image Time Series

链接: https://arxiv.org/abs/2409.13284
作者: Matteo Salis,Abdourrahmane M. Atto,Stefano Ferraris,Rosa Meo
关键词-EN: resources management framework, sustainable resources management, Groundwater resources, Time Distributed Convolutional, management framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Groundwater resources are one of the most relevant elements in the water cycle, therefore developing models to accurately predict them is a pivotal task in the sustainable resources management framework. Deep Learning (DL) models have been revealed very effective in hydrology, especially by feeding spatially distributed data (e.g. raster data). In many regions, hydrological measurements are difficult to obtain regularly or periodically in time, and in some cases, last available data are not up to date. Reversely, weather data, which significantly impacts water resources, are usually more available and with higher quality. More specifically, we have proposed two different DL models to predict the water table depth in the Grana-Maira catchment (Piemonte, IT) using only exogenous weather image time series. To deal with the image time series, both models are made of a first Time Distributed Convolutional Neural Network (TDC) which encodes the image available at each time step into a vectorial representation. The first model, TDC-LSTM uses then a Sequential Module based on an LSTM layer to learn temporal relations and output the predictions. The second model, TDC-UnPWaveNet uses instead a new version of the WaveNet architecture, adapted here to output a sequence shorter and completely shifted in the future with respect to the input one. To this aim, and to deal with the different sequence lengths in the UnPWaveNet, we have designed a new Channel Distributed layer, that acts like a Time Distributed one but on the channel dimension, i.e. applying the same set of operations to each channel of the input. TDC-LSTM and TDC-UnPWaveNet have shown both remarkable results. However, the two models have focused on different learnable information: TDC-LSTM has focused more on lowering the bias, while the TDC-UnPWaveNet has focused more on the temporal dynamics maximising correlation and KGE.

[CV-51] Adaptive Margin Global Classifier for Exemplar-Free Class-Incremental Learning

链接: https://arxiv.org/abs/2409.13275
作者: Zhongren Yao,Xiaobin Chang
关键词-EN: Exemplar-free class-incremental learning, Exemplar-free class-incremental, presents a significant, class samples, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Exemplar-free class-incremental learning (EFCIL) presents a significant challenge as the old class samples are absent for new task learning. Due to the severe imbalance between old and new class samples, the learned classifiers can be easily biased toward the new ones. Moreover, continually updating the feature extractor under EFCIL can compromise the discriminative power of old class features, e.g., leading to less compact and more overlapping distributions across classes. Existing methods mainly focus on handling biased classifier learning. In this work, both cases are considered using the proposed method. Specifically, we first introduce a Distribution-Based Global Classifier (DBGC) to avoid bias factors in existing methods, such as data imbalance and sampling. More importantly, the compromised distributions of old classes are simulated via a simple operation, variance enlarging (VE). Incorporating VE based on DBGC results in a novel classification loss for EFCIL. This loss is proven equivalent to an Adaptive Margin Softmax Cross Entropy (AMarX). The proposed method is thus called Adaptive Margin Global Classifier (AMGC). AMGC is simple yet effective. Extensive experiments show that AMGC achieves superior image classification results on its own under a challenging EFCIL setting. Detailed analysis is also provided for further demonstration.

[CV-52] JoyHallo: Digital human model for Mandarin

链接: https://arxiv.org/abs/2409.13268
作者: Sheng Shi,Xuyang Cao,Jun Zhao,Guoxin Wang
关键词-EN: presents significant challenges, videos presents significant, creating Mandarin videos, significant challenges, Mandarin videos presents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities. The code and models are available at this https URL.

[CV-53] 2M-X: Learning Expressive Text-to-Motion Generation from Partially Annotated Data

链接: https://arxiv.org/abs/2409.13251
作者: Mingdian Liu,Yilin Liu,Gurunandan Krishnan,Karl S Bayer,Bing Zhou
关键词-EN: profoundly impact animation, impact animation production, humanoid animation, impact animation, text prompts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 4 figures, conference paper

点击查看摘要

Abstract:The generation of humanoid animation from text prompts can profoundly impact animation production and AR/VR experiences. However, existing methods only generate body motion data, excluding facial expressions and hand movements. This limitation, primarily due to a lack of a comprehensive whole-body motion dataset, inhibits their readiness for production use. Recent attempts to create such a dataset have resulted in either motion inconsistency among different body parts in the artificially augmented data or lower quality in the data extracted from RGB videos. In this work, we propose T2M-X, a two-stage method that learns expressive text-to-motion generation from partially annotated data. T2M-X trains three separate Vector Quantized Variational AutoEncoders (VQ-VAEs) for body, hand, and face on respective high-quality data sources to ensure high-quality motion outputs, and a Multi-indexing Generative Pretrained Transformer (GPT) model with motion consistency loss for motion generation and coordination among different body parts. Our results show significant improvements over the baselines both quantitatively and qualitatively, demonstrating its robustness against the dataset limitations.

[CV-54] Deep Generative Adversarial Network for Occlusion Removal from a Single Image

链接: https://arxiv.org/abs/2409.13242
作者: Sankaraganesh Jonna,Moushumi Medhi,Rajiv Ranjan Sahay
关键词-EN: in-expensive imaging devices, enhanced capabilities, capabilities of in-expensive, devices have led, tremendous increase
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Nowadays, the enhanced capabilities of in-expensive imaging devices have led to a tremendous increase in the acquisition and sharing of multimedia content over the Internet. Despite advances in imaging sensor technology, annoying conditions like \textitocclusions hamper photography and may deteriorate the performance of applications such as surveillance, detection, and recognition. Occlusion segmentation is difficult because of scale variations, illumination changes, and so on. Similarly, recovering a scene from foreground occlusions also poses significant challenges due to the complexity of accurately estimating the occluded regions and maintaining coherence with the surrounding context. In particular, image de-fencing presents its own set of challenges because of the diverse variations in shape, texture, color, patterns, and the often cluttered environment. This study focuses on the automatic detection and removal of occlusions from a single image. We propose a fully automatic, two-stage convolutional neural network for fence segmentation and occlusion completion. We leverage generative adversarial networks (GANs) to synthesize realistic content, including both structure and texture, in a single shot for inpainting. To assess zero-shot generalization, we evaluated our trained occlusion detection model on our proposed fence-like occlusion segmentation dataset. The dataset can be found on GitHub.

[CV-55] 3D-GSW: 3D Gaussian Splatting Watermark for Protecting Copyrights in Radiance Fields

链接: https://arxiv.org/abs/2409.13222
作者: Youngdong Jang,Hyunje Park,Feng Yang,Heeju Ko,Euijin Choo,Sangpil Kim
关键词-EN: Gaussian splatting, Gaussian, Gaussian Contribution Vector, Gaussian splatting model, Gaussian Contribution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian splatting has been getting a lot of attention as an innovative method for representing 3D space due to rapid rendering and image quality. However, copyright protection for the 3D Gaussian splatting has not yet been introduced. In this paper, we present a novel watermarking method for 3D Gaussian splatting. The proposed method embeds a binary message into 3D Gaussians by fine-tuning the pre-trained 3D Gaussian splatting model. To achieve this, we present Frequency-Guided Densification (FGD) that utilizes Discrete Fourier Transform to find patches with high-frequencies and split 3D Gaussians based on 3D Gaussian Contribution Vector. It is each 3D Gaussian contribution to rendered pixel colors, improving both rendering quality and bit accuracy. Furthermore, we modify an adaptive gradient mask to enhance rendering quality. Our experiments show that our method can embed a watermark in 3D Gaussians imperceptibly with increased capacity and robustness against attacks. Our method reduces optimization cost and achieves state-of-the-art performance compared to other methods.

[CV-56] Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models

链接: https://arxiv.org/abs/2409.13174
作者: Hao Cheng,Erjia Xiao,Chengyuan Yu,Zhao Yao,Jiahang Cao,Qiang Zhang,Jiaxu Wang,Mengshu Sun,Kaidi Xu,Jindong Gu,Renjing Xu
关键词-EN: Large Language Models, Language Action Models, Vision Language Action, Multimodal Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, driven by advancements in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) are being proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Since manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during the execution of this task is always a very critical issue. In this paper, by synthesizing current safety research on MLLMs and the specific application scenarios of the manipulation task in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP) that can incorporate as many visual modal physical threats as possible for evaluating the physical robustness of VLAMs. The physical threats in PVEP specifically include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after being attacked, we provide generalizable \textbf\textitAnalyses of how VLAMs respond to different physical security threats.

[CV-57] Bilateral Sharpness-Aware Minimization for Flatter Minima

链接: https://arxiv.org/abs/2409.13173
作者: Jiaxin Deng,Junbiao Pang,Baochang Zhang,Qingming Huang
关键词-EN: Flatness Indicator Problem, Flatness Indicator, SAM, Sharpness-Aware Minimization, reducing a Max-Sharpness
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS). Despite the practical success, we empirically found that the MAxS behind SAM’s generalization enhancements face the “Flatness Indicator Problem” (FIP), where SAM only considers the flatness in the direction of gradient ascent, resulting in a next minimization region that is not sufficiently flat. A better Flatness Indicator (FI) would bring a better generalization of neural networks. Because SAM is a greedy search method in nature. In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS). By merging MaxS and MinS, we created a better FI that indicates a flatter direction during the optimization. Specially, we combine this FI with SAM into the proposed Bilateral SAM (BSAM) which finds a more flatter minimum than that of SAM. The theoretical analysis proves that BSAM converges to local minima. Extensive experiments demonstrate that BSAM offers superior generalization performance and robustness compared to vanilla SAM across various tasks, i.e., classification, transfer learning, human pose estimation, and network quantization. Code is publicly available at: this https URL.

[CV-58] owards Zero-shot Point Cloud Anomaly Detection: A Multi-View Projection Framework

链接: https://arxiv.org/abs/2409.13162
作者: Yuqi Cheng,Yunkang Cao,Guoyang Xie,Zhichao Lu,Weiming Shen
关键词-EN: early-stage production constraints, face challenges due, data acquisition costs, Detecting anomalies, traditional unsupervised methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detecting anomalies within point clouds is crucial for various industrial applications, but traditional unsupervised methods face challenges due to data acquisition costs, early-stage production constraints, and limited generalization across product categories. To overcome these challenges, we introduce the Multi-View Projection (MVP) framework, leveraging pre-trained Vision-Language Models (VLMs) to detect anomalies. Specifically, MVP projects point cloud data into multi-view depth images, thereby translating point cloud anomaly detection into image anomaly detection. Following zero-shot image anomaly detection methods, pre-trained VLMs are utilized to detect anomalies on these depth images. Given that pre-trained VLMs are not inherently tailored for zero-shot point cloud anomaly detection and may lack specificity, we propose the integration of learnable visual and adaptive text prompting techniques to fine-tune these VLMs, thereby enhancing their detection performance. Extensive experiments on the MVTec 3D-AD and Real3D-AD demonstrate our proposed MVP framework’s superior zero-shot anomaly detection performance and the prompting techniques’ effectiveness. Real-world evaluations on automotive plastic part inspection further showcase that the proposed method can also be generalized to practical unseen scenarios. The code is available at this https URL.

[CV-59] High-Fidelity Mask-free Neural Surface Reconstruction for Virtual Reality

链接: https://arxiv.org/abs/2409.13158
作者: Haotian Bai,Yize Chen,Lin Wang
关键词-EN: creating editable digital, editable digital assets, images is crucial, crucial in creating, creating editable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object-centric surface reconstruction from multi-view images is crucial in creating editable digital assets for AR/VR. Due to the lack of geometric constraints, existing methods, e.g., NeuS necessitate annotating the object masks to reconstruct compact surfaces in mesh processing. Mask annotation, however, incurs considerable labor costs due to its cumbersome nature. This paper presents Hi-NeuS, a novel rendering-based framework for neural implicit surface reconstruction, aiming to recover compact and precise surfaces without multi-view object masks. Our key insight is that the overlapping regions in the object-centric views naturally highlight the object of interest as the camera orbits around objects. The object of interest can be specified by estimating the distribution of the rendering weights accumulated from multiple views, which implicitly identifies the surface that a user intends to capture. This inspires us to design a geometric refinement approach, which takes multi-view rendering weights to guide the signed distance functions (SDF) of neural surfaces in a self-supervised manner. Specifically, it retains these weights to resample a pseudo surface based on their distribution. This facilitates the alignment of the SDF to the object of interest. We then regularize the SDF’s bias for geometric consistency. Moreover, we propose to use unmasked Chamfer Distance(CD) to measure the extracted mesh without post-processing for more precise evaluation. Our approach has been validated through NeuS and its variant Neuralangelo, demonstrating its adaptability across different NeuS backbones. Extensive benchmark on the DTU dataset shows that our method reduces surface noise by about 20%, and improves the unmasked CD by around 30%, achieving better surface details. The superiority of Hi-NeuS is further validated on BlendedMVS and handheld camera captures for content creation.

[CV-60] Beyond Skip Connection: Pooling and Unpooling Design for Elimination Singularities

链接: https://arxiv.org/abs/2409.13154
作者: Chengkun Sun,Jinqian Pan,Juoli Jin,Russell Stevens Terry,Jiang Bian,Jie Xu
关键词-EN: Convolutional Neural Networks, deep Convolutional Neural, presents unique challenges, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Training deep Convolutional Neural Networks (CNNs) presents unique challenges, including the pervasive issue of elimination singularities, consistent deactivation of nodes leading to degenerate manifolds within the loss landscape. These singularities impede efficient learning by disrupting feature propagation. To mitigate this, we introduce Pool Skip, an architectural enhancement that strategically combines a Max Pooling, a Max Unpooling, a 3 times 3 convolution, and a skip connection. This configuration helps stabilize the training process and maintain feature integrity across layers. We also propose the Weight Inertia hypothesis, which underpins the development of Pool Skip, providing theoretical insights into mitigating degradation caused by elimination singularities through dimensional and affine compensation. We evaluate our method on a variety of benchmarks, focusing on both 2D natural and 3D medical imaging applications, including tasks such as classification and segmentation. Our findings highlight Pool Skip’s effectiveness in facilitating more robust CNN training and improving model performance.

[CV-61] Learning Visual Information Utility with PIXER

链接: https://arxiv.org/abs/2409.13151
作者: Yash Turkar,Timothy Chase Jr,Christo Aluckal,Karthik Dantu
关键词-EN: including autonomous robotics, Accurate feature detection, computer vision tasks, medical imaging, Accurate feature
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Accurate feature detection is fundamental for various computer vision tasks, including autonomous robotics, 3D reconstruction, medical imaging, and remote sensing. Despite advancements in enhancing the robustness of visual features, no existing method measures the utility of visual information before processing by specific feature-type algorithms. To address this gap, we introduce PIXER and the concept of “Featureness,” which reflects the inherent interest and reliability of visual information for robust recognition, independent of any specific feature type. Leveraging a generalization on Bayesian learning, our approach quantifies both the probability and uncertainty of a pixel’s contribution to robust visual utility in a single-shot process, avoiding costly operations such as Monte Carlo sampling and permitting customizable featureness definitions adaptable to a wide range of applications. We evaluate PIXER on visual odometry with featureness selectivity, achieving an average of 31% improvement in RMSE trajectory with 49% fewer features.

[CV-62] UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition

链接: https://arxiv.org/abs/2409.13148
作者: Zhenrong Zhang,Shuhang Liu,Pengfei Hu,Jiefeng Ma,Jun Du,Jianshu Zhang,Yu Hu
关键词-EN: analyzing large volumes, structure recognition technology, table structure, table structure recognition, digital era
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the digital era, table structure recognition technology is a critical tool for processing and analyzing large volumes of tabular data. Previous methods primarily focus on visual aspects of table structure recovery but often fail to effectively comprehend the textual semantics within tables, particularly for descriptive textual cells. In this paper, we introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model. UniTabNet employs a ``divide-and-conquer’’ strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure. We further enhance our framework with the Vision Guider, which directs the model’s focus towards pertinent areas, thereby boosting prediction accuracy. Additionally, we introduce the Language Guider to refine the model’s capability to understand textual semantics in table images. Evaluated on prominent table structure datasets such as PubTabNet, PubTables1M, WTW, and iFLYTAB, UniTabNet achieves a new state-of-the-art performance, demonstrating the efficacy of our approach. The code will also be made publicly available.

[CV-63] Score-Based Multibeam Point Cloud Denoising

链接: https://arxiv.org/abs/2409.13143
作者: Li Ling,Yiping Xie,Nils Bore,John Folkesson
关键词-EN: Multibeam echo-sounder, cheaper MBES sensors, MBES, de-facto sensor, bathymetry mapping
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to 2024 IEEE OES AUV Symposium

点击查看摘要

Abstract:Multibeam echo-sounder (MBES) is the de-facto sensor for bathymetry mapping. In recent years, cheaper MBES sensors and global mapping initiatives have led to exponential growth of available data. However, raw MBES data contains 1-25% of noise that requires semi-automatic filtering using tools such as Combined Uncertainty and Bathymetric Estimator (CUBE). In this work, we draw inspirations from the 3D point cloud community and adapted a score-based point cloud denoising network for MBES outlier detection and denoising. We trained and evaluated this network on real MBES survey data. The proposed method was found to outperform classical methods, and can be readily integrated into existing MBES standard workflow. To facilitate future research, the code and pretrained model are available online.

[CV-64] Interpret the Predictions of Deep Networks via Re-Label Distillation ICME2021

链接: https://arxiv.org/abs/2409.13137
作者: Yingying Hua,Shiming Ge,Daichi Zhang
关键词-EN: Interpreting the predictions, synthetic images, black-box deep network, facilitate the reliability, deep network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Published by IEEE ICME 2021

点击查看摘要

Abstract:Interpreting the predictions of a black-box deep network can facilitate the reliability of its deployment. In this work, we propose a re-label distillation approach to learn a direct map from the input to the prediction in a self-supervision manner. The image is projected into a VAE subspace to generate some synthetic images by randomly perturbing its latent vector. Then, these synthetic images can be annotated into one of two classes by identifying whether their labels shift. After that, using the labels annotated by the deep network as teacher, a linear student model is trained to approximate the annotations by mapping these synthetic images to the classes. In this manner, these re-labeled synthetic images can well describe the local classification mechanism of the deep network, and the learned student can provide a more intuitive explanation towards the predictions. Extensive experiments verify the effectiveness of our approach qualitatively and quantitatively.

[CV-65] Federated Learning with Label-Masking Distillation ACM-MM2023

链接: https://arxiv.org/abs/2409.13136
作者: Jianghu Lu,Shikun Li,Kexin Bao,Pengju Wang,Zhenxing Qian,Shiming Ge
关键词-EN: Federated learning, collaboratively train models, multiple local clients, privacy-preserving manner, manner to collaboratively
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM MM 2023

点击查看摘要

Abstract:Federated learning provides a privacy-preserving manner to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where due to the different user behavior of the client, label distributions between different clients are significantly different. When faced with such cases, most existing methods will lead to a suboptimal optimization due to the inadequate utilization of label distribution information in clients. Inspired by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning via perceiving the various label distributions of each client. We classify the labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The process of distillation masks out the predictions of majority labels from the global model, so that it can focus more on preserving the minority label knowledge of the client. A series of experiments show that the proposed approach can achieve state-of-the-art performance in various cases. Moreover, considering the limited resources of the clients, we propose a variant FedLMD-Tf that does not require an additional teacher, which outperforms previous lightweight approaches without increasing computational costs. Our code is available at this https URL.

[CV-66] BGDB: Bernoulli-Gaussian Decision Block with Improved Denoising Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2409.13116
作者: Chengkun Sun,Jinqian Pan,Russell Stevens Terry,Jiang Bian,Jie Xu
关键词-EN: constructing complex feature, enhance discriminative classifiers, complex feature spaces, central limit theorem, enhance discriminative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generative models can enhance discriminative classifiers by constructing complex feature spaces, thereby improving performance on intricate datasets. Conventional methods typically augment datasets with more detailed feature representations or increase dimensionality to make nonlinear data linearly separable. Utilizing a generative model solely for feature space processing falls short of unlocking its full potential within a classifier and typically lacks a solid theoretical foundation. We base our approach on a novel hypothesis: the probability information (logit) derived from a single model training can be used to generate the equivalent of multiple training sessions. Leveraging the central limit theorem, this synthesized probability information is anticipated to converge toward the true probability more accurately. To achieve this goal, we propose the Bernoulli-Gaussian Decision Block (BGDB), a novel module inspired by the Central Limit Theorem and the concept that the mean of multiple Bernoulli trials approximates the probability of success in a single trial. Specifically, we utilize Improved Denoising Diffusion Probabilistic Models (IDDPM) to model the probability of Bernoulli Trials. Our approach shifts the focus from reconstructing features to reconstructing logits, transforming the logit from a single iteration into logits analogous to those from multiple experiments. We provide the theoretical foundations of our approach through mathematical analysis and validate its effectiveness through experimental evaluation using various datasets for multiple imaging tasks, including both classification and segmentation.

[CV-67] Evolution and challenges of computer vision and deep learning technologies for analysing mixed construction and demolition waste

链接: https://arxiv.org/abs/2409.13112
作者: Adrian Langley,Matthew Lonergan,Tao Huang,Mostafa Rahimi Azghadi
关键词-EN: enhancing business returns, Improving the automatic, CDW, composition is crucial, business returns
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Improving the automatic and timely recognition of construction and demolition waste (CDW) composition is crucial for enhancing business returns, economic outcomes, and sustainability. Technologies like computer vision, artificial intelligence (AI), robotics, and internet of things (IoT) are increasingly integrated into waste processing to achieve these goals. While deep learning (DL) models show promise in recognising homogeneous CDW piles, few studies assess their performance with mixed, highly contaminated material in commercial settings. Drawing on extensive experience at a CDW materials recovery facility (MRF) in Sydney, Australia, we explore the challenges and opportunities in developing an advanced automated mixed CDW management system. We begin with an overview of the evolution of waste management in the construction industry, highlighting its environmental, economic, and societal impacts. We review various CDW analysis techniques, concluding that DL-based visual methods are the optimal solution. Additionally, we examine the progression of sensor and camera technologies for CDW analysis as well as the evolution of DL algorithms focused on object detection and material segmentation. We also discuss CDW datasets, their curation, and innovative methods for their creation. Finally, we share insights on CDW visual analysis, addressing technical and commercial challenges, research trends, and future directions for mixed CDW analysis. This paper aims to improve the efficiency of CDW management by providing valuable insights for ongoing and future research and development efforts in this critical sector.

[CV-68] UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation

链接: https://arxiv.org/abs/2409.13106
作者: Jinho Park,Se Young Chun,Mingoo Seok
关键词-EN: Data-driven visual-inertial odometry, Data-driven visual-inertial, autonomous robots, received highlights, crucial compartment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Data-driven visual-inertial odometry (VIO) has received highlights for its performance since VIOs are a crucial compartment in autonomous robots. However, their deployment on resource-constrained devices is non-trivial since large network parameters should be accommodated in the device memory. Furthermore, these networks may risk failure post-deployment due to environmental distribution shifts at test time. In light of this, we propose UL-VIO – an ultra-lightweight (1M) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. Specifically, we perform model compression to the network while preserving the low-level encoder part, including all BatchNorm parameters for resource-efficient test-time adaptation. It achieves 36X smaller network size than state-of-the-art with a minute increase in error – 1% on the KITTI dataset. For test-time adaptation, we propose to use the inertia-referred network outputs as pseudo labels and update the BatchNorm parameter for lightweight yet effective adaptation. To the best of our knowledge, this is the first work to perform noise-robust TTA on VIO. Experimental results on the KITTI, EuRoC, and Marulan datasets demonstrate the effectiveness of our resource-efficient adaptation method under diverse TTA scenarios with dynamic domain shifts.

[CV-69] ERIC: Estimating Rainfall with Commodity Doorbell Camera for Precision Residential Irrigation

链接: https://arxiv.org/abs/2409.13104
作者: Tian Liu,Liuyi Jin,Radu Stoleru,Amran Haroon,Charles Swanson,Kexin Feng
关键词-EN: nearby weather stations, adjust irrigation amounts, nearby weather, weather stations, stations to adjust
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: BuildSys 2024

点击查看摘要

Abstract:Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed a cost-effective irrigation system, dubbed ERIC, which employs machine learning models to estimate rainfall from commodity doorbell camera footage and optimizes irrigation schedules without human intervention. Specifically, we: a) designed novel visual and audio features with lightweight neural network models to infer rainfall from the camera at the edge, preserving user privacy; b) built a complete end-to-end irrigation system on Raspberry Pi 4, costing only 75. We deployed the system across five locations (collecting over 750 hours of video) with varying backgrounds and light conditions. Comprehensive evaluation validates that ERIC achieves state-of-the-art rainfall estimation performance (~ 5mm/day), saving 9,112 gallons/month of water, translating to 28.56/month in utility savings.

[CV-70] Interpretable Action Recognition on Hard to Classify Actions ECCV2024

链接: https://arxiv.org/abs/2409.13091
作者: Anastasia Anichenko,Frank Guerin,Andrew Gilbert
关键词-EN: investigate a human-like, model, human-like interpretable model, video understanding, human-like interpretable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, This manuscript has been accepted at the Human-inspired Computer Vision (HCV) ECCV 2024 Workshop. arXiv admin note: text overlap with arXiv:2107.05319

点击查看摘要

Abstract:We investigate a human-like interpretable model of video understanding. Humans recognise complex activities in video by recognising critical spatio-temporal relations among explicitly recognised objects and parts, for example, an object entering the aperture of a container. To mimic this we build on a model which uses positions of objects and hands, and their motions, to recognise the activity taking place. To improve this model we focussed on three of the most confused classes (for this model) and identified that the lack of 3D information was the major problem. To address this we extended our basic model by adding 3D awareness in two ways: (1) A state-of-the-art object detection model was fine-tuned to determine the difference between “Container” and “NotContainer” in order to integrate object shape information into the existing object features. (2) A state-of-the-art depth estimation model was used to extract depth values for individual objects and calculate depth relations to expand the existing relations used our interpretable model. These 3D extensions to our basic model were evaluated on a subset of three superficially similar “Putting” actions from the Something-Something-v2 dataset. The results showed that the container detector did not improve performance, but the addition of depth relations made a significant improvement to performance.

[CV-71] Real-time estimation of overt attention from dynamic features of the face using deep-learning

链接: https://arxiv.org/abs/2409.13084
作者: Aimar Silvan Ortubay,Lucas C. Parra,Jens Madsen
关键词-EN: focus during class, eye movements, movements, student eye movements, Inter-Subject Correlation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 3 figures

点击查看摘要

Abstract:Students often drift in and out of focus during class. Effective teachers recognize this and re-engage them when necessary. With the shift to remote learning, teachers have lost the visual feedback needed to adapt to varying student engagement. We propose using readily available front-facing video to infer attention levels based on movements of the eyes, head, and face. We train a deep learning model to predict a measure of attention based on overt eye movements. Specifically, we measure Inter-Subject Correlation of eye movements in ten-second intervals while students watch the same educational videos. In 3 different experiments (N=83) we show that the trained model predicts this objective metric of attention on unseen data with R^2 =0.38, and on unseen subjects with R^2 =0.26-0.30. The deep network relies mostly on a student’s eye movements, but to some extent also on movements of the brows, cheeks, and head. In contrast to Inter-Subject Correlation of the eyes, the model can estimate attentional engagement from individual students’ movements without needing reference data from an attentive group. This enables a much broader set of online applications. The solution is lightweight and can operate on the client side, which mitigates some of the privacy concerns associated with online attention monitoring.

[CV-72] Embedding Geometries of Contrastive Language-Image Pre-Training ECCV2024

链接: https://arxiv.org/abs/2409.13079
作者: Jason Chuan-Chih Chou,Nahid Alam
关键词-EN: InfoNCE loss, loss for contrastive, widely popular, popular for bridging, CLIP
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 - Beyond Euclidean Workshop

点击查看摘要

Abstract:Since the publication of CLIP, the approach of using InfoNCE loss for contrastive pre-training has become widely popular for bridging two or more modalities. Despite its wide adoption, CLIP’s original design choices of L2 normalization and cosine similarity logit have rarely been revisited. We have systematically experimented with alternative geometries and softmax logits for language-image pre-training and identified that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as more complicated hyperbolic alternative.

[CV-73] What does guidance do? A fine-grained analysis in a simple setting

链接: https://arxiv.org/abs/2409.13074
作者: Muthu Chidambaram,Khashayar Gatmiry,Sitan Chen,Holden Lee,Jianfeng Lu
关键词-EN: conditional likelihood raised, data distribution tilted, originally motivated, likelihood raised, guidance
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment. Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML) Cite as: arXiv:2409.13074 [cs.LG] (or arXiv:2409.13074v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13074 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-74] Cross-Chirality Palmprint Verification: Left is Right for the Right Palmprint

链接: https://arxiv.org/abs/2409.13056
作者: Chengrui Gao,Ziyuan Yang,Tiong-Sik Ng,Min Zhu,Andrew Beng Jin Teoh
关键词-EN: high discriminative power, prominent biometric authentication, user-friendly nature, recognition has emerged, prominent biometric
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Palmprint recognition has emerged as a prominent biometric authentication method, owing to its high discriminative power and user-friendly nature. This paper introduces a novel Cross-Chirality Palmprint Verification (CCPV) framework that challenges the conventional wisdom in traditional palmprint verification systems. Unlike existing methods that typically require storing both left and right palmprints, our approach enables verification using either palm while storing only one palmprint template. The core of our CCPV framework lies in a carefully designed matching rule. This rule involves flipping both the gallery and query palmprints and calculating the average distance between each pair as the final matching distance. This approach effectively reduces matching variance and enhances overall system robustness. We introduce a novel cross-chirality loss function to construct a discriminative and robust cross-chirality feature space. This loss enforces representation consistency across four palmprint variants: left, right, flipped left, and flipped right. The resulting compact feature space, coupled with the model’s enhanced discriminative representation capability, ensures robust performance across various scenarios. We conducted extensive experiments to validate the efficacy of our proposed method. The evaluation encompassed multiple public datasets and considered both closed-set and open-set settings. The results demonstrate the CCPV framework’s effectiveness and highlight its potential for real-world applications in palmprint authentication systems.

[CV-75] MGSO: Monocular Real-time Photometric SLAM with Efficient 3D Gaussian Splatting ICRA2025

链接: https://arxiv.org/abs/2409.13055
作者: Yan Song Hu,Nicolas Abboud,Muhammad Qasim Ali,Adam Srebrnjak Yang,Imad Elhajj,Daniel Asmar,Yuhao Chen,John S. Zelek
关键词-EN: SLAM, mapping is computationally, computationally challenging, resource-limited devices, Real-time SLAM
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper Contribution to the ICRA 2025 Conference. Currently being reviewed

点击查看摘要

Abstract:Real-time SLAM with dense 3D mapping is computationally challenging, especially on resource-limited devices. The recent development of 3D Gaussian Splatting (3DGS) offers a promising approach for real-time dense 3D reconstruction. However, existing 3DGS-based SLAM systems struggle to balance hardware simplicity, speed, and map quality. Most systems excel in one or two of the aforementioned aspects but rarely achieve all. A key issue is the difficulty of initializing 3D Gaussians while concurrently conducting SLAM. To address these challenges, we present Monocular GSO (MGSO), a novel real-time SLAM system that integrates photometric SLAM with 3DGS. Photometric SLAM provides dense structured point clouds for 3DGS initialization, accelerating optimization and producing more efficient maps with fewer Gaussians. As a result, experiments show that our system generates reconstructions with a balance of quality, memory efficiency, and speed that outperforms the state-of-the-art. Furthermore, our system achieves all results using RGB inputs. We evaluate the Replica, TUM-RGBD, and EuRoC datasets against current live dense reconstruction systems. Not only do we surpass contemporary systems, but experiments also show that we maintain our performance on laptop hardware, making it a practical solution for robotics, A/R, and other real-time applications.

[CV-76] ACE: Tumor-Aware Counterfactual Explanations

链接: https://arxiv.org/abs/2409.13045
作者: Eleonora Beatrice Rossi,Eleonora Lopez,Danilo Comminiello
关键词-EN: advanced diagnostic capabilities, significantly advanced diagnostic, diagnostic capabilities, enhancing both accuracy, accuracy and efficiency
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: The paper has been accepted at Italian Workshop on Neural Networks (WIRN) 2024

点击查看摘要

Abstract:The application of deep learning in medical imaging has significantly advanced diagnostic capabilities, enhancing both accuracy and efficiency. Despite these benefits, the lack of transparency in these AI models, often termed “black boxes,” raises concerns about their reliability in clinical settings. Explainable AI (XAI) aims to mitigate these concerns by developing methods that make AI decisions understandable and trustworthy. In this study, we propose Tumor Aware Counterfactual Explanations (TACE), a framework designed to generate reliable counterfactual explanations for medical images. Unlike existing methods, TACE focuses on modifying tumor-specific features without altering the overall organ structure, ensuring the faithfulness of the counterfactuals. We achieve this by including an additional step in the generation process which allows to modify only the region of interest (ROI), thus yielding more reliable counterfactuals as the rest of the organ remains unchanged. We evaluate our method on mammography images and brain MRI. We find that our method far exceeds existing state-of-the-art techniques in quality, faithfulness, and generation speed of counterfactuals. Indeed, more faithful explanations lead to a significant improvement in classification success rates, with a 10.69% increase for breast cancer and a 98.02% increase for brain tumors. The code of our work is available at this https URL.

[CV-77] DNI: Dilutional Noise Initialization for Diffusion Video Editing ECCV2024

链接: https://arxiv.org/abs/2409.13037
作者: Sunjae Yoon,Gwanhyeong Koo,Ji Woo Hong,Chang D. Yoo
关键词-EN: Text-based diffusion video, video editing systems, diffusion video editing, input video, Text-based diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 11 figures, ECCV 2024

点击查看摘要

Abstract:Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing such as style transfer and object overlay, while preserving the original structure of the input video. This limitation stems from an initial latent noise employed in diffusion video editing systems. The diffusion video editing systems prepare initial latent noise to edit by gradually infusing Gaussian noise onto the input video. However, we observed that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing such as motion change necessitating structural modifications. To this end, this paper proposes Dilutional Noise Initialization (DNI) framework which enables editing systems to perform precise and dynamic modification including non-rigid editing. DNI introduces a concept of `noise dilution’ which adds further noise to the latent noise in the region to be edited to soften the structural rigidity imposed by input video, resulting in more effective edits closer to the target prompt. Extensive experiments demonstrate the effectiveness of the DNI framework.

[CV-78] Across-Game Engagement Modelling via Few-Shot Learning ECCV2024

链接: https://arxiv.org/abs/2409.13002
作者: Kosmas Pinitas,Konstantinos Makantasis,Georgios N. Yannakakis
关键词-EN: learning artificial intelligence, involves learning artificial, maintain high performance, artificial intelligence, Domain generalisation involves
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 17 pages, accepted for publication at ECCV 2024 CV2 Workshop

点击查看摘要

Abstract:Domain generalisation involves learning artificial intelligence (AI) models that can maintain high performance across diverse domains within a specific task. In video games, for instance, such AI models can supposedly learn to detect player actions across different games. Despite recent advancements in AI, domain generalisation for modelling the users’ experience remains largely unexplored. While video games present unique challenges and opportunities for the analysis of user experience – due to their dynamic and rich contextual nature – modelling such experiences is limited by generally small datasets. As a result, conventional modelling methods often struggle to bridge the domain gap between users and games due to their reliance on large labelled training data and assumptions of common distributions of user experience. In this paper, we tackle this challenge by introducing a framework that decomposes the general domain-agnostic modelling of user experience into several domain-specific and game-dependent tasks that can be solved via few-shot learning. We test our framework on a variation of the publicly available GameVibe corpus, designed specifically to test a model’s ability to predict user engagement across different first-person shooter games. Our findings demonstrate the superior performance of few-shot learners over traditional modelling methods and thus showcase the potential of few-shot learning for robust experience modelling in video games and beyond.

[CV-79] A New People-Object Interaction Dataset and NVS Benchmarks

链接: https://arxiv.org/abs/2409.12980
作者: Shuai Guo,Houqiang Zhong,Qiuwen Wang,Ziyu Chen,Yijie Gao,Jiajing Yuan,Chenyu Zhang,Rong Xie,Li Song
关键词-EN: received increasing attention, increasing attention, received increasing, Recently, human-object interaction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, NVS in human-object interaction scenes has received increasing attention. Existing human-object interaction datasets mainly consist of static data with limited views, offering only RGB images or videos, mostly containing interactions between a single person and objects. Moreover, these datasets exhibit complexities in lighting environments, poor synchronization, and low resolution, hindering high-quality human-object interaction studies. In this paper, we introduce a new people-object interaction dataset that comprises 38 series of 30-view multi-person or single-person RGB-D video sequences, accompanied by camera parameters, foreground masks, SMPL models, some point clouds, and mesh files. Video sequences are captured by 30 Kinect Azures, uniformly surrounding the scene, each in 4K resolution 25 FPS, and lasting for 1 \sim 19 seconds. Meanwhile, we evaluate some SOTA NVS models on our dataset to establish the NVS benchmarks. We hope our work can inspire further research in humanobject interaction.

[CV-80] Semantic Meta-Split Learning: A TinyML Scheme for Few-Shot Wireless Image Classification

链接: https://arxiv.org/abs/2409.12978
作者: Eslam Eldeeb,Mohammad Shehab,Hirley Alves,Mohamed-Slim Alouini
关键词-EN: transmits significant information, emerging technology, transmits significant, significant information, SGO
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Semantic and goal-oriented (SGO) communication is an emerging technology that only transmits significant information for a given task. Semantic communication encounters many challenges, such as computational complexity at end users, availability of data, and privacy-preserving. This work presents a TinyML-based semantic communication framework for few-shot wireless image classification that integrates split-learning and meta-learning. We exploit split-learning to limit the computations performed by the end-users while ensuring privacy-preserving. In addition, meta-learning overcomes data availability concerns and speeds up training by utilizing similarly trained tasks. The proposed algorithm is tested using a data set of images of hand-written letters. In addition, we present an uncertainty analysis of the predictions using conformal prediction (CP) techniques. Simulation results show that the proposed Semantic-MSL outperforms conventional schemes by achieving 20 % gain on classification accuracy using fewer data points, yet less training energy consumption.

[CV-81] Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements Applications And Challenges

链接: https://arxiv.org/abs/2409.12977
作者: James E. Gallagher,Edward J. Oughton
关键词-EN: powerful tools supporting, tools supporting diverse, autonomous vehicles, infrastructure monitoring, environmental assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multispectral imaging and deep learning have emerged as powerful tools supporting diverse use cases from autonomous vehicles, to agriculture, infrastructure monitoring and environmental assessment. The combination of these technologies has led to significant advancements in object detection, classification, and segmentation tasks in the non-visible light spectrum. This paper considers 400 total papers, reviewing 200 in detail to provide an authoritative meta-review of multispectral imaging technologies, deep learning models, and their applications, considering the evolution and adaptation of You Only Look Once (YOLO) methods. Ground-based collection is the most prevalent approach, totaling 63% of the papers reviewed, although uncrewed aerial systems (UAS) for YOLO-multispectral applications have doubled since 2020. The most prevalent sensor fusion is Red-Green-Blue (RGB) with Long-Wave Infrared (LWIR), comprising 39% of the literature. YOLOv5 remains the most used variant for adaption to multispectral applications, consisting of 33% of all modified YOLO models reviewed. 58% of multispectral-YOLO research is being conducted in China, with broadly similar research quality to other countries (with a mean journal impact factor of 4.45 versus 4.36 for papers not originating from Chinese institutions). Future research needs to focus on (i) developing adaptive YOLO architectures capable of handling diverse spectral inputs that do not require extensive architectural modifications, (ii) exploring methods to generate large synthetic multispectral datasets, (iii) advancing multispectral YOLO transfer learning techniques to address dataset scarcity, and (iv) innovating fusion research with other sensor types beyond RGB and LWIR.

[CV-82] he Era of Foundation Models in Medical Imaging is Approaching : A Scoping Review of the Clinical Value of Large-Scale Generative AI Applications in Radiology

链接: https://arxiv.org/abs/2409.12973
作者: Inwoo Seo,Eunkyoung Bae,Joo-Young Jeon,Young-Sang Yoon,Jiho Cha
关键词-EN: Social problems stemming, Social problems, problems stemming, artificial intelligence, potential solution
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 25 pages,3 figures, 4 tables, submitted to NPJ imaging

点击查看摘要

Abstract:Social problems stemming from the shortage of radiologists are intensifying, and artificial intelligence is being highlighted as a potential solution. Recently emerging large-scale generative AI has expanded from large language models (LLMs) to multi-modal models, showing potential to revolutionize the entire process of medical imaging. However, comprehensive reviews on their development status and future challenges are currently lacking. This scoping review systematically organizes existing literature on the clinical value of large-scale generative AI applications by following PCC guidelines. A systematic search was conducted across four databases: PubMed, EMbase, IEEE-Xplore, and Google Scholar, and 15 studies meeting the inclusion/exclusion criteria set by the researchers were reviewed. Most of these studies focused on improving the efficiency of report generation in specific parts of the interpretation process or on translating reports to aid patient understanding, with the latest studies extending to AI applications performing direct interpretations. All studies were quantitatively evaluated by clinicians, with most utilizing LLMs and only three employing multi-modal models. Both LLMs and multi-modal models showed excellent results in specific areas, but none yet outperformed radiologists in diagnostic performance. Most studies utilized GPT, with few using models specialized for the medical imaging domain. This study provides insights into the current state and limitations of large-scale generative AI-based applications in the medical imaging field, offering foundational data and suggesting that the era of medical imaging foundation models is on the horizon, which may fundamentally transform clinical practice in the near future.

[CV-83] Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

链接: https://arxiv.org/abs/2409.12969
作者: Gracjan Góral,Alicja Ziarko,Michal Nauman,Maciej Wołczyk
关键词-EN: enables individuals, Visual perspective-taking, individuals to anticipate, anticipate the actions, Vision Language Models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots for testing VPT skills, and we use it to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find performance in object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at this https URL

[CV-84] DARDA: Domain-Aware Real-Time Dynamic Neural Network Adaptation

链接: https://arxiv.org/abs/2409.09753
作者: Shahriar Rifat,Jonathan Ashdown,Francesco Restuccia
关键词-EN: Deep Neural Networks, Test Time Adaptation, Test Time, Neural Networks, Deep Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test Time Adaptation (TTA) has emerged as a practical solution to mitigate the performance degradation of Deep Neural Networks (DNNs) in the presence of corruption/ noise affecting inputs. Existing approaches in TTA continuously adapt the DNN, leading to excessive resource consumption and performance degradation due to accumulation of error stemming from lack of supervision. In this work, we propose Domain-Aware Real-Time Dynamic Adaptation (DARDA) to address such issues. Our key approach is to proactively learn latent representations of some corruption types, each one associated with a sub-network state tailored to correctly classify inputs affected by that corruption. After deployment, DARDA adapts the DNN to previously unseen corruptions in an unsupervised fashion by (i) estimating the latent representation of the ongoing corruption; (ii) selecting the sub-network whose associated corruption is the closest in the latent space to the ongoing corruption; and (iii) adapting DNN state, so that its representation matches the ongoing corruption. This way, DARDA is more resource efficient and can swiftly adapt to new distributions caused by different corruptions without requiring a large variety of input data. Through experiments with two popular mobile edge devices - Raspberry Pi and NVIDIA Jetson Nano - we show that DARDA reduces energy consumption and average cache memory footprint respectively by 1.74x and 2.64x with respect to the state of the art, while increasing the performance by 10.4%, 5.7% and 4.4% on CIFAR-10, CIFAR-100 and TinyImagenet.

[CV-85] Improved Unet brain tumor image segmentation based on GSConv module and ECA attention mechanism

链接: https://arxiv.org/abs/2409.13626
作者: Qiyuan Tian,Zhuoyue Wang,Xiaoling Cui
关键词-EN: medical image segmentation, deep learning algorithm, learning algorithm based, improved U-Net model, medical image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages; Accepted by CONF-CDS 2024 conference already

点击查看摘要

Abstract:An improved model of medical image segmentation for brain tumor is discussed, which is a deep learning algorithm based on U-Net architecture. Based on the traditional U-Net, we introduce GSConv module and ECA attention mechanism to improve the performance of the model in medical image segmentation tasks. With these improvements, the new U-Net model is able to extract and utilize multi-scale features more efficiently while flexibly focusing on important channels, resulting in significantly improved segmentation results. During the experiment, the improved U-Net model is trained and evaluated systematically. By looking at the loss curves of the training set and the test set, we find that the loss values of both rapidly decline to the lowest point after the eighth epoch, and then gradually converge and stabilize. This shows that our model has good learning ability and generalization ability. In addition, by monitoring the change in the mean intersection ratio (mIoU), we can see that after the 35th epoch, the mIoU gradually approaches 0.8 and remains stable, which further validates the model. Compared with the traditional U-Net, the improved version based on GSConv module and ECA attention mechanism shows obvious advantages in segmentation effect. Especially in the processing of brain tumor image edges, the improved model can provide more accurate segmentation results. This achievement not only improves the accuracy of medical image analysis, but also provides more reliable technical support for clinical diagnosis.

[CV-86] Analyzing the Effect of k-Space Features in MRI Classification Models

链接: https://arxiv.org/abs/2409.13589
作者: Pascal Passigan,Vayd Ramkumar
关键词-EN: Artificial Intelligence, high-accuracy systems function, integration of Artificial, black boxes, transparent reasoning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) in medical diagnostics is often hindered by model opacity, where high-accuracy systems function as “black boxes” without transparent reasoning. This limitation is critical in clinical settings, where trust and reliability are paramount. To address this, we have developed an explainable AI methodology tailored for medical imaging. By employing a Convolutional Neural Network (CNN) that analyzes MRI scans across both image and frequency domains, we introduce a novel approach that incorporates Uniform Manifold Approximation and Projection UMAP] for the visualization of latent input embeddings. This approach not only enhances early training efficiency but also deepens our understanding of how additional features impact the model predictions, thereby increasing interpretability and supporting more accurate and intuitive diagnostic inferences

[CV-87] Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation?

链接: https://arxiv.org/abs/2409.13548
作者: Alexander Jaus,Simon Reiß,Jens Klesiek,Rainer Stiefelhagen
关键词-EN: datacentric track, false negative volume, model, larger datasets lead, enhance model accuracy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this work, we describe our approach to compete in the autoPET3 datacentric track. While conventional wisdom suggests that larger datasets lead to better model performance, recent studies indicate that excluding certain training samples can enhance model accuracy. We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics by producing a large number of false positives particularly for PSMA-PETs. We counteract this by removing the easiest samples from the training dataset as measured by the model loss before retraining from scratch. Using the proposed approach we manage to drive down the false negative volume and improve upon the baseline model in both false negative volume and dice score on the preliminary test set. Code and pre-trained models are available at this http URL.

[CV-88] Physics-Informed Latent Diffusion for Multimodal Brain MRI Synthesis

链接: https://arxiv.org/abs/2409.13532
作者: Sven Lüpke,Yousef Yeganeh,Ehsan Adeli,Nassir Navab,Azade Farshad
关键词-EN: representing multiple modalities, Recent advances, medical imaging, imaging have shown, shown promise
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in generative models for medical imaging have shown promise in representing multiple modalities. However, the variability in modality availability across datasets limits the general applicability of the synthetic data they produce. To address this, we present a novel physics-informed generative model capable of synthesizing a variable number of brain MRI modalities, including those not present in the original dataset. Our approach utilizes latent diffusion models and a two-step generative process: first, unobserved physical tissue property maps are synthesized using a latent diffusion model, and then these maps are combined with a physical signal model to generate the final MRI scan. Our experiments demonstrate the efficacy of this approach in generating unseen MR contrasts and preserving physical plausibility. Furthermore, we validate the distributions of generated tissue properties by comparing them to those measured in real brain tissue.

[CV-89] A Deep Learning Approach for Pixel-level Material Classification via Hyperspectral Imaging

链接: https://arxiv.org/abs/2409.13498
作者: Savvas Sifnaios,George Arvanitakis,Fotios K. Konstantinidis,Georgios Tsimiklis,Angelos Amditis,Panayiotis Frangos
关键词-EN: Recent advancements, impacted various domains, significantly impacted, computer vision, Recent
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 15 figures, 6 equations

点击查看摘要

Abstract:Recent advancements in computer vision, particularly in detection, segmentation, and classification, have significantly impacted various domains. However, these advancements are tied to RGB-based systems, which are insufficient for applications in industries like waste sorting, pharmaceuticals, and defense, where advanced object characterization beyond shape or color is necessary. Hyperspectral (HS) imaging, capturing both spectral and spatial information, addresses these limitations and offers advantages over conventional technologies such as X-ray fluorescence and Raman spectroscopy, particularly in terms of speed, cost, and safety. This study evaluates the potential of combining HS imaging with deep learning for material characterization. The research involves: i) designing an experimental setup with HS camera, conveyor, and controlled lighting; ii) generating a multi-object dataset of various plastics (HDPE, PET, PP, PS) with semi-automated mask generation and Raman spectroscopy-based labeling; and iii) developing a deep learning model trained on HS images for pixel-level material classification. The model achieved 99.94% classification accuracy, demonstrating robustness in color, size, and shape invariance, and effectively handling material overlap. Limitations, such as challenges with black objects, are also discussed. Extending computer vision beyond RGB to HS imaging proves feasible, overcoming major limitations of traditional methods and showing strong potential for future applications. Comments: 13 pages, 15 figures, 6 equations Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) ACMclasses: I.5; I.2.10 Cite as: arXiv:2409.13498 [eess.IV] (or arXiv:2409.13498v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2409.13498 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[CV-90] A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

链接: https://arxiv.org/abs/2409.13477
作者: Chinmay Rao,Matthias van Osch,Nicola Pezzotti,Jeroen de Bresser,Laurens Beljaards,Jakob Meineke,Elwin de Weerdt,Huangling Lu,Mariya Doneva,Marius Staring
关键词-EN: multiple MRI contrasts, undersampled subsequent contrast, MRI contrasts, multiple MRI, subsequent contrast
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Since multiple MRI contrasts of the same anatomy contain redundant information, one contrast can be used as a prior for guiding the reconstruction of an undersampled subsequent contrast. To this end, several learning-based guided reconstruction methods have been proposed. However, two key challenges remain - (a) the requirement of large paired training datasets and (b) the lack of intuitive understanding of the model’s internal representation and utilization of the shared information. We propose a modular two-stage approach for guided reconstruction, addressing these challenges. A content/style model of two-contrast image data is learned in a largely unpaired manner and is subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Based on this, incorporating prior information into the reconstruction reduces to simply replacing the aliased reconstruction content with clean content derived from the reference scan. We name this novel approach PnP-MUNIT. Various aspects like interpretability and convergence are explored via simulations. Furthermore, its practicality is demonstrated on the NYU fastMRI DICOM dataset and two in-house raw datasets, obtaining up to 32.6% more acceleration over learning-based non-guided reconstruction for a given SSIM. In a radiological task, PnP-MUNIT allowed 33.3% more acceleration over clinical reconstruction at diagnostic quality.

[CV-91] Classification of 4 types of White blood cell images

链接: https://arxiv.org/abs/2409.13442
作者: Rabia Asghar,Arslan Shaukat,Usman Akram,Rimsha Tariq
关键词-EN: white blood cells, blood cells, white blood, bacterial infections, blood
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human immune system contains white blood cells (WBC) that are good indicator of many diseases like bacterial infections, AIDS, cancer, spleen, etc. White blood cells have been sub classified into four types: monocytes, lymphocytes, eosinophils and neutrophils on the basis of their nucleus, shape and cytoplasm. Traditionally in laboratories, pathologists and hematologists analyze these blood cells through microscope and then classify them manually. This manual process takes more time and increases the chance of human error. Hence, there is a need to automate this process. In this paper, first we have used different CNN pre-train models such as ResNet-50, InceptionV3, VGG16 and MobileNetV2 to automatically classify the white blood cells. These pre-train models are applied on Kaggle dataset of microscopic images. Although we achieved reasonable accuracy ranging between 92 to 95%, still there is need to enhance the performance. Hence, inspired by these architectures, a framework has been proposed to automatically categorize the four kinds of white blood cells with increased accuracy. The aim is to develop a convolution neural network (CNN) based classification system with decent generalization ability. The proposed CNN model has been tested on white blood cells images from Kaggle and LISC datasets. Accuracy achieved is 99.57% and 98.67% for both datasets respectively. Our proposed convolutional neural network-based model provides competitive performance as compared to previous results reported in literature.

[CV-92] Longitudinal Segmentation of MS Lesions via Temporal Difference Weighting MICCAI2024

链接: https://arxiv.org/abs/2409.13416
作者: Maximilian Rokuss,Yannick Kirchhoff,Saikat Roy,Balint Kovacs,Constantin Ulrich,Tassilo Wald,Maximilian Zenk,Stefan Denner,Fabian Isensee,Philipp Vollmuth,Jens Kleesiek,Klaus Maier-Hein
关键词-EN: Multiple Sclerosis, monitoring disease progression, treatment efficacy, Accurate segmentation, longitudinal MRI scans
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MICCAI 2024 LDTM

点击查看摘要

Abstract:Accurate segmentation of Multiple Sclerosis (MS) lesions in longitudinal MRI scans is crucial for monitoring disease progression and treatment efficacy. Although changes across time are taken into account when assessing images in clinical practice, most existing deep learning methods treat scans from different timepoints separately. Among studies utilizing longitudinal images, a simple channel-wise concatenation is the primary albeit suboptimal method employed to integrate timepoints. We introduce a novel approach that explicitly incorporates temporal differences between baseline and follow-up scans through a unique architectural inductive bias called Difference Weighting Block. It merges features from two timepoints, emphasizing changes between scans. We achieve superior scores in lesion segmentation (Dice Score, Hausdorff distance) as well as lesion detection (lesion-level F_1 score) as compared to state-of-the-art longitudinal and single timepoint models across two datasets. Our code is made publicly available at this http URL.

[CV-93] MCICSAM: Monte Carlo-guided Interpolation Consistency Segment Anything Model for Semi-Supervised Prostate Zone Segmentation

链接: https://arxiv.org/abs/2409.13371
作者: Guantian Huang,Beibei Li,Xiaobing Fan,Aritrick Chatterjee,Cheng Wei,Shouliang Qi,Wei Qian,Dianning He
关键词-EN: treating prostate-related diseases, Accurate segmentation, prostate-related diseases, pivotal for diagnosing, diagnosing and treating
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Accurate segmentation of various regions within the prostate is pivotal for diagnosing and treating prostate-related diseases. However, the scarcity of labeled data, particularly in specialized medical fields like prostate imaging, poses a significant challenge. Segment Anything Model (SAM) is a new large model for natural image segmentation, but there are some challenges in medical imaging. In order to better utilize the powerful feature extraction capability of SAM as well as to address the problem of low data volume for medical image annotation, we use Low-Rank Adaptation (LoRA) and semi-supervised learning methods of Monte Carlo guided interpolation consistency (MCIC) to enhance the fine-tuned SAM. We propose Monte Carlo-guided Interpolation Consistency Segment Anything Model (MCICSAM) for application to semi-supervised learning based prostate region segmentation. In the unlabeled data section, MCIC performs two different interpolation transformations on the input data and incorporates Monte Carlo uncertainty analysis in the output, forcing the model to be consistent in its predictions. The consistency constraints imposed on these interpolated samples allow the model to fit the distribution of unlabeled data better, ultimately improving its performance in semi-supervised scenarios. We use Dice and Hausdorff Distance at 95th percentile (HD95) to validate model performance. MCICSAM yieldes Dice with 79.38% and 89.95%, along with improves HD95 values of 3.12 and 2.27 for transition zone and transition zone. At the same time MCICSAM demonstrates strong generalizability. This method is expected to bring new possibilities in the field of prostate image segmentation.

[CV-94] Understanding Stain Separation Improves Cross-Scanner Adenocarcinoma Segmentation with Joint Multi-Task Learning

链接: https://arxiv.org/abs/2409.13246
作者: Ho Heon Kim,Won Chan Jeong,Young Shin Ko,Young Jin Park
关键词-EN: made significant advances, image variability due, domain shift, tissue preparation, differences in organs
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Digital pathology has made significant advances in tumor diagnosis and segmentation, but image variability due to differences in organs, tissue preparation, and acquisition - known as domain shift - limits the effectiveness of current algorithms. The COSAS (Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation) challenge addresses this issue by improving the resilience of segmentation algorithms to domain shift, with Task 2 focusing on adenocarcinoma segmentation using a diverse dataset from six scanners, pushing the boundaries of clinical diagnostics. Our approach employs unsupervised learning through stain separation within a multi-task learning framework using a multi-decoder autoencoder. This model isolates stain matrix and stain density, allowing it to handle color variation and improve generalization across scanners. We further enhanced the robustness of the model with a mixture of stain augmentation techniques and used a U-net architecture for segmentation. The novelty of our method lies in the use of stain separation within a multi-task learning framework, which effectively disentangles histological structures from color variations. This approach shows promise for improving segmentation accuracy and generalization across different histopathological stains, paving the way for more reliable diagnostic tools in digital pathology.

[CV-95] Multiscale Encoder and Omni-Dimensional Dynamic Convolution Enrichment in nnU-Net for Brain Tumor Segmentation MICCAI2023

链接: https://arxiv.org/abs/2409.13229
作者: Sahaj K. Mistry,Sourav Saini,Aashray Gupta,Aayush Gupta,Sunny Rai,Vinit Jakhetiya,Ujjwal Baid,Sharath Chandra Guntuku
关键词-EN: Brain tumor segmentation, Brain tumor, computer-aided diagnosis, plays a crucial, crucial role
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 3 figures. Accepted at MICCAI 2023, to be published in Springer LNCS. GitHub: this https URL

点击查看摘要

Abstract:Brain tumor segmentation plays a crucial role in computer-aided diagnosis. This study introduces a novel segmentation algorithm utilizing a modified nnU-Net architecture. Within the nnU-Net architecture’s encoder section, we enhance conventional convolution layers by incorporating omni-dimensional dynamic convolution layers, resulting in improved feature representation. Simultaneously, we propose a multi-scale attention strategy that harnesses contemporary insights from various scales. Our model’s efficacy is demonstrated on diverse datasets from the BraTS-2023 challenge. Integrating omni-dimensional dynamic convolution (ODConv) layers and multi-scale features yields substantial improvement in the nnU-Net architecture’s performance across multiple tumor segmentation datasets. Remarkably, our proposed model attains good accuracy during validation for the BraTS Africa dataset. The ODconv source code along with full training code is available on GitHub.

[CV-96] Deep Learning based Optical Image Super-Resolution via Generative Diffusion Models for Layerwise in-situ LPBF Monitoring

链接: https://arxiv.org/abs/2409.13171
作者: Francis Ogoke,Sumesh Kalambettu Suresh,Jesse Adamczyk,Dan Bolintineanu,Anthony Garland,Michael Heiden,Amir Barati Farimani
关键词-EN: Powder Bed Fusion, Laser Powder Bed, Bed Fusion, Laser Powder, Powder Bed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stochastic formation of defects during Laser Powder Bed Fusion (L-PBF) negatively impacts its adoption for high-precision use cases. Optical monitoring techniques can be used to identify defects based on layer-wise imaging, but these methods are difficult to scale to high resolutions due to cost and memory constraints. Therefore, we implement generative deep learning models to link low-cost, low-resolution images of the build plate to detailed high-resolution optical images of the build plate, enabling cost-efficient process monitoring. To do so, a conditional latent probabilistic diffusion model is trained to produce realistic high-resolution images of the build plate from low-resolution webcam images, recovering the distribution of small-scale features and surface roughness. We first evaluate the performance of the model by analyzing the reconstruction quality of the generated images using peak-signal-to-noise-ratio (PSNR), structural similarity index measure (SSIM) and wavelet covariance metrics that describe the preservation of high-frequency information. Additionally, we design a framework based upon the Segment Anything foundation model to recreate the 3D morphology of the printed part and analyze the surface roughness of the reconstructed samples. Finally, we explore the zero-shot generalization capabilities of the implemented framework to other part geometries by creating synthetic low-resolution data.

[CV-97] GASA-UNet: Global Axial Self-Attention U-Net for 3D Medical Image Segmentation

链接: https://arxiv.org/abs/2409.13146
作者: Chengkun Sun,Russell Stevens Terry,Jiang Bian,Jie Xu
关键词-EN: ambiguous organ boundaries, Global Axial Self-Attention, crucial but challenging, Accurate segmentation, differentiation of pathological
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate segmentation of multiple organs and the differentiation of pathological tissues in medical imaging are crucial but challenging, especially for nuanced classifications and ambiguous organ boundaries. To tackle these challenges, we introduce GASA-UNet, a refined U-Net-like model featuring a novel Global Axial Self-Attention (GASA) block. This block processes image data as a 3D entity, with each 2D plane representing a different anatomical cross-section. Voxel features are defined within this spatial context, and a Multi-Head Self-Attention (MHSA) mechanism is utilized on extracted 1D patches to facilitate connections across these planes. Positional embeddings (PE) are incorporated into our attention framework, enriching voxel features with spatial context and enhancing tissue classification and organ edge delineation. Our model has demonstrated promising improvements in segmentation performance, particularly for smaller anatomical structures, as evidenced by enhanced Dice scores and Normalized Surface Dice (NSD) on three benchmark datasets, i.e., BTCV, AMOS, and KiTS23.

[CV-98] Personalized 2D Binary Patient Codes of Tissue Images and Immunogenomic Data Through Multimodal Self-Supervised Fusion

链接: https://arxiv.org/abs/2409.13115
作者: Areej Alsaafin,Abubakr Shafique,Saghir Alfasly,H.R.Tizhoosh
关键词-EN: offering promising avenues, enhancing patient care, artificial intelligence, offering promising, disease comprehension
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The field of medical diagnostics has witnessed a transformative convergence of artificial intelligence (AI) and healthcare data, offering promising avenues for enhancing patient care and disease comprehension. However, this integration of multimodal data, specifically histopathology whole slide images (WSIs) and genetic sequencing data, presents unique challenges due to modality disparities and the need for scalable computational solutions. This paper addresses the scarcity of multimodal solutions, primarily centered around unimodal data solutions, thus limiting the realization of the rich insights that can be derived from integrating images and genomic data. Here, we introduce MarbliX Multimodal Association and Retrieval with Binary Latent Indexed matriX,'' an innovative multimodal framework that integrates histopathology images with immunogenomic sequencing data, encapsulating them into a concise binary patient code, referred to as monogram.‘’ This binary representation facilitates the establishment of a comprehensive archive, enabling clinicians to match similar cases. The experimental results demonstrate the potential of MarbliX to empower healthcare professionals with in-depth insights, leading to more precise diagnoses, reduced variability, and expanded personalized treatment options, particularly in the context of cancer.

[CV-99] DenoMamba: A fused state-space model for low-dose CT denoising

链接: https://arxiv.org/abs/2409.13094
作者: Şaban Öztürk,Oğuz Can Duran,Tolga Çukur
关键词-EN: Low-dose computed tomography, lower potential risks, potential risks linked, Low-dose computed, advanced denoising algorithms
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) lower potential risks linked to radiation exposure while relying on advanced denoising algorithms to maintain diagnostic quality in reconstructed images. The reigning paradigm in LDCT denoising is based on neural network models that learn data-driven image priors to separate noise evoked by dose reduction from underlying tissue signals. Naturally, the fidelity of these priors depend on the model’s ability to capture the broad range of contextual features evident in CT images. Earlier convolutional neural networks (CNN) are highly adept at efficiently capturing short-range spatial context, but their limited receptive fields reduce sensitivity to interactions over longer distances. Although transformers based on self-attention mechanisms have recently been posed to increase sensitivity to long-range context, they can suffer from suboptimal performance and efficiency due to elevated model complexity, particularly for high-resolution CT images. For high-quality restoration of LDCT images, here we introduce DenoMamba, a novel denoising method based on state-space modeling (SSM), that efficiently captures short- and long-range context in medical images. Following an hourglass architecture with encoder-decoder stages, DenoMamba employs a spatial SSM module to encode spatial context and a novel channel SSM module equipped with a secondary gated convolution network to encode latent features of channel context at each stage. Feature maps from the two modules are then consolidated with low-level input features via a convolution fusion module (CFM). Comprehensive experiments on LDCT datasets with 25% and 10% dose reduction demonstrate that DenoMamba outperforms state-of-the-art denoisers with average improvements of 1.4dB PSNR, 1.1% SSIM, and 1.6% RMSE in recovered image quality.

[CV-100] DiffSSD: A Diffusion-Based Dataset For Speech Forensics ICASSP

链接: https://arxiv.org/abs/2409.13049
作者: Kratika Bhagtani,Amit Kumar Singh Yadav,Paolo Bestagini,Edward J. Delp
关键词-EN: synthetic speech, speech, synthetic, synthetic speech detectors, detecting synthetic speech
类目: Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
*备注: Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2025

点击查看摘要

Abstract:Diffusion-based speech generators are ubiquitous. These methods can generate very high quality synthetic speech and several recent incidents report their malicious use. To counter such misuse, synthetic speech detectors have been developed. Many of these detectors are trained on datasets which do not include diffusion-based synthesizers. In this paper, we demonstrate that existing detectors trained on one such dataset, ASVspoof2019, do not perform well in detecting synthetic speech from recent diffusion-based synthesizers. We propose the Diffusion-Based Synthetic Speech Dataset (DiffSSD), a dataset consisting of about 200 hours of labeled speech, including synthetic speech generated by 8 diffusion-based open-source and 2 commercial generators. We also examine the performance of existing synthetic speech detectors on DiffSSD in both closed-set and open-set scenarios. The results highlight the importance of this dataset in detecting synthetic speech generated from recent open-source and commercial speech generators.

[CV-101] AutoPET III Challenge: PET/CT Semantic Segmentation

链接: https://arxiv.org/abs/2409.13006
作者: Reza Safdari,Mohammad Koohi-Moghaddam,Kyongtae Tyler Bae
关键词-EN: AutoPET III challenge, two-stage deep learning-based, deep learning-based approach, III challenge, AutoPET III
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this study, we implemented a two-stage deep learning-based approach to segment lesions in PET/CT images for the AutoPET III challenge. The first stage utilized a DynUNet model for coarse segmentation, identifying broad regions of interest. The second stage refined this segmentation using an ensemble of SwinUNETR, SegResNet, and UNet models. Preprocessing involved resampling images to a common resolution and normalization, while data augmentation techniques such as affine transformations and intensity adjustments were applied to enhance model generalization. The dataset was split into 80% training and 20% validation, excluding healthy cases. This method leverages multi-stage segmentation and model ensembling to achieve precise lesion segmentation, aiming to improve robustness and overall performance.

[CV-102] Semi-overcomplete convolutional auto-encoder embedding as shape priors for deep vessel segmentation

链接: https://arxiv.org/abs/2409.13001
作者: Amine Sadikine,Bogdan Badic,Jean-Pierre Tasu,Vincent Noblet,Dimitris Visvikis,Pierre-Henri Conze
关键词-EN: medical image analysis, image analysis, recently experienced, experienced a widespread, widespread interest
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, conference

点击查看摘要

Abstract:The extraction of blood vessels has recently experienced a widespread interest in medical image analysis. Automatic vessel segmentation is highly desirable to guide clinicians in computer-assisted diagnosis, therapy or surgical planning. Despite a good ability to extract large anatomical structures, the capacity of U-Net inspired architectures to automatically delineate vascular systems remains a major issue, especially given the scarcity of existing datasets. In this paper, we present a novel approach that integrates into deep segmentation shape priors from a Semi-Overcomplete Convolutional Auto-Encoder (S-OCAE) embedding. Compared to standard Convolutional Auto-Encoders (CAE), it exploits an over-complete branch that projects data onto higher dimensions to better characterize tiny structures. Experiments on retinal and liver vessel extraction, respectively performed on publicly-available DRIVE and 3D-IRCADb datasets, highlight the effectiveness of our method compared to U-Net trained without and with shape priors from a traditional CAE.

机器学习

[LG-0] he Impact of Large Language Models in Academia: from Writing to Speaking

链接: https://arxiv.org/abs/2409.13686
作者: Mingmeng Geng,Caixi Chen,Yanru Wu,Dongping Chen,Yao Wan,Pan Zhou
关键词-EN: Large language models, increasingly impacting human, Large language, impacting human society, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Large language models (LLMs) are increasingly impacting human society, particularly in textual information. Based on more than 30,000 papers and 1,000 presentations from machine learning conferences, we examined and compared the words used in writing and speaking, representing the first large-scale investigating study of how LLMs influence the two main modes of verbal communication and expression within the same group of people. Our empirical results show that LLM-style words such as “significant” have been used more frequently in abstracts and oral presentations. The impact on speaking is beginning to emerge and is likely to grow in the future, calling attention to the implicit influence and ripple effect of LLMs on human society.

[LG-1] he FIX Benchmark: Extracting Features Interpretable to eXperts

链接: https://arxiv.org/abs/2409.13684
作者: Helen Jin,Shreya Havaldar,Chaehyeon Kim,Anton Xue,Weiqiu You,Helen Qu,Marco Gatti,Daniel A Hashimoto,Bhuvnesh Jain,Amin Madani,Masao Sako,Lyle Ungar,Eric Wong
关键词-EN: explain model predictions, model predictions, explain model, implicitly assume, features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we have developed feature interpretability objectives across diverse real-world settings and unified them into a single framework that is the FIX benchmark. We find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.

[LG-2] SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience

链接: https://arxiv.org/abs/2409.13678
作者: Elliot Chane-Sane,Joseph Amigo,Thomas Flayols,Ludovic Righetti,Nicolas Mansard
关键词-EN: limited sensory inputs, requiring navigation, sensory inputs, poses a significant, significant challenge
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CoRL 2024. Project website: this https URL

点击查看摘要

Abstract:Parkour poses a significant challenge for legged robots, requiring navigation through complex environments with agility and precision based on limited sensory inputs. In this work, we introduce a novel method for training end-to-end visual policies, from depth pixels to robot control commands, to achieve agile and safe quadruped locomotion. We formulate robot parkour as a constrained reinforcement learning (RL) problem designed to maximize the emergence of agile skills within the robot’s physical limits while ensuring safety. We first train a policy without vision using privileged information about the robot’s surroundings. We then generate experience from this privileged policy to warm-start a sample efficient off-policy RL algorithm from depth images. This allows the robot to adapt behaviors from this privileged experience to visual locomotion while circumventing the high computational costs of RL directly from pixels. We demonstrate the effectiveness of our method on a real Solo-12 robot, showcasing its capability to perform a variety of parkour skills such as walking, climbing, leaping, and crawling.

[LG-3] Recent Advances in Non-convex Smoothness Conditions and Applicability to Deep Linear Neural Networks

链接: https://arxiv.org/abs/2409.13672
作者: Vivak Patel,Christian Varner
关键词-EN: smooth optimization problems, optimization problems arising, convergence analyses, presence of non-convexity, non-convexity in smooth
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The presence of non-convexity in smooth optimization problems arising from deep learning have sparked new smoothness conditions in the literature and corresponding convergence analyses. We discuss these smoothness conditions, order them, provide conditions for determining whether they hold, and evaluate their applicability to training a deep linear neural network for binary classification.

[LG-4] A Generative Framework for Predictive Modeling of Multiple Chronic Conditions Using Graph Variational Autoencoder and Bandit-Optimized Graph Neural Network

链接: https://arxiv.org/abs/2409.13671
作者: Julian Carvajal Rico,Adel Alaeddini,Syed Hasib Akhter Faruqui,Susan P Fisher-Hoch,Joseph B Mccormick
关键词-EN: multiple chronic conditions, MCC significantly impacts, Predicting the emergence, impacts patient outcomes, significantly impacts patient
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting the emergence of multiple chronic conditions (MCC) is crucial for early intervention and personalized healthcare, as MCC significantly impacts patient outcomes and healthcare costs. Graph neural networks (GNNs) are effective methods for modeling complex graph data, such as those found in MCC. However, a significant challenge with GNNs is their reliance on an existing graph structure, which is not readily available for MCC. To address this challenge, we propose a novel generative framework for GNNs that constructs a representative underlying graph structure by utilizing the distribution of the data to enhance predictive analytics for MCC. Our framework employs a graph variational autoencoder (GVAE) to capture the complex relationships in patient data. This allows for a comprehensive understanding of individual health trajectories and facilitates the creation of diverse patient stochastic similarity graphs while preserving the original feature set. These variations of patient stochastic similarity graphs, generated from the GVAE decoder, are then processed by a GNN using a novel Laplacian regularization technique to refine the graph structure over time and improves the prediction accuracy of MCC. A contextual Bandit is designed to evaluate the stochastically generated graphs and identify the best-performing graph for the GNN model iteratively until model convergence. We validate the performance of the proposed contextual Bandit algorithm against \varepsilon -Greedy and multi-armed Bandit algorithms on a large cohort (n = 1,592) of patients with MCC. These advancements highlight the potential of the proposed approach to transform predictive healthcare analytics, enabling a more personalized and proactive approach to MCC management.

[LG-5] DiffFluid: Plain Diffusion Models are Effective Predictors of Flow Dynamics

链接: https://arxiv.org/abs/2409.13665
作者: Dongyu Luo,Jianyu Wu,Jing Wang,Hairun Xie,Xiangyu Yue,Shixiang Tang
关键词-EN: high Reynolds number, Reynolds number, Transformers are effective, high Reynolds, plain diffusion models
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:We showcase the plain diffusion models with Transformers are effective predictors of fluid dynamics under various working conditions, e.g., Darcy flow and high Reynolds number. Unlike traditional fluid dynamical solvers that depend on complex architectures to extract intricate correlations and learn underlying physical states, our approach formulates the prediction of flow dynamics as the image translation problem and accordingly leverage the plain diffusion model to tackle the problem. This reduction in model design complexity does not compromise its ability to capture complex physical states and geometric features of fluid dynamical equations, leading to high-precision solutions. In preliminary tests on various fluid-related benchmarks, our DiffFluid achieves consistent state-of-the-art performance, particularly in solving the Navier-Stokes equations in fluid dynamics, with a relative precision improvement of +44.8%. In addition, we achieved relative improvements of +14.0% and +11.3% in the Darcy flow equation and the airfoil problem with Euler’s equation, respectively. Code will be released at this https URL upon acceptance.

[LG-6] Analysis of Gene Regulatory Networks from Gene Expression Using Graph Neural Networks

链接: https://arxiv.org/abs/2409.13664
作者: Hakan T. Otal,Abdulhamit Subasi,Furkan Kurt,M. Abdullah Canbaz,Yasin Uzun
关键词-EN: understanding cellular processes, Unraveling the complexities, Graph Neural Networks, Gene Regulatory Networks, crucial for understanding
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Social and Information Networks (cs.SI)
*备注: 24 Pages, 6 Figures

点击查看摘要

Abstract:Unraveling the complexities of Gene Regulatory Networks (GRNs) is crucial for understanding cellular processes and disease mechanisms. Traditional computational methods often struggle with the dynamic nature of these networks. This study explores the use of Graph Neural Networks (GNNs), a powerful approach for modeling graph-structured data like GRNs. Utilizing a Graph Attention Network v2 (GATv2), our study presents a novel approach to the construction and interrogation of GRNs, informed by gene expression data and Boolean models derived from literature. The model’s adeptness in accurately predicting regulatory interactions and pinpointing key regulators is attributed to advanced attention mechanisms, a hallmark of the GNN framework. These insights suggest that GNNs are primed to revolutionize GRN analysis, addressing traditional limitations and offering richer biological insights. The success of GNNs, as highlighted by our model’s reliance on high-quality data, calls for enhanced data collection methods to sustain progress. The integration of GNNs in GRN research is set to pioneer developments in personalized medicine, drug discovery, and our grasp of biological systems, bolstered by the structural analysis of networks for improved node and edge prediction.

[LG-7] Adaptive Mixture Importance Sampling for Automated Ads Auction Tuning RECSYS’24

链接: https://arxiv.org/abs/2409.13655
作者: Yimeng Jia,Kaushal Paneri,Rong Huang,Kailash Singh Maurya,Pavan Mallapragada,Yifan Shi
关键词-EN: key performance indicators, large-scale recommender systems, optimizing key performance, paper introduces Adaptive, introduces Adaptive Mixture
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:This paper introduces Adaptive Mixture Importance Sampling (AMIS) as a novel approach for optimizing key performance indicators (KPIs) in large-scale recommender systems, such as online ad auctions. Traditional importance sampling (IS) methods face challenges in dynamic environments, particularly in navigating through complexities of multi-modal landscapes and avoiding entrapment in local optima for the optimization task. Instead of updating importance weights and mixing samples across iterations, as in canonical adaptive IS and multiple IS, our AMIS framework leverages a mixture distribution as the proposal distribution and dynamically adjusts both the mixture parameters and their mixing rates at each iteration, thereby enhancing search diversity and efficiency. Through extensive offline simulations, we demonstrate that AMIS significantly outperforms simple Gaussian Importance Sampling (GIS), particularly in noisy environments. Moreover, our approach is validated in real-world scenarios through online A/B experiments on a major search engine, where AMIS consistently identifies optimal tuning points that are more likely to be adopted as mainstream configurations. These findings indicate that AMIS enhances convergence in noisy environments, leading to more accurate and reliable decision-making in the context of importance sampling off-policy estimators. Comments: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24 Subjects: Machine Learning (cs.LG); Applications (stat.AP) MSC classes: 68T05, 65C05, 68Q87 ACMclasses: G.3; I.2.6; I.6.8 Cite as: arXiv:2409.13655 [cs.LG] (or arXiv:2409.13655v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13655 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-8] Neural filtering for Neural Network-based Models of Dynamic Systems

链接: https://arxiv.org/abs/2409.13654
作者: Parham Oveissi,Turibius Rozario,Ankit Goel
关键词-EN: complex nonlinear functions, neural, neural filter, modeling dynamic systems, neural networks
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:The application of neural networks in modeling dynamic systems has become prominent due to their ability to estimate complex nonlinear functions. Despite their effectiveness, neural networks face challenges in long-term predictions, where the prediction error diverges over time, thus degrading their accuracy. This paper presents a neural filter to enhance the accuracy of long-term state predictions of neural network-based models of dynamic systems. Motivated by the extended Kalman filter, the neural filter combines the neural network state predictions with the measurements from the physical system to improve the estimated state’s accuracy. The neural filter’s improvements in prediction accuracy are demonstrated through applications to four nonlinear dynamical systems. Numerical experiments show that the neural filter significantly improves prediction accuracy and bounds the state estimate covariance, outperforming the neural network predictions.

[LG-9] OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

链接: https://arxiv.org/abs/2409.13652
作者: Stephen Zhang,Vardan Papyan
关键词-EN: recent paradigm shift, found great success, prohibitively expensive costs, high memory consumption, large-scale foundation models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while has found great success in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. Without any retraining, OATS achieves state-of-the-art performance when compressing models by up to 60% on large language models such as Llama-3 and Phi-3 and vision transformers such as ViT and DINOv2 while delivering up to 1.37\times the CPU acceleration versus a model that was comparably pruned.

[LG-10] DP2-FedSAM: Enhancing Differentially Private Federated Learning Through Personalized Sharpness-Aware Minimization

链接: https://arxiv.org/abs/2409.13645
作者: Zhenxiao Zhang,Yuanxiong Guo,Yanmin Gong
关键词-EN: distributed machine learning, machine learning approach, Federated learning, private federated learning, differentially private federated
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning approach that allows multiple clients to collaboratively train a model without sharing their raw data. To prevent sensitive information from being inferred through the model updates shared in FL, differentially private federated learning (DPFL) has been proposed. DPFL ensures formal and rigorous privacy protection in FL by clipping and adding random noise to the shared model updates. However, the existing DPFL methods often result in severe model utility degradation, especially in settings with data heterogeneity. To enhance model utility, we propose a novel DPFL method named DP ^2 -FedSAM: Differentially Private and Personalized Federated Learning with Sharpness-Aware Minimization. DP ^2 -FedSAM leverages personalized partial model-sharing and sharpness-aware minimization optimizer to mitigate the adverse impact of noise addition and clipping, thereby significantly improving model utility without sacrificing privacy. From a theoretical perspective, we provide a rigorous theoretical analysis of the privacy and convergence guarantees of our proposed method. To evaluate the effectiveness of DP ^2 -FedSAM, we conduct extensive evaluations based on common benchmark datasets. Our results verify that our method improves the privacy-utility trade-off compared to the existing DPFL methods, particularly in heterogeneous data settings.

[LG-11] Non-overlapping Schwarz-type Domain Decomposition Method for Physics and Equality Constrained Artificial Neural Networks

链接: https://arxiv.org/abs/2409.13644
作者: Qifeng Hu,Shamsulhaq Basir,Inanc Senocak
关键词-EN: Schwarz-type domain decomposition, Schwarz-type domain, physics-informed machine learning, decomposition method employing, domain decomposition method
类目: Machine Learning (cs.LG)
*备注: 44 pages, 17 figures

点击查看摘要

Abstract:We introduce a non-overlapping, Schwarz-type domain decomposition method employing a generalized interface condition, tailored for physics-informed machine learning of partial differential equations (PDEs) in both forward and inverse scenarios. Our method utilizes physics and equality constrained artificial neural networks (PECANN) in each subdomain. Diverging from the original PECANN method, which uses initial and boundary conditions to constrain the PDEs alone, our method jointly employs both the boundary conditions and PDEs to constrain a specially formulated generalized interface loss function for each subdomain. This modification enhances the learning of subdomain-specific interface parameters, while delaying information exchange between neighboring subdomains, and thereby significantly reduces communication overhead. By utilizing an augmented Lagrangian method with a conditionally adaptive update strategy, the constrained optimization problem in each subdomain is transformed into a dual unconstrained problem. This approach enables neural network training without the need for ad-hoc tuning of model parameters. We demonstrate the generalization ability and robust parallel performance of our method across a range of forward and inverse problems, with solid parallel scaling performance up to 32 processes using the Message Passing Interface model. A key strength of our approach is its capability to solve both Laplace’s and Helmholtz equations with multi-scale solutions within a unified framework, highlighting its broad applicability and efficiency.

[LG-12] Benchmarking Reliability of Deep Learning Models for Pathological Gait Classification ALT

链接: https://arxiv.org/abs/2409.13643
作者: Abhishek Jaiswal,Nisheeth Srivastava
关键词-EN: important open problem, early diagnosis, open problem, important open, diagnosis and treatment
类目: Machine Learning (cs.LG)
*备注: 23 pages, Accepted in Machine Learning for Healthcare(MLHC) 2024, JMLR Volume 252

点击查看摘要

Abstract:Early detection of neurodegenerative disorders is an important open problem, since early diagnosis and treatment may yield a better prognosis. Researchers have recently sought to leverage advances in machine learning algorithms to detect symptoms of altered gait, possibly corresponding to the emergence of neurodegenerative etiologies. However, while several claims of positive and accurate detection have been made in the recent literature, using a variety of sensors and algorithms, solutions are far from being realized in practice. This paper analyzes existing approaches to identify gaps inhibiting translation. Using a set of experiments across three Kinect-simulated and one real Parkinson’s patient datasets, we highlight possible sources of errors and generalization failures in these approaches. Based on these observations, we propose our strong baseline called Asynchronous Multi-Stream Graph Convolutional Network (AMS-GCN) that can reliably differentiate multiple categories of pathological gaits across datasets.

[LG-13] ransformers in Uniform TC0

链接: https://arxiv.org/abs/2409.13629
作者: David Chiang
关键词-EN: average-hard attention transformers, circuit complexity class, Previous work, attention transformers, softmax-attention transformers
类目: Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Previous work has shown that the languages recognized by average-hard attention transformers (AHATs) and softmax-attention transformers (SMATs) are within the circuit complexity class TC ^0 . However, these results assume limited-precision arithmetic: using floating-point numbers with O(log n) bits (where n is the length of the input string), Strobl showed that AHATs can be approximated in L-uniform TC ^0 , and Merrill and Sabharwal showed that SMATs can be approximated in DLOGTIME-uniform TC ^0 . Here, we improve these results, showing that AHATs with no approximation, SMATs with O(poly(n)) bits of floating-point precision, and SMATs with at most 2^-O(poly(n)) absolute error are all in DLOGTIME-uniform TC ^0 .

[LG-14] Beauty Beyond Words: Explainable Beauty Product Recommendations Using Ingredient-Based Product Attributes

链接: https://arxiv.org/abs/2409.13628
作者: Siliang Liu,Rahul Suresh,Amin Banitalebi-Dehkordi
关键词-EN: Accurate attribute, trust with customers, Accurate attribute extraction, building trust, Accurate
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 18th ACM Conference on Recommender Systems, Workshop on Strategic and Utility-aware REcommendation

点击查看摘要

Abstract:Accurate attribute extraction is critical for beauty product recommendations and building trust with customers. This remains an open problem, as existing solutions are often unreliable and incomplete. We present a system to extract beauty-specific attributes using end-to-end supervised learning based on beauty product ingredients. A key insight to our system is a novel energy-based implicit model architecture. We show that this implicit model architecture offers significant benefits in terms of accuracy, explainability, robustness, and flexibility. Furthermore, our implicit model can be easily fine-tuned to incorporate additional attributes as they become available, making it more useful in real-world applications. We validate our model on a major e-commerce skincare product catalog dataset and demonstrate its effectiveness. Finally, we showcase how ingredient-based attribute extraction contributes to enhancing the explainability of beauty recommendations.

[LG-15] owards Child-Inclusive Clinical Video Understanding for Autism Spectrum Disorder

链接: https://arxiv.org/abs/2409.13606
作者: Aditya Kommineni,Digbalay Bose,Tiantian Feng,So Hyun Kim,Helen Tager-Flusberg,Somer Bishop,Catherine Lord,Sudarsana Kadiri,Shrikanth Narayanan
关键词-EN: Autism Spectrum Disorder, encompassing complex verbal, Autism Spectrum, Spectrum Disorder, encompassing complex
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Clinical videos in the context of Autism Spectrum Disorder are often long-form interactions between children and caregivers/clinical professionals, encompassing complex verbal and non-verbal behaviors. Objective analyses of these videos could provide clinicians and researchers with nuanced insights into the behavior of children with Autism Spectrum Disorder. Manually coding these videos is a time-consuming task and requires a high level of domain expertise. Hence, the ability to capture these interactions computationally can augment the manual effort and enable supporting the diagnostic procedure. In this work, we investigate the use of foundation models across three modalities: speech, video, and text, to analyse child-focused interaction sessions. We propose a unified methodology to combine multiple modalities by using large language models as reasoning agents. We evaluate their performance on two tasks with different information granularity: activity recognition and abnormal behavior detection. We find that the proposed multimodal pipeline provides robustness to modality-specific limitations and improves performance on the clinical video analysis compared to unimodal settings.

[LG-16] Prithvi WxC: Foundation Model for Weather and Climate

链接: https://arxiv.org/abs/2409.13598
作者: Johannes Schmude,Sujit Roy,Will Trojak,Johannes Jakubik,Daniel Salles Civitarese,Shraddha Singh,Julian Kuehnert,Kumar Ankur,Aman Gupta,Christopher E Phillips,Romeo Kienzler,Daniela Szwarcman,Vishal Gaur,Rajat Shinde,Rohit Lal,Arlindo Da Silva,Jorge Luis Guevara Diaz,Anne Jones,Simon Pfreundschuh,Amy Lin,Aditi Sheshadri,Udaysankar Nair,Valentine Anantharaj,Hendrik Hamann,Campbell Watson,Manil Maskey,Tsengdar J Lee,Juan Bernabe Moreno,Rahul Ramachandran
关键词-EN: traditional numerical weather, numerical weather prediction, HPC systems, running on HPC, prediction models running
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Triggered by the realization that AI emulators can rival the performance of traditional numerical weather prediction models running on HPC systems, there is now an increasing number of large AI models that address use cases such as forecasting, downscaling, or nowcasting. While the parallel developments in the AI literature focus on foundation models – models that can be effectively tuned to address multiple, different use cases – the developments on the weather and climate side largely focus on single-use cases with particular emphasis on mid-range forecasting. We close this gap by introducing Prithvi WxC, a 2.3 billion parameter foundation model developed using 160 variables from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Prithvi WxC employs an encoder-decoder-based architecture, incorporating concepts from various recent transformer models to effectively capture both regional and global dependencies in the input data. The model has been designed to accommodate large token counts to model weather phenomena in different topologies at fine resolutions. Furthermore, it is trained with a mixed objective that combines the paradigms of masked reconstruction with forecasting. We test the model on a set of challenging downstream tasks namely: Autoregressive rollout forecasting, Downscaling, Gravity wave flux parameterization, and Extreme events estimation. The pretrained model with 2.3 billion parameters, along with the associated fine-tuning workflows, has been publicly released as an open-source contribution via Hugging Face.

[LG-17] Neurosymbolic Conformal Classification

链接: https://arxiv.org/abs/2409.13585
作者: Arthur Ledaguenel,Céline Hudelot,Mostepha Khouadjia
关键词-EN: improvement of Machine, driven by Deep, Machine Learning, Deep Learning, drastic improvement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 0 figures. arXiv admin note: text overlap with arXiv:2404.08404

点击查看摘要

Abstract:The last decades have seen a drastic improvement of Machine Learning (ML), mainly driven by Deep Learning (DL). However, despite the resounding successes of ML in many domains, the impossibility to provide guarantees of conformity and the fragility of ML systems (faced with distribution shifts, adversarial attacks, etc.) have prevented the design of trustworthy AI systems. Several research paths have been investigated to mitigate this fragility and provide some guarantees regarding the behavior of ML systems, among which are neurosymbolic AI and conformal prediction. Neurosymbolic artificial intelligence is a growing field of research aiming to combine neural network learning capabilities with the reasoning abilities of symbolic systems. One of the objective of this hybridization can be to provide theoritical guarantees that the output of the system will comply with some prior knowledge. Conformal prediction is a set of techniques that enable to take into account the uncertainty of ML systems by transforming the unique prediction into a set of predictions, called a confidence set. Interestingly, this comes with statistical guarantees regarding the presence of the true label inside the confidence set. Both approaches are distribution-free and model-agnostic. In this paper, we see how these two approaches can complement one another. We introduce several neurosymbolic conformal prediction techniques and explore their different characteristics (size of confidence sets, computational complexity, etc.).

[LG-18] Deep Learning and Machine Learning Advancing Big Data Analytics and Management: Tensorflow Pretrained Models

链接: https://arxiv.org/abs/2409.13566
作者: Keyu Chen,Ziqian Bi,Qian Niu,Junyu Liu,Benji Peng,Sen Zhang,Ming Liu,Ming Li,Xuanhe Pan,Jiawei Xu,Jinlang Wang,Pohsun Feng
关键词-EN: providing detailed guidance, providing detailed, object detection, application of TensorFlow, detailed guidance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This book contains 148 pages and 7 figures

点击查看摘要

Abstract:This book focuses on the application of TensorFlow pre-trained models in deep learning, providing detailed guidance on effectively using these models for tasks such as image classification and object detection. It covers practical implementations of modern architectures like ResNet, MobileNet, and EfficientNet, demonstrating the power of transfer learning through real-world examples and experiments. The book compares linear probing and model fine-tuning, offering visualizations using techniques such as PCA, t-SNE, and UMAP to help readers intuitively understand the impact of different approaches. Designed for beginners to advanced users, this book includes complete example code and step-by-step instructions, enabling readers to quickly master how to leverage pre-trained models to improve performance in practical scenarios. By blending theoretical insights with hands-on practice, this book equips readers with the knowledge to confidently tackle various deep learning challenges.

[LG-19] A preliminary study on continual learning in computer vision using Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2409.13550
作者: Alessandro Cacciatore,Valerio Morelli,Federica Paganica,Emanuele Frontoni,Lucia Migliorelli,Daniele Berardini
关键词-EN: Deep learning, multi-layer perceptrons, long been dominated, dominated by multi-layer, demonstrated superiority
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has long been dominated by multi-layer perceptrons (MLPs), which have demonstrated superiority over other optimizable models in various domains. Recently, a new alternative to MLPs has emerged - Kolmogorov-Arnold Networks (KAN)- which are based on a fundamentally different mathematical framework. According to their authors, KANs address several major issues in MLPs, such as catastrophic forgetting in continual learning scenarios. However, this claim has only been supported by results from a regression task on a toy 1D dataset. In this paper, we extend the investigation by evaluating the performance of KANs in continual learning tasks within computer vision, specifically using the MNIST datasets. To this end, we conduct a structured analysis of the behavior of MLPs and two KAN-based models in a class-incremental learning scenario, ensuring that the architectures involved have the same number of trainable parameters. Our results demonstrate that an efficient version of KAN outperforms both traditional MLPs and the original KAN implementation. We further analyze the influence of hyperparameters in MLPs and KANs, as well as the impact of certain trainable parameters in KANs, such as bias and scale weights. Additionally, we provide a preliminary investigation of recent KAN-based convolutional networks and compare their performance with that of traditional convolutional neural networks. Our codes can be found at this https URL.

[LG-20] Certified Adversarial Robustness via Partition-based Randomized Smoothing

链接: https://arxiv.org/abs/2409.13546
作者: Hossein Goli,Farzan Farnia
关键词-EN: additive Gaussian noise, Gaussian noise, classifiers requires robustness, requires robustness certificates, network classifiers requires
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A reliable application of deep neural network classifiers requires robustness certificates against adversarial perturbations. Gaussian smoothing is a widely analyzed approach to certifying robustness against norm-bounded perturbations, where the certified prediction radius depends on the variance of the Gaussian noise and the confidence level of the neural net’s prediction under the additive Gaussian noise. However, in application to high-dimensional image datasets, the certified radius of the plain Gaussian smoothing could be relatively small, since Gaussian noise with high variances can significantly harm the visibility of an image. In this work, we propose the Pixel Partitioning-based Randomized Smoothing (PPRS) methodology to boost the neural net’s confidence score and thus the robustness radius of the certified prediction. We demonstrate that the proposed PPRS algorithm improves the visibility of the images under additive Gaussian noise. We discuss the numerical results of applying PPRS to standard computer vision datasets and neural network architectures. Our empirical findings indicate a considerable improvement in the certified accuracy and stability of the prediction model to the additive Gaussian noise in randomized smoothing.

[LG-21] Graph Similarity Regularized Softmax for Semi-Supervised Node Classification

链接: https://arxiv.org/abs/2409.13544
作者: Yiming Yang,Jun Liu,Wei Wan
关键词-EN: Graph Neural Networks, Neural Networks, powerful deep learning, deep learning models, learning models designed
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are powerful deep learning models designed for graph-structured data, demonstrating effectiveness across a wide range of applications.The softmax function is the most commonly used classifier for semi-supervised node classification. However, the softmax function lacks spatial information of the graph structure. In this paper, we propose a graph similarity regularized softmax for GNNs in semi-supervised node classification. By incorporating non-local total variation (TV) regularization into the softmax activation function, we can more effectively capture the spatial information inherent in graphs. The weights in the non-local gradient and divergence operators are determined based on the graph’s adjacency matrix. We apply the proposed method into the architecture of GCN and GraphSAGE, testing them on citation and webpage linking datasets, respectively. Numerical experiments demonstrate its good performance in node classification and generalization capabilities. These results indicate that the graph similarity regularized softmax is effective on both assortative and disassortative graphs.

[LG-22] First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge

链接: https://arxiv.org/abs/2409.13538
作者: Yingzhe Peng,Yixiao Yuan,Zitian Ao,Huapeng Zhou,Kangqi Wang,Qipeng Zhu,Xu Yang
关键词-EN: Video Question Answering, Multiple-choice Video Question, Perception Test Challenge, Question Answering, Multiple-choice Video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this report, we present our first-place solution to the Multiple-choice Video Question Answering (QA) track of The Second Perception Test Challenge. This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content. To address this challenge, we leveraged the powerful QwenVL2 (7B) model and fine-tune it on the provided training set. Additionally, we employed model ensemble strategies and Test Time Augmentation to boost performance. Through continuous optimization, our approach achieved a Top-1 Accuracy of 0.7647 on the leaderboard.

[LG-23] Using High-Level Patterns to Estimate How Humans Predict a Robot will Behave

链接: https://arxiv.org/abs/2409.13533
作者: Sagar Parekh,Lauren Bramblett,Nicola Bezzo,Dylan P. Losey
关键词-EN: autonomous car, human, car, robot, behavior
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A human interacting with a robot often forms predictions of what the robot will do next. For instance, based on the recent behavior of an autonomous car, a nearby human driver might predict that the car is going to remain in the same lane. It is important for the robot to understand the human’s prediction for safe and seamless interaction: e.g., if the autonomous car knows the human thinks it is not merging – but the autonomous car actually intends to merge – then the car can adjust its behavior to prevent an accident. Prior works typically assume that humans make precise predictions of robot behavior. However, recent research on human-human prediction suggests the opposite: humans tend to approximate other agents by predicting their high-level behaviors. We apply this finding to develop a second-order theory of mind approach that enables robots to estimate how humans predict they will behave. To extract these high-level predictions directly from data, we embed the recent human and robot trajectories into a discrete latent space. Each element of this latent space captures a different type of behavior (e.g., merging in front of the human, remaining in the same lane) and decodes into a vector field across the state space that is consistent with the underlying behavior type. We hypothesize that our resulting high-level and course predictions of robot behavior will correspond to actual human predictions. We provide initial evidence in support of this hypothesis through a proof-of-concept user study.

[LG-24] owards Long-Context Time Series Foundation Models

链接: https://arxiv.org/abs/2409.13530
作者: Nina Żukowska,Mononito Goswami,Michał Wiliński,Willa Potosnak,Artur Dubrawski
关键词-EN: shown impressive performance, Time series foundation, Time series, variety of tasks, zero-shot settings
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series foundation models have shown impressive performance on a variety of tasks, across a wide range of domains, even in zero-shot settings. However, most of these models are designed to handle short univariate time series as an input. This limits their practical use, especially in domains such as healthcare with copious amounts of long and multivariate data with strong temporal and intra-variate dependencies. Our study bridges this gap by cataloging and systematically comparing various context expansion techniques from both language and time series domains, and introducing a novel compressive memory mechanism to allow encoder-only TSFMs to effectively model intra-variate dependencies. We demonstrate the benefits of our approach by imbuing MOMENT, a recent family of multi-task time series foundation models, with the multivariate context.

[LG-25] SatFed: A Resource-Efficient LEO Satellite-Assisted Heterogeneous Federated Learning Framework

链接: https://arxiv.org/abs/2409.13503
作者: Yuxin Zhang,Zheng Lin,Zhe Chen,Zihan Fang,Wenjun Zhu,Xianhao Chen,Jin Zhao,Yue Gao
关键词-EN: Traditional federated learning, congestion significantly hinder, frameworks rely heavily, hinder model convergence, increasing bandwidth congestion
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 12 figures

点击查看摘要

Abstract:Traditional federated learning (FL) frameworks rely heavily on terrestrial networks, where coverage limitations and increasing bandwidth congestion significantly hinder model convergence. Fortunately, the advancement of low-Earth orbit (LEO) satellite networks offers promising new communication avenues to augment traditional terrestrial FL. Despite this potential, the limited satellite-ground communication bandwidth and the heterogeneous operating environments of ground devices-including variations in data, bandwidth, and computing power-pose substantial challenges for effective and robust satellite-assisted FL. To address these challenges, we propose SatFed, a resource-efficient satellite-assisted heterogeneous FL framework. SatFed implements freshness-based model prioritization queues to optimize the use of highly constrained satellite-ground bandwidth, ensuring the transmission of the most critical models. Additionally, a multigraph is constructed to capture real-time heterogeneous relationships between devices, including data distribution, terrestrial bandwidth, and computing capability. This multigraph enables SatFed to aggregate satellite-transmitted models into peer guidance, enhancing local training in heterogeneous environments. Extensive experiments with real-world LEO satellite networks demonstrate that SatFed achieves superior performance and robustness compared to state-of-the-art benchmarks.

[LG-26] Invertible ResNets for Inverse Imaging Problems: Competitive Performance with Provable Regularization Properties

链接: https://arxiv.org/abs/2409.13482
作者: Clemens Arndt,Judith Nickel
关键词-EN: solving inverse problems, inverse problems, demonstrated remarkable performance, Learning-based methods, demonstrated remarkable
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning-based methods have demonstrated remarkable performance in solving inverse problems, particularly in image reconstruction tasks. Despite their success, these approaches often lack theoretical guarantees, which are crucial in sensitive applications such as medical imaging. Recent works by Arndt et al (2023 Inverse Problems 39 125018, 2024 Inverse Problems 40 045021) addressed this gap by analyzing a data-driven reconstruction method based on invertible residual networks (iResNets). They revealed that, under reasonable assumptions, this approach constitutes a convergent regularization scheme. However, the performance of the reconstruction method was only validated on academic toy problems and small-scale iResNet architectures. In this work, we address this gap by evaluating the performance of iResNets on two real-world imaging tasks: a linear blurring operator and a nonlinear diffusion operator. To do so, we extend some of the theoretical results from Arndt et al to encompass nonlinear inverse problems and offer insights for the design of large-scale performant iResNet architectures. Through numerical experiments, we compare the performance of our iResNet models against state-of-the-art neural networks, confirming their efficacy. Additionally, we numerically investigate the theoretical guarantees of this approach and demonstrate how the invertibility of the network enables a deeper analysis of the learned forward operator and its learned regularization.

[LG-27] Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models

链接: https://arxiv.org/abs/2409.13474
作者: Anmol Mekala,Vineeth Dorna,Shreya Dubey,Abhishek Lalwani,David Koleczek,Mukund Rungta,Sadid Hasan,Elita Lobo
关键词-EN: Machine unlearning aims, specific training data, forget set, Large Language Models, Machine unlearning
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning aims to efficiently eliminate the influence of specific training data, known as the forget set, from the model. However, existing unlearning methods for Large Language Models (LLMs) face a critical challenge: they rely solely on negative feedback to suppress responses related to the forget set, which often results in nonsensical or inconsistent outputs, diminishing model utility and posing potential privacy risks. To address this limitation, we propose a novel approach called Alternate Preference Optimization (AltPO), which combines negative feedback with in-domain positive feedback on the forget set. Additionally, we introduce new evaluation metrics to assess the quality of responses related to the forget set. Extensive experiments show that our approach not only enables effective unlearning but also avoids undesirable model behaviors while maintaining overall model performance.

[LG-28] Flotta: a Secure and Flexible Spark-inspired Federated Learning Framework

链接: https://arxiv.org/abs/2409.13473
作者: Claudio Bonesana,Daniele Malpetti,Sandra Mitrović,Francesca Mangili,Laura Azzimonti
关键词-EN: Federated Learning framework, sensitive data distributed, contexts requiring high, requiring high levels, train machine learning
类目: Machine Learning (cs.LG)
*备注: Accepted for publication at FLTA 2024: The 2nd IEEE International Conference on Federated Learning Technologies and Applications

点击查看摘要

Abstract:We present Flotta, a Federated Learning framework designed to train machine learning models on sensitive data distributed across a multi-party consortium conducting research in contexts requiring high levels of security, such as the biomedical field. Flotta is a Python package, inspired in several aspects by Apache Spark, which provides both flexibility and security and allows conducting research using solely machines internal to the consortium. In this paper, we describe the main components of the framework together with a practical use case to illustrate the framework’s capabilities and highlight its security, flexibility and user-friendliness.

[LG-29] Deterministic versus stochastic dynamical classifiers: opposing random adversarial attacks with noise

链接: https://arxiv.org/abs/2409.13470
作者: Lorenzo Chicchi,Duccio Fanelli,Diego Febbe,Lorenzo Buffoni,Francesca Di Patti,Lorenzo Giambagli,Raffele Marino
关键词-EN: Continuous-Variable Firing Rate, excitatory biological neurons, dynamically assisted classifier, veritable dynamically assisted, Firing Rate
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Continuous-Variable Firing Rate (CVFR) model, widely used in neuroscience to describe the intertangled dynamics of excitatory biological neurons, is here trained and tested as a veritable dynamically assisted classifier. To this end the model is supplied with a set of planted attractors which are self-consistently embedded in the inter-nodes coupling matrix, via its spectral decomposition. Learning to classify amounts to sculp the basin of attraction of the imposed equilibria, directing different items towards the corresponding destination target, which reflects the class of respective pertinence. A stochastic variant of the CVFR model is also studied and found to be robust to aversarial random attacks, which corrupt the items to be classified. This remarkable finding is one of the very many surprising effects which arise when noise and dynamical attributes are made to mutually resonate.

[LG-30] Higher-Order Message Passing for Glycan Representation Learning NEURIPS2024

链接: https://arxiv.org/abs/2409.13467
作者: Roman Joeres,Daniel Bojar
关键词-EN: monosaccharides forming extended, complex biological sequence, non-linear sequences, biological sequence, forming extended
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Submitted to MLSB Workshop at NeurIPS 2024

点击查看摘要

Abstract:Glycans are the most complex biological sequence, with monosaccharides forming extended, non-linear sequences. As post-translational modifications, they modulate protein structure, function, and interactions. Due to their diversity and complexity, predictive models of glycan properties and functions are still insufficient. Graph Neural Networks (GNNs) are deep learning models designed to process and analyze graph-structured data. These architectures leverage the connectivity and relational information in graphs to learn effective representations of nodes, edges, and entire graphs. Iteratively aggregating information from neighboring nodes, GNNs capture complex patterns within graph data, making them particularly well-suited for tasks such as link prediction or graph classification across domains. This work presents a new model architecture based on combinatorial complexes and higher-order message passing to extract features from glycan structures into a latent space representation. The architecture is evaluated on an improved GlycanML benchmark suite, establishing a new state-of-the-art performance. We envision that these improvements will spur further advances in computational glycosciences and reveal the roles of glycans in biology.

[LG-31] Global Outlier Detection in a Federated Learning Setting with Isolation Forest

链接: https://arxiv.org/abs/2409.13466
作者: Daniele Malpetti,Laura Azzimonti
关键词-EN: detecting global outliers, federated learning setting, learning setting, cross-silo scenarios, strategy for detecting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for publication at FLTA 2024: The 2nd IEEE International Conference on Federated Learning Technologies and Applications

点击查看摘要

Abstract:We present a novel strategy for detecting global outliers in a federated learning setting, targeting in particular cross-silo scenarios. Our approach involves the use of two servers and the transmission of masked local data from clients to one of the servers. The masking of the data prevents the disclosure of sensitive information while still permitting the identification of outliers. Moreover, to further safeguard privacy, a permutation mechanism is implemented so that the server does not know which client owns any masked data point. The server performs outlier detection on the masked data, using either Isolation Forest or its extended version, and then communicates outlier information back to the clients, allowing them to identify and remove outliers in their local datasets before starting any subsequent federated model training. This approach provides comparable results to a centralized execution of Isolation Forest algorithms on plain data.

[LG-32] Data Compression using Rank-1 Lattices for Parameter Estimation in Machine Learning

链接: https://arxiv.org/abs/2409.13453
作者: Michael Gnewuch,Kumar Harsha,Marcin Wnuk
关键词-EN: supervised machine learning, machine learning, regularized versions, supervised machine, standard loss functions
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 25 pages, 1 figure

点击查看摘要

Abstract:The mean squared error and regularized versions of it are standard loss functions in supervised machine learning. However, calculating these losses for large data sets can be computationally demanding. Modifying an approach of J. Dick and M. Feischl [Journal of Complexity 67 (2021)], we present algorithms to reduce extensive data sets to a smaller size using rank-1 lattices. Rank-1 lattices are quasi-Monte Carlo (QMC) point sets that are, if carefully chosen, well-distributed in a multidimensional unit cube. The compression strategy in the preprocessing step assigns every lattice point a pair of weights depending on the original data and responses, representing its relative importance. As a result, the compressed data makes iterative loss calculations in optimization steps much faster. We analyze the errors of our QMC data compression algorithms and the cost of the preprocessing step for functions whose Fourier coefficients decay sufficiently fast so that they lie in certain Wiener algebras or Korobov spaces. In particular, we prove that our approach can lead to arbitrary high convergence rates as long as the functions are sufficiently smooth.

[LG-33] Noise-Robust and Resource-Efficient ADMM-based Federated Learning

链接: https://arxiv.org/abs/2409.13451
作者: Ehsan Lari,Reza Arablouei,Vinay Chakravarthi Gogineni,Stefan Werner
关键词-EN: leverages client-server communications, leverages client-server, decentralized data, communication noise, Federated learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
*备注: 13 pages, 10 figures, Submitted to IEEE Open Journal of Signal Processing

点击查看摘要

Abstract:Federated learning (FL) leverages client-server communications to train global models on decentralized data. However, communication noise or errors can impair model accuracy. To address this problem, we propose a novel FL algorithm that enhances robustness against communication noise while also reducing communication load. We derive the proposed algorithm through solving the weighted least-squares (WLS) regression problem as an illustrative example. We first frame WLS regression as a distributed convex optimization problem over a federated network employing random scheduling for improved communication efficiency. We then apply the alternating direction method of multipliers (ADMM) to iteratively solve this problem. To counteract the detrimental effects of cumulative communication noise, we introduce a key modification by eliminating the dual variable and implementing a new local model update at each participating client. This subtle yet effective change results in using a single noisy global model update at each client instead of two, improving robustness against additive communication noise. Furthermore, we incorporate another modification enabling clients to continue local updates even when not selected by the server, leading to substantial performance improvements. Our theoretical analysis confirms the convergence of our algorithm in both mean and the mean-square senses, even when the server communicates with a random subset of clients over noisy links at each iteration. Numerical results validate the effectiveness of our proposed algorithm and corroborate our theoretical findings.

[LG-34] A User Study on Contrastive Explanations for Multi-Effector Temporal Planning with Non-Stationary Costs

链接: https://arxiv.org/abs/2409.13427
作者: Xiaowei Liu,Kevin McAreavey,Weiru Liu
关键词-EN: adopt constrastive explanations, smart homes, adopt constrastive, end-user application, temporal planning
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we adopt constrastive explanations within an end-user application for temporal planning of smart homes. In this application, users have requirements on the execution of appliance tasks, pay for energy according to dynamic energy tariffs, have access to high-capacity battery storage, and are able to sell energy to the grid. The concurrent scheduling of devices makes this a multi-effector planning problem, while the dynamic tariffs yield costs that are non-stationary (alternatively, costs that are stationary but depend on exogenous events). These characteristics are such that the planning problems are generally not supported by existing PDDL-based planners, so we instead design a custom domain-dependent planner that scales to reasonable appliance numbers and time horizons. We conduct a controlled user study with 128 participants using an online crowd-sourcing platform based on two user stories. Our results indicate that users provided with contrastive questions and explanations have higher levels of satisfaction, tend to gain improved understanding, and rate the helpfulness more favourably with the recommended AI schedule compared to those without access to these features.

[LG-35] Causal Reinforcement Learning for Optimisation of Robot Dynamics in Unknown Environments

链接: https://arxiv.org/abs/2409.13423
作者: Julian Gerald Dcruz,Sam Mahoney,Jia Yun Chua,Adoundeth Soukhabandith,John Mugabe,Weisi Guo,Miguel Arana-Catania
关键词-EN: Autonomous operations, Causal Reinforcement Learning, unknown environments, environments are challenging, challenging due
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages, 12 figures, 3 tables. To be presented in 10th IEEE International Smart Cities Conference (ISC2-2024)

点击查看摘要

Abstract:Autonomous operations of robots in unknown environments are challenging due to the lack of knowledge of the dynamics of the interactions, such as the objects’ movability. This work introduces a novel Causal Reinforcement Learning approach to enhancing robotics operations and applies it to an urban search and rescue (SAR) scenario. Our proposed machine learning architecture enables robots to learn the causal relationships between the visual characteristics of the objects, such as texture and shape, and the objects’ dynamics upon interaction, such as their movability, significantly improving their decision-making processes. We conducted causal discovery and RL experiments demonstrating the Causal RL’s superior performance, showing a notable reduction in learning times by over 24.5% in complex situations, compared to non-causal models.

[LG-36] State space models emergence and ergodicity: How many parameters are needed for stable predictions?

链接: https://arxiv.org/abs/2409.13421
作者: Ingvar Ziemann,Nikolai Matni,George J. Pappas
关键词-EN: self-supervised learning, parameters, large language models, simple theoretical model, learning linear dynamical
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:How many parameters are required for a model to execute a given task? It has been argued that large language models, pre-trained via self-supervised learning, exhibit emergent capabilities such as multi-step reasoning as their number of parameters reach a critical scale. In the present work, we explore whether this phenomenon can analogously be replicated in a simple theoretical model. We show that the problem of learning linear dynamical systems – a simple instance of self-supervised learning – exhibits a corresponding phase transition. Namely, for every non-ergodic linear system there exists a critical threshold such that a learner using fewer parameters than said threshold cannot achieve bounded error for large sequence lengths. Put differently, in our model we find that tasks exhibiting substantial long-range correlation require a certain critical number of parameters – a phenomenon akin to emergence. We also investigate the role of the learner’s parametrization and consider a simple version of a linear dynamical system with hidden state – an imperfectly observed random walk in \mathbbR . For this situation, we show that there exists no learner using a linear filter which can succesfully learn the random walk unless the filter length exceeds a certain threshold depending on the effective memory length and horizon of the problem.

[LG-37] Occupancy-Based Dual Contouring SIGGRAPH

链接: https://arxiv.org/abs/2409.13418
作者: Jisung Hwang,Minhyuk Sung
关键词-EN: dual contouring, achieving computation times, Manifold Dual Contouring, dual contouring method, points
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted to SIGGRAPH Asia (conference) 2024. Code: this https URL

点击查看摘要

Abstract:We introduce a dual contouring method that provides state-of-the-art performance for occupancy functions while achieving computation times of a few seconds. Our method is learning-free and carefully designed to maximize the use of GPU parallelization. The recent surge of implicit neural representations has led to significant attention to occupancy fields, resulting in a wide range of 3D reconstruction and generation methods based on them. However, the outputs of such methods have been underestimated due to the bottleneck in converting the resulting occupancy function to a mesh. Marching Cubes tends to produce staircase-like artifacts, and most subsequent works focusing on exploiting signed distance functions as input also yield suboptimal results for occupancy functions. Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information. We introduce auxiliary 2D points that are used to compute local surface normals along with the 1D points, helping identify 3D points via the quadric error function. To search the 1D, 2D, and 3D points, we develop fast algorithms that are parallelizable across all grid edges, faces, and cells. Our experiments with several 3D neural generative models and a 3D mesh dataset demonstrate that our method achieves the best fidelity compared to prior works.

[LG-38] Credit Card Fraud Detection: A Deep Learning Approach

链接: https://arxiv.org/abs/2409.13406
作者: Sourav Verma,Joydip Dhar
关键词-EN: Credit card, credit card transactions, electronic transactions, fraudulent credit card, recent times
类目: Machine Learning (cs.LG)
*备注: Part of the M.Tech. thesis. Sourav Verma, ABV-Indian Institute of Information Technology, Gwalior 2013-18

点击查看摘要

Abstract:Credit card is one of the most extensive methods of instalment for both online and offline mode of payment for electronic transactions in recent times. credit cards invention has provided significant ease in electronic transactions. However, it has also provided new fraud opportunities for criminals, which results in increased fraud rates. Substantial amount of money has been lost by many institutions and individuals due to fraudulent credit card transactions. Adapting improved and dynamic fraud recognition frameworks thus became essential for all credit card distributing banks to mitigate their losses. In fact, the problem of fraudulent credit card transactions implicates a number of relevant real-time challenges, namely: Concept drift, Class imbalance, and Verification latency. However, the vast majority of current systems are based on artificial intelligence (AI), Fuzzy logic, Machine Learning, Data mining, Genetic Algorithms, and so on, rely on assumptions that hardly address all the relevant challenges of fraud-detection system (FDS). This paper aims to understand implement Deep Learning algorithms in order to obtain a high fraud coverage with very low false positive rate. Also, it aims to implement an auto-encoder as an unsupervised (semi-supervised) method of learning common patterns. Keywords: Credit card fraud, Fraud-detection system (FDS), Electronic transactions, Concept drift, Class imbalance, Verification latency, Machine Learning, Deep Learning

[LG-39] ALPEC: A Comprehensive Evaluation Framework and Dataset for Machine Learning-Based Arousal Detection in Clinical Practice

链接: https://arxiv.org/abs/2409.13367
作者: Stefan Kraft,Andreas Theissler,Vera Wienhausen-Wilke,Philipp Walter,Gjergji Kasneci
关键词-EN: diagnosing sleep disorders, sleep disorders, diagnosing sleep, Machine Learning, essential for diagnosing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting arousals in sleep is essential for diagnosing sleep disorders. However, using Machine Learning (ML) in clinical practice is impeded by fundamental issues, primarily due to mismatches between clinical protocols and ML methods. Clinicians typically annotate only the onset of arousals, while ML methods rely on annotations for both the beginning and end. Additionally, there is no standardized evaluation methodology tailored to clinical needs for arousal detection models. This work addresses these issues by introducing a novel post-processing and evaluation framework emphasizing approximate localization and precise event count (ALPEC) of arousals. We recommend that ML practitioners focus on detecting arousal onsets, aligning with clinical practice. We examine the impact of this shift on current training and evaluation schemes, addressing simplifications and challenges. We utilize a novel comprehensive polysomnographic dataset (CPS) that reflects the aforementioned clinical annotation constraints and includes modalities not present in existing polysomnographic datasets. We release the dataset alongside this paper, demonstrating the benefits of leveraging multimodal data for arousal onset detection. Our findings significantly contribute to integrating ML-based arousal detection in clinical settings, reducing the gap between technological advancements and clinical needs.

[LG-40] FPBoost: Fully Parametric Gradient Boosting for Survival Analysis

链接: https://arxiv.org/abs/2409.13363
作者: Alberto Archetti,Eugenio Lomurno,Diego Piccinotti,Matteo Matteucci
关键词-EN: valuable clinical insights, extracting valuable clinical, tool for analyzing, data and extracting, clinical insights
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Survival analysis is a critical tool for analyzing time-to-event data and extracting valuable clinical insights. Recently, numerous machine learning techniques leveraging neural networks and decision trees have been developed for this task. Among these, the most successful approaches often rely on specific assumptions about the shape of the modeled hazard function. These assumptions include proportional hazard, accelerated failure time, or discrete estimation at a predefined set of time points. In this study, we propose a novel paradigm for survival model design based on the weighted sum of individual fully parametric hazard contributions. We build upon well-known ensemble techniques to deliver a novel contribution to the field by applying additive hazard functions, improving over approaches based on survival or cumulative hazard functions. Furthermore, the proposed model, which we call FPBoost, is the first algorithm to directly optimize the survival likelihood via gradient boosting. We evaluated our approach across a diverse set of datasets, comparing it against a variety of state-of-the-art models. The results demonstrate that FPBoost improves risk estimation, according to both concordance and calibration metrics.

[LG-41] Generative Aerodynamic Design with Diffusion Probabilistic Models

链接: https://arxiv.org/abs/2409.13328
作者: Thomas Wagenaar,Simone Mancini,Andrés Mateo-Gabín
关键词-EN: evaluate and iteratively, iteratively improve, number of expensive, geometries, expensive simulations
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 10 pages, 11 figures, DLRK 2024

点击查看摘要

Abstract:The optimization of geometries for aerodynamic design often relies on a large number of expensive simulations to evaluate and iteratively improve the geometries. It is possible to reduce the number of simulations by providing a starting geometry that has properties close to the desired requirements, often in terms of lift and drag, aerodynamic moments and surface areas. We show that generative models have the potential to provide such starting geometries by generalizing geometries over a large dataset of simulations. In particular, we leverage diffusion probabilistic models trained on XFOIL simulations to synthesize two-dimensional airfoil geometries conditioned on given aerodynamic features and constraints. The airfoils are parameterized with Bernstein polynomials, ensuring smoothness of the generated designs. We show that the models are able to generate diverse candidate designs for identical requirements and constraints, effectively exploring the design space to provide multiple starting points to optimization procedures. However, the quality of the candidate designs depends on the distribution of the simulated designs in the dataset. Importantly, the geometries in this dataset must satisfy other requirements and constraints that are not used in conditioning of the diffusion model, to ensure that the generated geometries are physical.

[LG-42] SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

链接: https://arxiv.org/abs/2409.13321
作者: Jinge Wu,Yunsoo Kim,Daqian Shi,David Cliffton,Fenglin Liu,Honghan Wu
关键词-EN: growing research interest, assist clinicians, large language models, success of large, growing research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re ^3 Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.

[LG-43] A Ring-Based Distributed Algorithm for Learning High-Dimensional Bayesian Networks

链接: https://arxiv.org/abs/2409.13314
作者: Jorge D. Laborda,Pablo Torrijos,José M. Puerta,José A. Gámez
关键词-EN: Learning Bayesian Networks, Bayesian Networks, Greedy Equivalence Search, Learning Bayesian, time-consuming task
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning Bayesian Networks (BNs) from high-dimensional data is a complex and time-consuming task. Although there are approaches based on horizontal (instances) or vertical (variables) partitioning in the literature, none can guarantee the same theoretical properties as the Greedy Equivalence Search (GES) algorithm, except those based on the GES algorithm itself. In this paper, we propose a directed ring-based distributed method that uses GES as the local learning algorithm, ensuring the same theoretical properties as GES but requiring less CPU time. The method involves partitioning the set of possible edges and constraining each processor in the ring to work only with its received subset. The global learning process is an iterative algorithm that carries out several rounds until a convergence criterion is met. In each round, each processor receives a BN from its predecessor in the ring, fuses it with its own BN model, and uses the result as the starting solution for a local learning process constrained to its set of edges. Subsequently, it sends the model obtained to its successor in the ring. Experiments were carried out on three large domains (400-1000 variables), demonstrating our proposal’s effectiveness compared to GES and its fast version (fGES).

[LG-44] MeMoir: A Software-Driven Covert Channel based on Memory Usage

链接: https://arxiv.org/abs/2409.13310
作者: Jeferson Gonzalez-Gomez,Jose Alejandro Ibarra-Campos,Jesus Yamir Sandoval-Morales,Lars Bauer,Jörg Henkel
关键词-EN: Covert channel, modern computing systems, continuously studied, studied as severe, severe threats
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Covert channel attacks have been continuously studied as severe threats to modern computing systems. Software-based covert channels are a typically hard-to-detect branch of these attacks, since they leverage virtual resources to establish illegitimate communication between malicious actors. In this work, we present MeMoir: a novel software-driven covert channel that, for the first time, utilizes memory usage as the medium for the channel. We implemented the new covert channel on two real-world platforms with different architectures: a general-purpose Intel x86-64-based desktop computer and an ARM64-based embedded system. Our results show that our new architecture- and hardware-agnostic covert channel is effective and achieves moderate transmission rates with very low error. Moreover, we present a real use-case for our attack where we were able to communicate information from a Hyper-V virtualized enviroment to a Windows 11 host system. In addition, we implement a machine learning-based detector that can predict whether an attack is present in the system with an accuracy of more than 95% with low false positive and false negative rates by monitoring the use of system memory. Finally, we introduce a noise-based countermeasure that effectively mitigates the attack while inducing a low power overhead in the system compared to other normal applications.

[LG-45] Predicting DNA fragmentation: A non-destructive analogue to chemical assays using machine learning

链接: https://arxiv.org/abs/2409.13306
作者: Byron A Jacobs,Ifthakaar Shaik,Frando Lin
关键词-EN: vitro fertilisation, IVF, sperm DNA, sperm, sperm cell DNA
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Globally, infertility rates are increasing, with 2.5% of all births being assisted by in vitro fertilisation (IVF) in 2022. Male infertility is the cause for approximately half of these cases. The quality of sperm DNA has substantial impact on the success of IVF. The assessment of sperm DNA is traditionally done through chemical assays which render sperm cells ineligible for IVF. Many compounding factors lead to the population crisis, with fertility rates dropping globally in recent history. As such assisted reproductive technologies (ART) have been the focus of recent research efforts. Simultaneously, artificial intelligence has grown ubiquitous and is permeating more aspects of modern life. With the advent of state-of-the-art machine learning and its exceptional performance in many sectors, this work builds on these successes and proposes a novel framework for the prediction of sperm cell DNA fragmentation from images of unstained sperm. Rendering a predictive model which preserves sperm integrity and allows for optimal selection of sperm for IVF.

[LG-46] OMG-RL:Offline Model-based Guided Reward Learning for Heparin Treatment

链接: https://arxiv.org/abs/2409.13299
作者: Yooseok Lim,Sujee Lee
关键词-EN: medical decision-making processes, personalized medical decision-making, individual patient conditions, Accurate diagnosis, medication dosing strategies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Accurate diagnosis of individual patient conditions and appropriate medication dosing strategies are core elements of personalized medical decision-making processes. This therapeutic procedure, which entails recursively assessing the patient’s condition and administering suitable medications, can effectively be modeled as a reinforcement learning (RL) problem. Crucially, the success of RL in this context depends on the establishment of a well-defined reward function that accurately represents the optimal treatment strategy. However, defining the learning direction in RL with only a limited set of explicit indicators complicates the task due to the inherent complexity of the required domain knowledge. This approach may also increase the likelihood that the RL policy does not adequately reflect the clinician’s treatment intentions, which are determined by considering various situations and indicators. In this study, we focus on developing a reward function that reflects the clinician’s intentions and introduce Offline Model-based Guided Reward Learning (OMG-RL), which performs offline inverse reinforcement learning (IRL) aligned with the offline RL environment. Through OMG-RL, we learn a parameterized reward function that includes the expert’s intentions from limited data, thereby enhancing the agent’s policy. We validate the proposed approach on the heparin dosing task. The results demonstrate that policy learning through OMG-RL is meaningful and confirm that the learned policy is positively reinforced in terms of activated partial thromboplastin time (aPTT), a key indicator for monitoring the effects of heparin. This approach can be broadly utilized not only for the heparin dosing problem but also for RL-based medication dosing tasks in general.

[LG-47] me Distributed Deep Learning models for Purely Exogenous Forecasting. Application to Water Table Depth Prediction using Weather Image Time Series

链接: https://arxiv.org/abs/2409.13284
作者: Matteo Salis,Abdourrahmane M. Atto,Stefano Ferraris,Rosa Meo
关键词-EN: resources management framework, sustainable resources management, Groundwater resources, Time Distributed Convolutional, management framework
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Groundwater resources are one of the most relevant elements in the water cycle, therefore developing models to accurately predict them is a pivotal task in the sustainable resources management framework. Deep Learning (DL) models have been revealed very effective in hydrology, especially by feeding spatially distributed data (e.g. raster data). In many regions, hydrological measurements are difficult to obtain regularly or periodically in time, and in some cases, last available data are not up to date. Reversely, weather data, which significantly impacts water resources, are usually more available and with higher quality. More specifically, we have proposed two different DL models to predict the water table depth in the Grana-Maira catchment (Piemonte, IT) using only exogenous weather image time series. To deal with the image time series, both models are made of a first Time Distributed Convolutional Neural Network (TDC) which encodes the image available at each time step into a vectorial representation. The first model, TDC-LSTM uses then a Sequential Module based on an LSTM layer to learn temporal relations and output the predictions. The second model, TDC-UnPWaveNet uses instead a new version of the WaveNet architecture, adapted here to output a sequence shorter and completely shifted in the future with respect to the input one. To this aim, and to deal with the different sequence lengths in the UnPWaveNet, we have designed a new Channel Distributed layer, that acts like a Time Distributed one but on the channel dimension, i.e. applying the same set of operations to each channel of the input. TDC-LSTM and TDC-UnPWaveNet have shown both remarkable results. However, the two models have focused on different learnable information: TDC-LSTM has focused more on lowering the bias, while the TDC-UnPWaveNet has focused more on the temporal dynamics maximising correlation and KGE.

[LG-48] Efficient Training of Deep Neural Operator Networks via Randomized Sampling

链接: https://arxiv.org/abs/2409.13280
作者: Sharmila Karumuri,Lori Graham-Brady,Somdatta Goswami
关键词-EN: employ deep neural, deep neural networks, infinite-dimensional function spaces, Deep operator network, Neural operators
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural operators (NOs) employ deep neural networks to learn mappings between infinite-dimensional function spaces. Deep operator network (DeepONet), a popular NO architecture, has demonstrated success in the real-time prediction of complex dynamics across various scientific and engineering applications. In this work, we introduce a random sampling technique to be adopted during the training of DeepONet, aimed at improving the generalization ability of the model, while significantly reducing the computational time. The proposed approach targets the trunk network of the DeepONet model that outputs the basis functions corresponding to the spatiotemporal locations of the bounded domain on which the physical system is defined. Traditionally, while constructing the loss function, DeepONet training considers a uniform grid of spatiotemporal points at which all the output functions are evaluated for each iteration. This approach leads to a larger batch size, resulting in poor generalization and increased memory demands, due to the limitations of the stochastic gradient descent (SGD) optimizer. The proposed random sampling over the inputs of the trunk net mitigates these challenges, improving generalization and reducing memory requirements during training, resulting in significant computational gains. We validate our hypothesis through three benchmark examples, demonstrating substantial reductions in training time while achieving comparable or lower overall test errors relative to the traditional training approach. Our results indicate that incorporating randomization in the trunk network inputs during training enhances the efficiency and robustness of DeepONet, offering a promising avenue for improving the framework’s performance in modeling complex physical systems.

[LG-49] Inductive Spatial Temporal Prediction Under Data Drift with Informative Graph Neural Network

链接: https://arxiv.org/abs/2409.13253
作者: Jialun Zheng,Divya Saxena,Jiannong Cao,Hanchen Yang,Penghui Ruan
关键词-EN: highly dynamic scenarios, Inductive spatial temporal, traffic systems, Inductive spatial, crucial for highly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inductive spatial temporal prediction can generalize historical data to predict unseen data, crucial for highly dynamic scenarios (e.g., traffic systems, stock markets). However, external events (e.g., urban structural growth, market crash) and emerging new entities (e.g., locations, stocks) can undermine prediction accuracy by inducing data drift over time. Most existing studies extract invariant patterns to counter data drift but ignore pattern diversity, exhibiting poor generalization to unseen entities. To address this issue, we design an Informative Graph Neural Network (INF-GNN) to distill diversified invariant patterns and improve prediction accuracy under data drift. Firstly, we build an informative subgraph with a uniquely designed metric, Relation Importance (RI), that can effectively select stable entities and distinct spatial relationships. This subgraph further generalizes new entities’ data via neighbors merging. Secondly, we propose an informative temporal memory buffer to help the model emphasize valuable timestamps extracted using influence functions within time intervals. This memory buffer allows INF-GNN to discern influential temporal patterns. Finally, RI loss optimization is designed for pattern consolidation. Extensive experiments on real-world dataset under substantial data drift demonstrate that INF-GNN significantly outperforms existing alternatives.

[LG-50] Exploring the ability of the Deep Ritz Method to model strain localization as a sharp discontinuity

链接: https://arxiv.org/abs/2409.13241
作者: Omar León,Víctor Rivera,Angel Vázquez-Patiño,Jacinto Ulloa,Esteban Samaniego
关键词-EN: Deep Ritz Method, Ritz Method, Deep Ritz, Artificial Neural Networks, displacement field
类目: Machine Learning (cs.LG)
*备注: The article has 22 pages including 14 figures and 26 references. The manuscript was prepared for submission to Archives of Computational Methods in Engineering

点击查看摘要

Abstract:We present an exploratory study of the possibilities of the Deep Ritz Method (DRM) for the modeling of strain localization in solids as a sharp discontinuity in the displacement field. For this, we use a regularized strong discontinuity kinematics within a variational setting for elastoplastic solids. The corresponding mathematical model is discretized using Artificial Neural Networks (ANNs). The architecture takes care of the kinematics, while the variational statement of the boundary value problem is taken care of by the loss function. The main idea behind this approach is to solve both the equilibrium problem and the location of the localization band by means of trainable parameters in the ANN. As a proof of concept, we show through both 1D and 2D numerical examples that the computational modeling of strain localization for elastoplastic solids within the framework of DRM is feasible.

[LG-51] Balancing Label Imbalance in Federated Environments Using Only Mixup and Artificially-Labeled Noise

链接: https://arxiv.org/abs/2409.13235
作者: Kyle Sang,Tahseen Rabbani,Furong Huang
关键词-EN: differing subsets, hold data skewed, federated environment, non-iid federated learning, hold data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clients in a distributed or federated environment will often hold data skewed towards differing subsets of labels. This scenario, referred to as heterogeneous or non-iid federated learning, has been shown to significantly hinder model training and performance. In this work, we explore the limits of a simple yet effective augmentation strategy for balancing skewed label distributions: filling in underrepresented samples of a particular label class using pseudo-images. While existing algorithms exclusively train on pseudo-images such as mixups of local training data, our augmented client datasets consist of both real and pseudo-images. In further contrast to other literature, we (1) use a DP-Instahide variant to reduce the decodability of our image encodings and (2) as a twist, supplement local data using artificially labeled, training-free ‘natural noise’ generated by an untrained StyleGAN. These noisy images mimic the power spectra patterns present in natural scenes which, together with mixup images, help homogenize label distribution among clients. We demonstrate that small amounts of augmentation via mixups and natural noise markedly improve label-skewed CIFAR-10 and MNIST training.

[LG-52] Relationship between Uncertainty in DNNs and Adversarial Attacks

链接: https://arxiv.org/abs/2409.13232
作者: Abigail Adeniran,Adewale Adeyemo
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, natural language processing, outperformed human accuracy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: review

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have achieved state of the art results and even outperformed human accuracy in many challenging tasks, leading to DNNs adoption in a variety of fields including natural language processing, pattern recognition, prediction, and control optimization. However, DNNs are accompanied by uncertainty about their results, causing them to predict an outcome that is either incorrect or outside of a certain level of confidence. These uncertainties stem from model or data constraints, which could be exacerbated by adversarial attacks. Adversarial attacks aim to provide perturbed input to DNNs, causing the DNN to make incorrect predictions or increase model uncertainty. In this review, we explore the relationship between DNN uncertainty and adversarial attacks, emphasizing how adversarial attacks might raise DNN uncertainty.

[LG-53] Incremental Few-Shot Adaptation for Non-Prehensile Object Manipulation using Parallelizable Physics Simulators ICRA

链接: https://arxiv.org/abs/2409.13228
作者: Fabian Baumeister,Lukas Mack,Joerg Stueckler
关键词-EN: flexible production, important capability, capability for intelligent, perform tasks, tasks in open-world
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to IEEE International Conference on Robotics and Automation (ICRA), 2025

点击查看摘要

Abstract:Few-shot adaptation is an important capability for intelligent robots that perform tasks in open-world settings such as everyday environments or flexible production. In this paper, we propose a novel approach for non-prehensile manipulation which iteratively adapts a physics-based dynamics model for model-predictive control. We adapt the parameters of the model incrementally with a few examples of robot-object interactions. This is achieved by sampling-based optimization of the parameters using a parallelizable rigid-body physics simulation as dynamic world model. In turn, the optimized dynamics model can be used for model-predictive control using efficient sampling-based optimization. We evaluate our few-shot adaptation approach in several object pushing experiments in simulation and with a real robot.

[LG-54] RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

链接: https://arxiv.org/abs/2409.13221
作者: Yinmin Zhong,Zili Zhang,Bingyang Wu,Shengyu Liu,Yukun Chen,Changyi Wan,Hanpeng Hu,Lei Xia,Ranchen Ming,Yibo Zhu,Xin Jin
关键词-EN: Human Feedback, pivotal post-training technique, human preference, RLHF, Reinforcement Learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal post-training technique to enhance the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems. Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2409.13221 [cs.LG] (or arXiv:2409.13221v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13221 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-55] MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

链接: https://arxiv.org/abs/2409.13213
作者: Eric Li,Yifan Zhang,Yu Huang,Kevin Leach
关键词-EN: Recent growth, growth and proliferation, promptly classify, malware, tested practitioners’ ability
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent growth and proliferation of malware has tested practitioners’ ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners’ ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a novel domain-knowledge-aware technique for augmenting malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware feature augmentation methods and highlights the capabilities of similar semi-supervised classifiers in addressing malware classification issues.

[LG-56] A Unified Causal Framework for Auditing Recommender Systems for Ethical Concerns

链接: https://arxiv.org/abs/2409.13210
作者: Vibhhu Sharma,Shantanu Gupta,Nil-Jana Akpinar,Zachary C. Lipton,Liu Leqi
关键词-EN: recommender systems, beliefs and preferences, widely deployed, recommender system auditing, Auditing recommender systems
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 28 pages

点击查看摘要

Abstract:As recommender systems become widely deployed in different domains, they increasingly influence their users’ beliefs and preferences. Auditing recommender systems is crucial as it not only ensures the continuous improvement of recommendation algorithms but also safeguards against potential issues like biases and ethical concerns. In this paper, we view recommender system auditing from a causal lens and provide a general recipe for defining auditing metrics. Under this general causal auditing framework, we categorize existing auditing metrics and identify gaps in them – notably, the lack of metrics for auditing user agency while accounting for the multi-step dynamics of the recommendation process. We leverage our framework and propose two classes of such metrics:future- and past-reacheability and stability, that measure the ability of a user to influence their own and other users’ recommendations, respectively. We provide both a gradient-based and a black-box approach for computing these metrics, allowing the auditor to compute them under different levels of access to the recommender system. In our experiments, we demonstrate the efficacy of methods for computing the proposed metrics and inspect the design of recommender systems through these proposed metrics.

[LG-57] Redefining Data Pairing for Motion Retargeting Leveraging a Human Body Prior IROS2024

链接: https://arxiv.org/abs/2409.13208
作者: Xiyana Figuera,Soogeun Park,Hyemin Ahn
关键词-EN: HUman, robot poses, poses, data, random robot poses
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages, 5 Figures, Accepted at IROS 2024

点击查看摘要

Abstract:We propose MR.HuBo (Motion Retargeting leveraging a HUman BOdy prior), a cost-effective and convenient method to collect high-quality upper body paired \langle \textrobot, human \rangle pose data, which is essential for data-driven motion retargeting methods. Unlike existing approaches which collect \langle \textrobot, human \rangle pose data by converting human MoCap poses into robot poses, our method goes in reverse. We first sample diverse random robot poses, and then convert them into human poses. However, since random robot poses can result in extreme and infeasible human poses, we propose an additional technique to sort out extreme poses by exploiting a human body prior trained from a large amount of human pose data. Our data collection method can be used for any humanoid robots, if one designs or optimizes the system’s hyperparameters which include a size scale factor and the joint angle ranges for sampling. In addition to this data collection method, we also present a two-stage motion retargeting neural network that can be trained via supervised learning on a large amount of paired data. Compared to other learning-based methods trained via unsupervised learning, we found that our deep neural network trained with ample high-quality paired data achieved notable performance. Our experiments also show that our data filtering method yields better retargeting results than training the model with raw and noisy data. Our code and video results are available on this https URL

[LG-58] Unveiling Population Heterogeneity in Health Risks Posed by Environmental Hazards Using Regression-Guided Neural Network

链接: https://arxiv.org/abs/2409.13205
作者: Jong Woo Nam,Eun Young Choi,Jennifer A. Ailshire,Yao-Yi Chiang
关键词-EN: disproportionately higher risks, Environmental hazards place, disproportionately higher, MMR, hazards place
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Environmental hazards place certain individuals at disproportionately higher risks. As these hazards increasingly endanger human health, precise identification of the most vulnerable population subgroups is critical for public health. Moderated multiple regression (MMR) offers a straightforward method for investigating this by adding interaction terms between the exposure to a hazard and other population characteristics to a linear regression model. However, when the vulnerabilities are hidden within a cross-section of many characteristics, MMR is often limited in its capabilities to find any meaningful discoveries. Here, we introduce a hybrid method, named regression-guided neural networks (ReGNN), which utilizes artificial neural networks (ANNs) to non-linearly combine predictors, generating a latent representation that interacts with a focal predictor (i.e. variable measuring exposure to an environmental hazard). We showcase the use of ReGNN for investigating the population heterogeneity in the health effects of exposure to air pollution (PM2.5) on cognitive functioning scores. We demonstrate that population heterogeneity that would otherwise be hidden using traditional MMR can be found using ReGNN by comparing its results to the fit results of the traditional MMR models. In essence, ReGNN is a novel tool that enhances traditional regression models by effectively summarizing and quantifying an individual’s susceptibility to health risks.

[LG-59] Exploring Scaling Laws for Local SGD in Large Language Model Training

链接: https://arxiv.org/abs/2409.13198
作者: Qiaozhi He,Xiaomin Zhuang,Zhihua Wu
关键词-EN: loosely connected devices, paper investigates scaling, investigates scaling laws, distributed optimization algorithm, local SGD
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Technical Report

点击查看摘要

Abstract:This paper investigates scaling laws for local SGD in LLM training, a distributed optimization algorithm that facilitates training on loosely connected devices. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. This demonstrates its viability as an alternative to single large-cluster training.

[LG-60] BoilerTAI: A Platform for Enhancing Instruction Using Generative AI in Educational Forums

链接: https://arxiv.org/abs/2409.13196
作者: Anvit Sinha,Shruti Goyal,Zachary Sy,Rhianna Kuperus,Ethan Dickey,Andres Bejarano
关键词-EN: Category track describes, Research Category track, Category track, Large Language Model, Toggle
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 1 figure. Accepted for publication in Frontiers in Education 2024

点击查看摘要

Abstract:Contribution: This Full paper in the Research Category track describes a practical, scalable platform that seamlessly integrates Generative AI (GenAI) with online educational forums, offering a novel approach to augment the instructional capabilities of staff. The platform empowers instructional staff to efficiently manage, refine, and approve responses by facilitating interaction between student posts and a Large Language Model (LLM). This contribution enhances the efficiency and effectiveness of instructional support and significantly improves the quality and speed of responses provided to students, thereby enriching the overall learning experience. Background: Grounded in Vygotsky’s socio-cultural theory and the concept of the More Knowledgeable Other (MKO), the study examines how GenAI can act as an auxiliary MKO to enrich educational dialogue between students and instructors. Research Question: How effective is GenAI in reducing the workload of instructional staff when used to pre-answer student questions posted on educational discussion forums? Methodology: Using a mixed-methods approach in large introductory programming courses, human Teaching Assistants (AI-TAs) employed an AI-assisted platform to pre-answer student queries. We analyzed efficiency indicators like the frequency of modifications to AI-generated responses and gathered qualitative feedback from AI-TAs. Findings: The findings indicate no significant difference in student reception to responses generated by AI-TAs compared to those provided by human instructors. This suggests that GenAI can effectively meet educational needs when adequately managed. Moreover, AI-TAs experienced a reduction in the cognitive load required for responding to queries, pointing to GenAI’s potential to enhance instructional efficiency without compromising the quality of education. Comments: 8 pages, 1 figure. Accepted for publication in Frontiers in Education 2024 Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG) ACMclasses: K.3.2 Cite as: arXiv:2409.13196 [cs.CY] (or arXiv:2409.13196v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2409.13196 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ethan Dickey [view email] [v1] Fri, 20 Sep 2024 04:00:30 UTC (90 KB) Full-text links: Access Paper: View a PDF of the paper titled BoilerTAI: A Platform for Enhancing Instruction Using Generative AI in Educational Forums, by Anvit Sinha and 4 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CY prev | next new | recent | 2024-09 Change to browse by: cs cs.HC cs.LG References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-61] ChemDFM-X: Towards Large Multimodal Model for Chemistry

链接: https://arxiv.org/abs/2409.13194
作者: Zihan Zhao,Bo Chen,Jingpiao Li,Lu Chen,Liyang Wen,Pengyu Wang,Zichen Zhu,Danyang Zhang,Ziping Wan,Yansi Li,Zhongyang Dai,Xin Chen,Kai Yu
关键词-EN: offer unprecedented assistance, natural science including, Rapid developments, science including chemistry, tools are expected
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: 19 pages, 7 figures, 11 tables

点击查看摘要

Abstract:Rapid developments of AI tools are expected to offer unprecedented assistance to the research of natural science including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMM) can cover the wide range of chemical data modality and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, which serves as a truly practical and useful research assistant utilizing the great potential of LMMs, is in great need. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora, while significantly reducing excessive expense, resulting in an instruction-tuning dataset containing 7.6M data. After instruction finetuning, ChemDFM-X is evaluated on extensive experiments of different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.

[LG-62] An adapted large language model facilitates multiple medical tasks in diabetes care

链接: https://arxiv.org/abs/2409.13191
作者: Lai Wei,Zhen Ying,Muyang He,Yutong Chen,Qian Yang,Yanzhe Hong,Jiaping Lu,Xiaoying Li,Weiran Huang,Ying Chen
关键词-EN: global health burden, requires multi-stakeholder collaboration, significant global health, management requires multi-stakeholder, optimizing diabetes management
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diabetes is a chronic disease that poses a significant global health burden, and optimizing diabetes management requires multi-stakeholder collaboration. Large language models (LLMs) have shown promise in various healthcare scenarios, but their effectiveness across a diverse range of diabetes tasks remains unproven. In this study, we introduced a framework to train and validate diabetes-specific LLMs. We first developed a comprehensive data processing pipeline that includes data collection, filtering, augmentation and refinement. This approach contributes to creating a high-quality, diabetes-specific dataset, and several evaluation benchmarks entirely from scratch. Utilizing the collected training dataset, we fine-tuned a diabetes-specific LLM family that demonstrated state-of-the-art proficiency in understanding and processing various diabetes tasks compared to other LLMs. Furthermore, clinical studies showed the potential applications of our models in diabetes care, including providing personalized healthcare, assisting medical education, and streamlining clinical tasks. In conclusion, our study introduced a framework to develop and evaluate a diabetes-specific LLM family, and highlighted its potential to enhance clinical practice and provide personalized, data-driven support for diabetes support when facing different end users. The code is provided via GitHub at this https URL.

[LG-63] ASPINN: An asymptotic strategy for solving singularly perturbed differential equations

链接: https://arxiv.org/abs/2409.13185
作者: Sen Wang,Peizhi Zhao,Tao Song
关键词-EN: Perturbed Differential Equations, Singularly Perturbed Differential, Solving Singularly Perturbed, Physics-Informed Neural Networks, Differential Equations
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:

点击查看摘要

Abstract:Solving Singularly Perturbed Differential Equations (SPDEs) presents challenges due to the rapid change of their solutions at the boundary layer. In this manuscript, We propose Asymptotic Physics-Informed Neural Networks (ASPINN), a generalization of Physics-Informed Neural Networks (PINN) and General-Kindred Physics-Informed Neural Networks (GKPINN) approaches. This is a decomposition method based on the idea of asymptotic analysis. Compared to PINN, the ASPINN method has a strong fitting ability for solving SPDEs due to the placement of exponential layers at the boundary layer. Unlike GKPINN, ASPINN lessens the number of fully connected layers, thereby reducing the training cost more effectively. Moreover, ASPINN theoretically approximates the solution at the boundary layer more accurately, which accuracy is also improved compared to GKPINN. We demonstrate the effect of ASPINN by solving diverse classes of SPDEs, which clearly shows that the ASPINN method is promising in boundary layer problems. Furthermore, we introduce Chebyshev Kolmogorov-Arnold Networks (Chebyshev-KAN) instead of MLP, achieving better performance in various experiments.

[LG-64] Overcoming Data Limitations in Internet Traffic Forecasting: LSTM Models with Transfer Learning and Wavelet Augmentation

链接: https://arxiv.org/abs/2409.13181
作者: Sajal Saha,Anwar Haque,Greg Sidebottom
关键词-EN: Effective internet traffic, smaller ISP networks, Effective internet, ISP networks, data
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 Figures, Submitted to Elsevier Journal of Computer Communication

点击查看摘要

Abstract:Effective internet traffic prediction in smaller ISP networks is challenged by limited data availability. This paper explores this issue using transfer learning and data augmentation techniques with two LSTM-based models, LSTMSeq2Seq and LSTMSeq2SeqAtn, initially trained on a comprehensive dataset provided by Juniper Networks and subsequently applied to smaller datasets. The datasets represent real internet traffic telemetry, offering insights into diverse traffic patterns across different network domains. Our study revealed that while both models performed well in single-step predictions, multi-step forecasts were challenging, particularly in terms of long-term accuracy. In smaller datasets, LSTMSeq2Seq generally outperformed LSTMSeq2SeqAtn, indicating that higher model complexity does not necessarily translate to better performance. The models’ effectiveness varied across different network domains, reflecting the influence of distinct traffic characteristics. To address data scarcity, Discrete Wavelet Transform was used for data augmentation, leading to significant improvements in model performance, especially in shorter-term forecasts. Our analysis showed that data augmentation is crucial in scenarios with limited data. Additionally, the study included an analysis of the models’ variability and consistency, with attention mechanisms in LSTMSeq2SeqAtn providing better short-term forecasting consistency but greater variability in longer forecasts. The results highlight the benefits and limitations of different modeling approaches in traffic prediction. Overall, this research underscores the importance of transfer learning and data augmentation in enhancing the accuracy of traffic prediction models, particularly in smaller ISP networks with limited data availability.

[LG-65] ConvLSTMTransNet: A Hybrid Deep Learning Approach for Internet Traffic Telemetry

链接: https://arxiv.org/abs/2409.13179
作者: Sajal Saha,Saikat Das,Glaucio H.S. Carvalho
关键词-EN: hybrid deep learning, Convolutional Neural Networks, Long Short-Term Memory, Gated Recurrent Unit, deep learning model
类目: Machine Learning (cs.LG)
*备注: 6 pages, 1 figure, Submitted to IEEE Virtual Conference on Communications 2024

点击查看摘要

Abstract:In this paper, we present a novel hybrid deep learning model, named ConvLSTMTransNet, designed for time series prediction, with a specific application to internet traffic telemetry. This model integrates the strengths of Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Transformer encoders to capture complex spatial-temporal relationships inherent in time series data. The ConvLSTMTransNet model was evaluated against three baseline models: RNN, LSTM, and Gated Recurrent Unit (GRU), using real internet traffic data sampled from high-speed ports on a provider edge router. Performance metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Weighted Absolute Percentage Error (WAPE) were used to assess each model’s accuracy. Our findings demonstrate that ConvLSTMTransNet significantly outperforms the baseline models by approximately 10% in terms of prediction accuracy. ConvLSTMTransNet surpasses traditional models due to its innovative architectural features, which enhance its ability to capture temporal dependencies and extract spatial features from internet traffic data. Overall, these findings underscore the importance of employing advanced architectures tailored to the complexities of internet traffic data for achieving more precise predictions.

[LG-66] An Adaptive End-to-End IoT Security Framework Using Explainable AI and LLMs

链接: https://arxiv.org/abs/2409.13177
作者: Sudipto Baral,Sajal Saha,Anwar Haque
关键词-EN: Internet of Things, Local Interpretable Model-agnostic, leverages Machine Learning, Large Language Models, Interpretable Model-agnostic Explanations
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, 1 figure, Accepted in 2024 IEEE WF-IoT Conference

点击查看摘要

Abstract:The exponential growth of the Internet of Things (IoT) has significantly increased the complexity and volume of cybersecurity threats, necessitating the development of advanced, scalable, and interpretable security frameworks. This paper presents an innovative, comprehensive framework for real-time IoT attack detection and response that leverages Machine Learning (ML), Explainable AI (XAI), and Large Language Models (LLM). By integrating XAI techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) with a model-independent architecture, we ensure our framework’s adaptability across various ML algorithms. Additionally, the incorporation of LLMs enhances the interpretability and accessibility of detection decisions, providing system administrators with actionable, human-understandable explanations of detected threats. Our end-to-end framework not only facilitates a seamless transition from model development to deployment but also represents a real-world application capability that is often lacking in existing research. Based on our experiments with the CIC-IOT-2023 dataset \citeneto2023ciciot2023, Gemini and OPENAI LLMS demonstrate unique strengths in attack mitigation: Gemini offers precise, focused strategies, while OPENAI provides extensive, in-depth security measures. Incorporating SHAP and LIME algorithms within XAI provides comprehensive insights into attack detection, emphasizing opportunities for model improvement through detailed feature analysis, fine-tuning, and the adaptation of misclassifications to enhance accuracy.

[LG-67] RPAF: A Reinforcement Prediction-Allocation Framework for Cache Allocation in Large-Scale Recommender Systems

链接: https://arxiv.org/abs/2409.13175
作者: Shuo Su,Xiaoshuang Chen,Yao Wang,Yulin Wu,Ziqiang Zhang,Kaiqiao Zhan,Ben Wang,Kun Gai
关键词-EN: Modern recommender systems, perform real-time computation, Modern recommender, limited computational resources, computation-intensive infrastructure
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modern recommender systems are built upon computation-intensive infrastructure, and it is challenging to perform real-time computation for each request, especially in peak periods, due to the limited computational resources. Recommending by user-wise result caches is widely used when the system cannot afford a real-time recommendation. However, it is challenging to allocate real-time and cached recommendations to maximize the users’ overall engagement. This paper shows two key challenges to cache allocation, i.e., the value-strategy dependency and the streaming allocation. Then, we propose a reinforcement prediction-allocation framework (RPAF) to address these issues. RPAF is a reinforcement-learning-based two-stage framework containing prediction and allocation stages. The prediction stage estimates the values of the cache choices considering the value-strategy dependency, and the allocation stage determines the cache choices for each individual request while satisfying the global budget constraint. We show that the challenge of training RPAF includes globality and the strictness of budget constraints, and a relaxed local allocator (RLA) is proposed to address this issue. Moreover, a PoolRank algorithm is used in the allocation stage to deal with the streaming allocation problem. Experiments show that RPAF significantly improves users’ engagement under computational budget constraints.

[LG-68] Bilateral Sharpness-Aware Minimization for Flatter Minima

链接: https://arxiv.org/abs/2409.13173
作者: Jiaxin Deng,Junbiao Pang,Baochang Zhang,Qingming Huang
关键词-EN: Flatness Indicator Problem, Flatness Indicator, SAM, Sharpness-Aware Minimization, reducing a Max-Sharpness
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS). Despite the practical success, we empirically found that the MAxS behind SAM’s generalization enhancements face the “Flatness Indicator Problem” (FIP), where SAM only considers the flatness in the direction of gradient ascent, resulting in a next minimization region that is not sufficiently flat. A better Flatness Indicator (FI) would bring a better generalization of neural networks. Because SAM is a greedy search method in nature. In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS). By merging MaxS and MinS, we created a better FI that indicates a flatter direction during the optimization. Specially, we combine this FI with SAM into the proposed Bilateral SAM (BSAM) which finds a more flatter minimum than that of SAM. The theoretical analysis proves that BSAM converges to local minima. Extensive experiments demonstrate that BSAM offers superior generalization performance and robustness compared to vanilla SAM across various tasks, i.e., classification, transfer learning, human pose estimation, and network quantization. Code is publicly available at: this https URL.

[LG-69] Hidden Activations Are Not Enough: A General Approach to Neural Network Predictions

链接: https://arxiv.org/abs/2409.13163
作者: Samuel Leblanc,Aiky Rasolomanana,Marco Armenta
关键词-EN: analyzing neural networks, quiver representation theory, neural network, quiver representation, neural network predictions
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Representation Theory (math.RT)
*备注:

点击查看摘要

Abstract:We introduce a novel mathematical framework for analyzing neural networks using tools from quiver representation theory. This framework enables us to quantify the similarity between a new data sample and the training data, as perceived by the neural network. By leveraging the induced quiver representation of a data sample, we capture more information than traditional hidden layer outputs. This quiver representation abstracts away the complexity of the computations of the forward pass into a single matrix, allowing us to employ simple geometric and statistical arguments in a matrix space to study neural network predictions. Our mathematical results are architecture-agnostic and task-agnostic, making them broadly applicable. As proof of concept experiments, we apply our results for the MNIST and FashionMNIST datasets on the problem of detecting adversarial examples on different MLP architectures and several adversarial attack methods. Our experiments can be reproduced with our \hrefthis https URLpublicly available repository.

[LG-70] Convergence of Distributed Adaptive Optimization with Local Updates

链接: https://arxiv.org/abs/2409.13155
作者: Ziheng Cheng,Margalit Glasgow
关键词-EN: distributed adaptive algorithms, study distributed adaptive, local updates, local, intermittent communication
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that \em Local SGD \em with momentum (\em Local \em SGDM) and \em Local \em Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial but challenging step to show the advantages of local updates, under generalized smoothness assumption and gradient clipping.

[LG-71] Score-Based Multibeam Point Cloud Denoising

链接: https://arxiv.org/abs/2409.13143
作者: Li Ling,Yiping Xie,Nils Bore,John Folkesson
关键词-EN: Multibeam echo-sounder, cheaper MBES sensors, MBES, de-facto sensor, bathymetry mapping
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to 2024 IEEE OES AUV Symposium

点击查看摘要

Abstract:Multibeam echo-sounder (MBES) is the de-facto sensor for bathymetry mapping. In recent years, cheaper MBES sensors and global mapping initiatives have led to exponential growth of available data. However, raw MBES data contains 1-25% of noise that requires semi-automatic filtering using tools such as Combined Uncertainty and Bathymetric Estimator (CUBE). In this work, we draw inspirations from the 3D point cloud community and adapted a score-based point cloud denoising network for MBES outlier detection and denoising. We trained and evaluated this network on real MBES survey data. The proposed method was found to outperform classical methods, and can be readily integrated into existing MBES standard workflow. To facilitate future research, the code and pretrained model are available online.

[LG-72] Learning to Compare Hardware Designs for High-Level Synthesis

链接: https://arxiv.org/abs/2409.13138
作者: Yunsheng Bai,Atefeh Sohrabizadeh,Zijian Ding,Rongjian Liang,Weikai Li,Ding Wang,Haoxing Ren,Yizhou Sun,Jason Cong
关键词-EN: transforms high-level code, transforms high-level, High-level synthesis, high-level code, automated design process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Published in MLCAD 2024

点击查看摘要

Abstract:High-level synthesis (HLS) is an automated design process that transforms high-level code into hardware designs, enabling the rapid development of hardware accelerators. HLS relies on pragmas, which are directives inserted into the source code to guide the synthesis process, and pragmas have various settings and values that significantly impact the resulting hardware design. State-of-the-art ML-based HLS methods, such as HARP, first train a deep learning model, typically based on graph neural networks (GNNs) applied to graph-based representations of the source code and pragmas. They then perform design space exploration (DSE) to explore the pragma design space, rank candidate designs using the model, and return the top designs. However, traditional DSE methods face challenges due to the highly nonlinear relationship between pragma settings and performance metrics, along with complex interactions between pragmas that affect performance in non-obvious ways. To address these challenges, we propose compareXplore, a novel approach that learns to compare hardware designs for effective HLS optimization. CompareXplore introduces a hybrid loss function that combines pairwise preference learning with pointwise performance prediction, enabling the model to capture both relative preferences and absolute performance. Moreover, we introduce a novel node difference attention module that focuses on the most informative differences between designs, enabling the model to identify critical pragmas impacting performance. CompareXplore adopts a two-stage DSE, where a pointwise prediction model is used for the initial design pruning, followed by a pairwise comparison stage for precise performance verification. In extensive experiments, compareXplore achieves significant improvements in ranking metrics and generates high-quality HLS results for the selected designs, outperforming the existing SOTA method. Comments: Published in MLCAD 2024 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2409.13138 [cs.LG] (or arXiv:2409.13138v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13138 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD (MLCAD '24), ACM, 2024, Article 2, 1-7 Related DOI: https://doi.org/10.1145/3670474.3685940 Focus to learn more DOI(s) linking to related resources

[LG-73] Federated Learning with Label-Masking Distillation ACM-MM2023

链接: https://arxiv.org/abs/2409.13136
作者: Jianghu Lu,Shikun Li,Kexin Bao,Pengju Wang,Zhenxing Qian,Shiming Ge
关键词-EN: Federated learning, collaboratively train models, multiple local clients, privacy-preserving manner, manner to collaboratively
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM MM 2023

点击查看摘要

Abstract:Federated learning provides a privacy-preserving manner to collaboratively train models on data distributed over multiple local clients via the coordination of a global server. In this paper, we focus on label distribution skew in federated learning, where due to the different user behavior of the client, label distributions between different clients are significantly different. When faced with such cases, most existing methods will lead to a suboptimal optimization due to the inadequate utilization of label distribution information in clients. Inspired by this, we propose a label-masking distillation approach termed FedLMD to facilitate federated learning via perceiving the various label distributions of each client. We classify the labels into majority and minority labels based on the number of examples per class during training. The client model learns the knowledge of majority labels from local data. The process of distillation masks out the predictions of majority labels from the global model, so that it can focus more on preserving the minority label knowledge of the client. A series of experiments show that the proposed approach can achieve state-of-the-art performance in various cases. Moreover, considering the limited resources of the clients, we propose a variant FedLMD-Tf that does not require an additional teacher, which outperforms previous lightweight approaches without increasing computational costs. Our code is available at this https URL.

[LG-74] CorBin-FL: A Differentially Private Federated Learning Mechanism using Common Randomness

链接: https://arxiv.org/abs/2409.13133
作者: Hojat Allah Salehi,Md Jueal Mia,S. Sandeep Pradhan,M. Hadi Amini,Farhad Shirani
关键词-EN: Federated learning, distributed machine learning, promising framework, privacy, Federated
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising framework for distributed machine learning. It enables collaborative learning among multiple clients, utilizing distributed data and computing resources. However, FL faces challenges in balancing privacy guarantees, communication efficiency, and overall model accuracy. In this work, we introduce CorBin-FL, a privacy mechanism that uses correlated binary stochastic quantization to achieve differential privacy while maintaining overall model accuracy. The approach uses secure multi-party computation techniques to enable clients to perform correlated quantization of their local model updates without compromising individual privacy. We provide theoretical analysis showing that CorBin-FL achieves parameter-level local differential privacy (PLDP), and that it asymptotically optimizes the privacy-utility trade-off between the mean square error utility measure and the PLDP privacy measure. We further propose AugCorBin-FL, an extension that, in addition to PLDP, achieves user-level and sample-level central differential privacy guarantees. For both mechanisms, we derive bounds on privacy parameters and mean squared error performance measures. Extensive experiments on MNIST and CIFAR10 datasets demonstrate that our mechanisms outperform existing differentially private FL mechanisms, including Gaussian and Laplacian mechanisms, in terms of model accuracy under equal PLDP privacy budgets.

[LG-75] Disentangling Recognition and Decision Regrets in Image-Based Reinforcement Learning

链接: https://arxiv.org/abs/2409.13108
作者: Alihan Hüyük,Arndt Ryo Koblitz,Atefeh Mohajeri,Matthew Andrews
关键词-EN: taking actions based, image-based reinforcement learning, extracting lower-dimensional features, reinforcement learning, policies usually operate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In image-based reinforcement learning (RL), policies usually operate in two steps: first extracting lower-dimensional features from raw images (the “recognition” step), and then taking actions based on the extracted features (the “decision” step). Extracting features that are spuriously correlated with performance or irrelevant for decision-making can lead to poor generalization performance, known as observational overfitting in image-based RL. In such cases, it can be hard to quantify how much of the error can be attributed to poor feature extraction vs. poor decision-making. In order to disentangle the two sources of error, we introduce the notions of recognition regret and decision regret. Using these notions, we characterize and disambiguate the two distinct causes behind observational overfitting: over-specific representations, which include features that are not needed for optimal decision-making (leading to high decision regret), vs. under-specific representations, which only include a limited set of features that were spuriously correlated with performance during training (leading to high recognition regret). Finally, we provide illustrative examples of observational overfitting due to both over-specific and under-specific representations in maze environments as well as the Atari game Pong.

[LG-76] ERIC: Estimating Rainfall with Commodity Doorbell Camera for Precision Residential Irrigation

链接: https://arxiv.org/abs/2409.13104
作者: Tian Liu,Liuyi Jin,Radu Stoleru,Amran Haroon,Charles Swanson,Kexin Feng
关键词-EN: nearby weather stations, adjust irrigation amounts, nearby weather, weather stations, stations to adjust
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: BuildSys 2024

点击查看摘要

Abstract:Current state-of-the-art residential irrigation systems, such as WaterMyYard, rely on rainfall data from nearby weather stations to adjust irrigation amounts. However, the accuracy of rainfall data is compromised by the limited spatial resolution of rain gauges and the significant variability of hyperlocal rainfall, leading to substantial water waste. To improve irrigation efficiency, we developed a cost-effective irrigation system, dubbed ERIC, which employs machine learning models to estimate rainfall from commodity doorbell camera footage and optimizes irrigation schedules without human intervention. Specifically, we: a) designed novel visual and audio features with lightweight neural network models to infer rainfall from the camera at the edge, preserving user privacy; b) built a complete end-to-end irrigation system on Raspberry Pi 4, costing only 75. We deployed the system across five locations (collecting over 750 hours of video) with varying backgrounds and light conditions. Comprehensive evaluation validates that ERIC achieves state-of-the-art rainfall estimation performance (~ 5mm/day), saving 9,112 gallons/month of water, translating to 28.56/month in utility savings.

[LG-77] Predicting soccer matches with complex networks and machine learning

链接: https://arxiv.org/abs/2409.13098
作者: Eduardo Alves Baratela,Felipe Jordão Xavier,Thomas Peron,Paulino Ribeiro Villas-Boas,Francisco Aparecido Rodrigues
关键词-EN: attracts the attention, researchers and professionals, sports industry, match statistics, sports prediction industries
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: To appear in Journal of Complex Networks

点击查看摘要

Abstract:Soccer attracts the attention of many researchers and professionals in the sports industry. Therefore, the incorporation of science into the sport is constantly growing, with increasing investments in performance analysis and sports prediction industries. This study aims to (i) highlight the use of complex networks as an alternative tool for predicting soccer match outcomes, and (ii) show how the combination of structural analysis of passing networks with match statistical data can provide deeper insights into the game patterns and strategies used by teams. In order to do so, complex network metrics and match statistics were used to build machine learning models that predict the wins and losses of soccer teams in different leagues. The results showed that models based on passing networks were as effective as ``traditional’’ models, which use general match statistics. Another finding was that by combining both approaches, more accurate models were obtained than when they were used separately, demonstrating that the fusion of such approaches can offer a deeper understanding of game patterns, allowing the comprehension of tactics employed by teams relationships between players, their positions, and interactions during matches. It is worth mentioning that both network metrics and match statistics were important and impactful for the mixed model. Furthermore, the use of networks with a lower granularity of temporal evolution (such as creating a network for each half of the match) performed better than a single network for the entire game.

[LG-78] Fast decision tree learning solves hard coding-theoretic problems

链接: https://arxiv.org/abs/2409.13096
作者: Caleb Koch,Carmen Strassle,Caleb Koch
关键词-EN: parameterized Nearest Codeword, Nearest Codeword Problem, Nearest Codeword, parameterized Nearest, PAC learning decision
类目: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: 31 pages, FOCS 2024

点击查看摘要

Abstract:We connect the problem of properly PAC learning decision trees to the parameterized Nearest Codeword Problem ( k -NCP). Despite significant effort by the respective communities, algorithmic progress on both problems has been stuck: the fastest known algorithm for the former runs in quasipolynomial time (Ehrenfeucht and Haussler 1989) and the best known approximation ratio for the latter is O(n/\log n) (Berman and Karpinsky 2002; Alon, Panigrahy, and Yekhanin 2009). Research on both problems has thus far proceeded independently with no known connections. We show that \textitany improvement of Ehrenfeucht and Haussler’s algorithm will yield O(\log n) -approximation algorithms for k -NCP, an exponential improvement of the current state of the art. This can be interpreted either as a new avenue for designing algorithms for k -NCP, or as one for establishing the optimality of Ehrenfeucht and Haussler’s algorithm. Furthermore, our reduction along with existing inapproximability results for k -NCP already rule out polynomial-time algorithms for properly learning decision trees. A notable aspect of our hardness results is that they hold even in the setting of \textitweak learning whereas prior ones were limited to the setting of strong learning. Comments: 31 pages, FOCS 2024 Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2409.13096 [cs.CC] (or arXiv:2409.13096v1 [cs.CC] for this version) https://doi.org/10.48550/arXiv.2409.13096 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-79] Personalized Speech Recognition for Children with Test-Time Adaptation

链接: https://arxiv.org/abs/2409.13095
作者: Zhonghao Shi,Harshvardhan Srivastava,Xuan Shi,Shrikanth Narayanan,Maja J. Matarić
关键词-EN: Accurate automatic speech, real-time child-AI interaction, effective real-time child-AI, Accurate automatic, ASR
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Accurate automatic speech recognition (ASR) for children is crucial for effective real-time child-AI interaction, especially in educational applications. However, off-the-shelf ASR models primarily pre-trained on adult data tend to generalize poorly to children’s speech due to the data domain shift from adults to children. Recent studies have found that supervised fine-tuning on children’s speech data can help bridge this domain shift, but human annotations may be impractical to obtain for real-world applications and adaptation at training time can overlook additional domain shifts occurring at test time. We devised a novel ASR pipeline to apply unsupervised test-time adaptation (TTA) methods for child speech recognition, so that ASR models pre-trained on adult speech can be continuously adapted to each child speaker at test time without further human annotations. Our results show that ASR models adapted with TTA methods significantly outperform the unadapted off-the-shelf ASR baselines both on average and statistically across individual child speakers. Our analysis also discovered significant data domain shifts both between child speakers and within each child speaker, which further motivates the need for test-time adaptation.

[LG-80] Embedding Geometries of Contrastive Language-Image Pre-Training ECCV2024

链接: https://arxiv.org/abs/2409.13079
作者: Jason Chuan-Chih Chou,Nahid Alam
关键词-EN: InfoNCE loss, loss for contrastive, widely popular, popular for bridging, CLIP
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024 - Beyond Euclidean Workshop

点击查看摘要

Abstract:Since the publication of CLIP, the approach of using InfoNCE loss for contrastive pre-training has become widely popular for bridging two or more modalities. Despite its wide adoption, CLIP’s original design choices of L2 normalization and cosine similarity logit have rarely been revisited. We have systematically experimented with alternative geometries and softmax logits for language-image pre-training and identified that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as more complicated hyperbolic alternative.

[LG-81] What does guidance do? A fine-grained analysis in a simple setting

链接: https://arxiv.org/abs/2409.13074
作者: Muthu Chidambaram,Khashayar Gatmiry,Sitan Chen,Holden Lee,Jianfeng Lu
关键词-EN: conditional likelihood raised, data distribution tilted, originally motivated, likelihood raised, guidance
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment. Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML) Cite as: arXiv:2409.13074 [cs.LG] (or arXiv:2409.13074v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2409.13074 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-82] Improved Image Classification with Manifold Neural Networks

链接: https://arxiv.org/abs/2409.13063
作者: Caio F. Deberaldini Netto,Zhiyang Wang,Luana Ruiz
关键词-EN: transportation systems, molecular biology, electrical grids, gained popularity, successful applications
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have gained popularity in various learning tasks, with successful applications in fields like molecular biology, transportation systems, and electrical grids. These fields naturally use graph data, benefiting from GNNs’ message-passing framework. However, the potential of GNNs in more general data representations, especially in the image domain, remains underexplored. Leveraging the manifold hypothesis, which posits that high-dimensional data lies in a low-dimensional manifold, we explore GNNs’ potential in this context. We construct an image manifold using variational autoencoders, then sample the manifold to generate graphs where each node is an image. This approach reduces data dimensionality while preserving geometric information. We then train a GNN to predict node labels corresponding to the image labels in the classification task, and leverage convergence of GNNs to manifold neural networks to analyze GNN generalization. Experiments on MNIST and CIFAR10 datasets demonstrate that GNNs generalize effectively to unseen graphs, achieving competitive accuracy in classification tasks.

[LG-83] Comprehensive Overview of Artificial Intelligence Applications in Modern Industries

链接: https://arxiv.org/abs/2409.13059
作者: Yijie Weng,Jianhao Wu,Tara Kelly,William Johnson
关键词-EN: enhancing decision-making processes, Artificial Intelligence, optimizing operations, decision-making processes, opportunities for innovation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) is fundamentally reshaping various industries by enhancing decision-making processes, optimizing operations, and unlocking new opportunities for innovation. This paper explores the applications of AI across four key sectors: healthcare, finance, manufacturing, and retail. Each section delves into the specific challenges faced by these industries, the AI technologies employed to address them, and the measurable impact on business outcomes and societal welfare. We also discuss the implications of AI integration, including ethical considerations, the future trajectory of AI development, and its potential to drive economic growth while posing challenges that need to be managed responsibly.

[LG-84] LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models

链接: https://arxiv.org/abs/2409.13054
作者: Akshaj Kumar Veldanda,Shi-Xiong Zhang,Anirban Das,Supriyo Chakraborty,Stephen Rawls,Sambit Sahu,Milind Naphade
关键词-EN: Large language models, Large language, problematic knowledge embedded, revolutionized various domains, embedded during pretraining
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized various domains, yet their utility comes with significant challenges related to outdated or problematic knowledge embedded during pretraining. This paper addresses the challenge of modifying LLMs to unlearn problematic and outdated information while efficiently integrating new knowledge without retraining from scratch. Here, we propose LLM Surgery, a framework to efficiently modify LLM behaviour by optimizing a three component objective function that: (1) Performs reverse gradient on unlearning dataset (problematic and outdated information), (2) Performs gradient descent on the update dataset (new and updated information), and (3) Minimizes the KL divergence on the retain dataset (small subset of unchanged text), ensuring alignment between pretrained and modified model outputs. Due to the lack of publicly available datasets specifically tailored for our novel task, we compiled a new dataset and an evaluation benchmark. Using Llama2-7B, we demonstrate that LLM Surgery can achieve significant forgetting on the unlearn set, a 20% increase in accuracy on the update set, and maintain performance on the retain set.

[LG-85] owards Unbiased Evaluation of Time-series Anomaly Detector

链接: https://arxiv.org/abs/2409.13053
作者: Debarpan Bhattacharya,Sumanta Mukherjee,Chandramouli Kamanchi,Vijay Ekambaram,Arindam Jati,Pankaj Dayama
关键词-EN: detecting seismic activity, series anomaly detection, critical applications, seismic activity, sensor failures
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 5 pages, 6 figures

点击查看摘要

Abstract:Time series anomaly detection (TSAD) is an evolving area of research motivated by its critical applications, such as detecting seismic activity, sensor failures in industrial plants, predicting crashes in the stock market, and so on. Across domains, anomalies occur significantly less frequently than normal data, making the F1-score the most commonly adopted metric for anomaly detection. However, in the case of time series, it is not straightforward to use standard F1-score because of the dissociation between time points' and time events’. To accommodate this, anomaly predictions are adjusted, called as point adjustment (PA), before the F_1 -score evaluation. However, these adjustments are heuristics-based, and biased towards true positive detection, resulting in over-estimated detector performance. In this work, we propose an alternative adjustment protocol called ``Balanced point adjustment’’ (BA). It addresses the limitations of existing point adjustment methods and provides guarantees of fairness backed by axiomatic definitions of TSAD evaluation.

[LG-86] ACE: Tumor-Aware Counterfactual Explanations

链接: https://arxiv.org/abs/2409.13045
作者: Eleonora Beatrice Rossi,Eleonora Lopez,Danilo Comminiello
关键词-EN: advanced diagnostic capabilities, significantly advanced diagnostic, diagnostic capabilities, enhancing both accuracy, accuracy and efficiency
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: The paper has been accepted at Italian Workshop on Neural Networks (WIRN) 2024

点击查看摘要

Abstract:The application of deep learning in medical imaging has significantly advanced diagnostic capabilities, enhancing both accuracy and efficiency. Despite these benefits, the lack of transparency in these AI models, often termed “black boxes,” raises concerns about their reliability in clinical settings. Explainable AI (XAI) aims to mitigate these concerns by developing methods that make AI decisions understandable and trustworthy. In this study, we propose Tumor Aware Counterfactual Explanations (TACE), a framework designed to generate reliable counterfactual explanations for medical images. Unlike existing methods, TACE focuses on modifying tumor-specific features without altering the overall organ structure, ensuring the faithfulness of the counterfactuals. We achieve this by including an additional step in the generation process which allows to modify only the region of interest (ROI), thus yielding more reliable counterfactuals as the rest of the organ remains unchanged. We evaluate our method on mammography images and brain MRI. We find that our method far exceeds existing state-of-the-art techniques in quality, faithfulness, and generation speed of counterfactuals. Indeed, more faithful explanations lead to a significant improvement in classification success rates, with a 10.69% increase for breast cancer and a 98.02% increase for brain tumors. The code of our work is available at this https URL.

[LG-87] ACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning COLING2025

链接: https://arxiv.org/abs/2409.13035
作者: Shivam Shandilya,Menglin Xia,Supriyo Ghosh,Huiqiang Jiang,Jue Zhang,Qianhui Wu,Victor Rühle
关键词-EN: large language models, leading to challenges, computational efficiency, increasing prevalence, prevalence of large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Submitted to COLING 2025

点击查看摘要

Abstract:The increasing prevalence of large language models (LLMs) such as GPT-4 in various applications has led to a surge in the size of prompts required for optimal performance, leading to challenges in computational efficiency. Prompt compression aims to reduce the inference cost by minimizing input tokens without compromising on the task performance. However, existing prompt compression techniques either rely on sub-optimal metrics such as information entropy or model it as a task-agnostic token classification problem that fails to capture task-specific information. To address these issues, we propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. To ensure low latency requirements, we leverage existing Transformer encoder-based token classification model while guiding the learning process with task-specific reward signals using lightweight REINFORCE algorithm. We evaluate the performance of our method on three diverse and challenging tasks including text summarization, question answering and code summarization. We demonstrate that our RL-guided compression method improves the task performance by 8% - 260% across these three scenarios over state-of-the-art compression techniques while satisfying the same compression rate and latency requirements.

[LG-88] Cost: A Novel Instance Complexity Based Cost-Sensitive Learning Framework for Imbalanced Classification

链接: https://arxiv.org/abs/2409.13007
作者: Asif Newaz,Asif Ur Rahman Adib,Taskeed Jabid
关键词-EN: data presents significant, presents significant challenges, imbalance in data, data presents, Class imbalance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Class imbalance in data presents significant challenges for classification tasks. It is fairly common and requires careful handling to obtain desirable performance. Traditional classification algorithms become biased toward the majority class. One way to alleviate the scenario is to make the classifiers cost-sensitive. This is achieved by assigning a higher misclassification cost to minority-class instances. One issue with this implementation is that all the minority-class instances are treated equally, and assigned with the same penalty value. However, the learning difficulties of all the instances are not the same. Instances that are located near the decision boundary are harder to classify, whereas those further away are easier. Without taking into consideration the instance complexity and naively weighting all the minority-class samples uniformly, results in an unwarranted bias and consequently, a higher number of misclassifications of the majority-class instances. This is undesirable and to overcome the situation, we propose a novel instance complexity-based cost-sensitive approach in this study. We first categorize all the minority-class instances based on their difficulty level and then the instances are penalized accordingly. This ensures a more equitable instance weighting and prevents excessive penalization. The performance of the proposed approach is tested on 66 imbalanced datasets against the traditional cost-sensitive learning frameworks and a significant improvement in performance is noticeable, demonstrating the effectiveness of our method.

[LG-89] Data Poisoning and Leakage Analysis in Federated Learning

链接: https://arxiv.org/abs/2409.13004
作者: Wenqi Wei,Tiansheng Huang,Zachary Yahn,Anoop Singhal,Margaret Loper,Ling Liu
关键词-EN: training data, training data privacy, Data, Data poisoning, training data poisoning
类目: Machine Learning (cs.LG)
*备注: Chapter of Handbook of Trustworthy Federated Learning

点击查看摘要

Abstract:Data poisoning and leakage risks impede the massive deployment of federated learning in the real world. This chapter reveals the truths and pitfalls of understanding two dominating threats: \em training data privacy intrusion and \em training data poisoning. We first investigate training data privacy threat and present our observations on when and how training data may be leaked during the course of federated training. One promising defense strategy is to perturb the raw gradient update by adding some controlled randomized noise prior to sharing during each round of federated learning. We discuss the importance of determining the proper amount of randomized noise and the proper location to add such noise for effective mitigation of gradient leakage threats against training data privacy. Then we will review and compare different training data poisoning threats and analyze why and when such data poisoning induced model Trojan attacks may lead to detrimental damage on the performance of the global model. We will categorize and compare representative poisoning attacks and the effectiveness of their mitigation techniques, delivering an in-depth understanding of the negative impact of data poisoning. Finally, we demonstrate the potential of dynamic model perturbation in simultaneously ensuring privacy protection, poisoning resilience, and model performance. The chapter concludes with a discussion on additional risk factors in federated learning, including the negative impact of skewness, data and algorithmic biases, as well as misinformation in training data. Powered by empirical evidence, our analytical study offers some transformative insights into effective privacy protection and security assurance strategies in attack-resilient federated learning.

[LG-90] Introducing the Large Medical Model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences

链接: https://arxiv.org/abs/2409.13000
作者: Ricky Sahu,Eric Marriott,Ethan Siegel,David Wagner,Flore Uzan,Troy Yang,Asim Javed
关键词-EN: NHE Fact Sheet, NHE Fact, Fact Sheet, healthcare spending approaching, optimal patient care
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 10 pages, 10 figures

点击查看摘要

Abstract:With U.S. healthcare spending approaching 5T (NHE Fact Sheet 2024), and 25% of it estimated to be wasteful (Waste in the US the health care system: estimated costs and potential for savings, n.d.), the need to better predict risk and optimal patient care is evermore important. This paper introduces the Large Medical Model (LMM), a generative pre-trained transformer (GPT) designed to guide and predict the broad facets of patient care and healthcare administration. The model is trained on medical event sequences from over 140M longitudinal patient claims records with a specialized vocabulary built from medical terminology systems and demonstrates a superior capability to forecast healthcare costs and identify potential risk factors. Through experimentation and validation, we showcase the LMM’s proficiency in not only in cost and risk predictions, but also in discerning intricate patterns within complex medical conditions and an ability to identify novel relationships in patient care. The LMM is able to improve both cost prediction by 14.1% over the best commercial models and chronic conditions prediction by 1.9% over the best transformer models in research predicting a broad set of conditions. The LMM is a substantial advancement in healthcare analytics, offering the potential to significantly enhance risk assessment, cost management, and personalized medicine.

[LG-91] VCAT: Vulnerability-aware and Curiosity-driven Adversarial Training for Enhancing Autonomous Vehicle Robustness

链接: https://arxiv.org/abs/2409.12997
作者: Xuan Cai,Zhiyong Cui,Xuesong Bai,Ruimin Ke,Zhenshu Ma,Haiyang Yu,Yilong Ren
关键词-EN: face significant threats, Autonomous vehicles, face significant, significant threats, safe operation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures, conference

点击查看摘要

Abstract:Autonomous vehicles (AVs) face significant threats to their safe operation in complex traffic environments. Adversarial training has emerged as an effective method of enabling AVs to preemptively fortify their robustness against malicious attacks. Train an attacker using an adversarial policy, allowing the AV to learn robust driving through interaction with this attacker. However, adversarial policies in existing methodologies often get stuck in a loop of overexploiting established vulnerabilities, resulting in poor improvement for AVs. To overcome the limitations, we introduce a pioneering framework termed Vulnerability-aware and Curiosity-driven Adversarial Training (VCAT). Specifically, during the traffic vehicle attacker training phase, a surrogate network is employed to fit the value function of the AV victim, providing dense information about the victim’s inherent vulnerabilities. Subsequently, random network distillation is used to characterize the novelty of the environment, constructing an intrinsic reward to guide the attacker in exploring unexplored territories. In the victim defense training phase, the AV is trained in critical scenarios in which the pretrained attacker is positioned around the victim to generate attack behaviors. Experimental results revealed that the training methodology provided by VCAT significantly improved the robust control capabilities of learning-based AVs, outperforming both conventional training modalities and alternative reinforcement learning counterparts, with a marked reduction in crash rates. The code is available at this https URL.

[LG-92] pyrtklib: An open-source package for tightly coupled deep learning and GNSS integration for positioning in urban canyons

链接: https://arxiv.org/abs/2409.12996
作者: Runzhi Hu,Penghui Xu,Yihan Zhong,Weisong Wen
关键词-EN: Global Navigation Satellite, Navigation Satellite Systems, intelligent transportation systems, Global Navigation, Navigation Satellite
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) is revolutionizing numerous fields, with increasing applications in Global Navigation Satellite Systems (GNSS) positioning algorithms in intelligent transportation systems (ITS) via deep learning. However, a significant technological disparity exists as traditional GNSS algorithms are often developed in Fortran or C, contrasting with the Python-based implementation prevalent in deep learning tools. To address this discrepancy, this paper introduces pyrtklib, a Python binding for the widely utilized open-source GNSS tool, RTKLIB. This binding makes all RTKLIB functionalities accessible in Python, facilitating seamless integration. Moreover, we present a deep learning subsystem under pyrtklib, which is a novel deep learning framework that leverages pyrtklib to accurately predict weights and biases within the GNSS positioning process. The use of pyrtklib enables developers to easily and quickly prototype and implement deep learning-aided GNSS algorithms, showcasing its potential to enhance positioning accuracy significantly.

[LG-93] Improving generalisability of 3D binding affinity models in low data regimes

链接: https://arxiv.org/abs/2409.12995
作者: Julia Buhmann,Ward Haddadin,Lukáš Pravda,Alan Bilsland,Hagen Triendl
关键词-EN: Predicting protein-ligand binding, computer-aided drug design, Predicting protein-ligand, binding affinity, protein-ligand binding affinity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 10 figues

点击查看摘要

Abstract:Predicting protein-ligand binding affinity is an essential part of computer-aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low data regimes. Despite the evolution of model architectures, current benchmarks are not well-suited to probe the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset, minimizing similarity leakage between train and test sets and allowing for a fair and direct comparison between various model architectures. On this low similarity split, we demonstrate that, in general, 3D global models are superior to protein-specific local models in low data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre-training via quantum mechanical data, unsupervised pre-training via small molecule diffusion, and explicitly modeling hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.

[LG-94] Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML

链接: https://arxiv.org/abs/2409.12994
作者: Chelsea Maria John,Stepan Nassyr,Carolin Penke,Andreas Herten
关键词-EN: specialized hardware accelerators, hardware accelerators designed, efficient model training, machine learning, technologies has driven
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: To be published in Workshop Proceedings of The International Conference for High Performance Computing Networking, Storage, and Analysis (SC-W '24) (2024)

点击查看摘要

Abstract:The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.

[LG-95] DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

链接: https://arxiv.org/abs/2409.12992
作者: Yang Chen,Yuhang Jia,Shiwan Zhao,Ziyue Jiang,Haoran Li,Jiarong Kang,Yong Qin
关键词-EN: free-text editing continues, unrestricted free-text editing, OOD text scenarios, increasingly prevalent, continues to grow
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce, DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and enhance the overall fluency of the edited speech. Experimental results demonstrate that our model achieves state-of-the-art performance in both in-domain and OOD text scenarios.

[LG-96] Semantic Meta-Split Learning: A TinyML Scheme for Few-Shot Wireless Image Classification

链接: https://arxiv.org/abs/2409.12978
作者: Eslam Eldeeb,Mohammad Shehab,Hirley Alves,Mohamed-Slim Alouini
关键词-EN: transmits significant information, emerging technology, transmits significant, significant information, SGO
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Semantic and goal-oriented (SGO) communication is an emerging technology that only transmits significant information for a given task. Semantic communication encounters many challenges, such as computational complexity at end users, availability of data, and privacy-preserving. This work presents a TinyML-based semantic communication framework for few-shot wireless image classification that integrates split-learning and meta-learning. We exploit split-learning to limit the computations performed by the end-users while ensuring privacy-preserving. In addition, meta-learning overcomes data availability concerns and speeds up training by utilizing similarly trained tasks. The proposed algorithm is tested using a data set of images of hand-written letters. In addition, we present an uncertainty analysis of the predictions using conformal prediction (CP) techniques. Simulation results show that the proposed Semantic-MSL outperforms conventional schemes by achieving 20 % gain on classification accuracy using fewer data points, yet less training energy consumption.

[LG-97] Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements Applications And Challenges

链接: https://arxiv.org/abs/2409.12977
作者: James E. Gallagher,Edward J. Oughton
关键词-EN: powerful tools supporting, tools supporting diverse, autonomous vehicles, infrastructure monitoring, environmental assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multispectral imaging and deep learning have emerged as powerful tools supporting diverse use cases from autonomous vehicles, to agriculture, infrastructure monitoring and environmental assessment. The combination of these technologies has led to significant advancements in object detection, classification, and segmentation tasks in the non-visible light spectrum. This paper considers 400 total papers, reviewing 200 in detail to provide an authoritative meta-review of multispectral imaging technologies, deep learning models, and their applications, considering the evolution and adaptation of You Only Look Once (YOLO) methods. Ground-based collection is the most prevalent approach, totaling 63% of the papers reviewed, although uncrewed aerial systems (UAS) for YOLO-multispectral applications have doubled since 2020. The most prevalent sensor fusion is Red-Green-Blue (RGB) with Long-Wave Infrared (LWIR), comprising 39% of the literature. YOLOv5 remains the most used variant for adaption to multispectral applications, consisting of 33% of all modified YOLO models reviewed. 58% of multispectral-YOLO research is being conducted in China, with broadly similar research quality to other countries (with a mean journal impact factor of 4.45 versus 4.36 for papers not originating from Chinese institutions). Future research needs to focus on (i) developing adaptive YOLO architectures capable of handling diverse spectral inputs that do not require extensive architectural modifications, (ii) exploring methods to generate large synthetic multispectral datasets, (iii) advancing multispectral YOLO transfer learning techniques to address dataset scarcity, and (iv) innovating fusion research with other sensor types beyond RGB and LWIR.

[LG-98] RACE: Transformer-based user Representations from Attributed Clickstream Event sequences RECSYS

链接: https://arxiv.org/abs/2409.12972
作者: William Black,Alexander Manlove,Jack Pennington,Andrea Marchini,Ercument Ilhan,Vilda Markeviciute
关键词-EN: intricate browsing patterns, span numerous sessions, period of time, process of researching, making a purchase
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: RecSys Workshop on Recommenders in Tourism (RecTour 2024), October 14th-18th, 2024, co-located with the 18th ACM Conference on Recommender Systems, Bari, Italy

点击查看摘要

Abstract:For users navigating travel e-commerce websites, the process of researching products and making a purchase often results in intricate browsing patterns that span numerous sessions over an extended period of time. The resulting clickstream data chronicle these user journeys and present valuable opportunities to derive insights that can significantly enhance personalized recommendations. We introduce TRACE, a novel transformer-based approach tailored to generate rich user embeddings from live multi-session clickstreams for real-time recommendation applications. Prior works largely focus on single-session product sequences, whereas TRACE leverages site-wide page view sequences spanning multiple user sessions to model long-term engagement. Employing a multi-task learning framework, TRACE captures comprehensive user preferences and intents distilled into low-dimensional representations. We demonstrate TRACE’s superior performance over vanilla transformer and LLM-style architectures through extensive experiments on a large-scale travel e-commerce dataset of real user journeys, where the challenges of long page-histories and sparse targets are particularly prevalent. Visualizations of the learned embeddings reveal meaningful clusters corresponding to latent user states and behaviors, highlighting TRACE’s potential to enhance recommendation systems by capturing nuanced user interactions and preferences

[LG-99] Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

链接: https://arxiv.org/abs/2409.12969
作者: Gracjan Góral,Alicja Ziarko,Michal Nauman,Maciej Wołczyk
关键词-EN: enables individuals, Visual perspective-taking, individuals to anticipate, anticipate the actions, Vision Language Models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots for testing VPT skills, and we use it to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find performance in object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at this https URL

[LG-100] Optical training of large-scale Transformers and deep neural networks with direct feedback alignment

链接: https://arxiv.org/abs/2409.12965
作者: Ziao Wang,Kilian Müller,Matthew Filipovich,Julien Launay,Ruben Ohana,Gustave Pariente,Safa Mokaadi,Charles Brossollet,Fabien Moreau,Alessandro Cappelli,Iacopo Poli,Igor Carron,Laurent Daudet,Florent Krzakala,Sylvain Gigan
关键词-EN: electronic hardware accelerators, dedicated electronic hardware, machine learning relies, hardware accelerators, relies nearly exclusively
类目: Emerging Technologies (cs.ET); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Optics (physics.optics)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:Modern machine learning relies nearly exclusively on dedicated electronic hardware accelerators. Photonic approaches, with low consumption and high operation speed, are increasingly considered for inference but, to date, remain mostly limited to relatively basic tasks. Simultaneously, the problem of training deep and complex neural networks, overwhelmingly performed through backpropagation, remains a significant limitation to the size and, consequently, the performance of current architectures and a major compute and energy bottleneck. Here, we experimentally implement a versatile and scalable training algorithm, called direct feedback alignment, on a hybrid electronic-photonic platform. An optical processing unit performs large-scale random matrix multiplications, which is the central operation of this algorithm, at speeds up to 1500 TeraOps. We perform optical training of one of the most recent deep learning architectures, including Transformers, with more than 1B parameters, and obtain good performances on both language and vision tasks. We study the compute scaling of our hybrid optical approach, and demonstrate a potential advantage for ultra-deep and wide neural networks, thus opening a promising route to sustain the exponential growth of modern artificial intelligence beyond traditional von Neumann approaches.

[LG-101] DARDA: Domain-Aware Real-Time Dynamic Neural Network Adaptation

链接: https://arxiv.org/abs/2409.09753
作者: Shahriar Rifat,Jonathan Ashdown,Francesco Restuccia
关键词-EN: Deep Neural Networks, Test Time Adaptation, Test Time, Neural Networks, Deep Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test Time Adaptation (TTA) has emerged as a practical solution to mitigate the performance degradation of Deep Neural Networks (DNNs) in the presence of corruption/ noise affecting inputs. Existing approaches in TTA continuously adapt the DNN, leading to excessive resource consumption and performance degradation due to accumulation of error stemming from lack of supervision. In this work, we propose Domain-Aware Real-Time Dynamic Adaptation (DARDA) to address such issues. Our key approach is to proactively learn latent representations of some corruption types, each one associated with a sub-network state tailored to correctly classify inputs affected by that corruption. After deployment, DARDA adapts the DNN to previously unseen corruptions in an unsupervised fashion by (i) estimating the latent representation of the ongoing corruption; (ii) selecting the sub-network whose associated corruption is the closest in the latent space to the ongoing corruption; and (iii) adapting DNN state, so that its representation matches the ongoing corruption. This way, DARDA is more resource efficient and can swiftly adapt to new distributions caused by different corruptions without requiring a large variety of input data. Through experiments with two popular mobile edge devices - Raspberry Pi and NVIDIA Jetson Nano - we show that DARDA reduces energy consumption and average cache memory footprint respectively by 1.74x and 2.64x with respect to the state of the art, while increasing the performance by 10.4%, 5.7% and 4.4% on CIFAR-10, CIFAR-100 and TinyImagenet.

[LG-102] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

链接: https://arxiv.org/abs/2408.04811
作者: Moussa Koulako Bala Doumbouya,Ananjan Nandi,Gabriel Poesia,Davide Ghilardi,Anna Goldie,Federico Bianchi,Dan Jurafsky,Christopher D. Manning
关键词-EN: critical concern due, Large Language Models, resist generating harmful, Large Language, jailbreak attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The safety of Large Language Models (LLMs) remains a critical concern due to a lack of adequate benchmarks for systematically evaluating their ability to resist generating harmful content. Previous efforts towards automated red teaming involve static or templated sets of illicit requests and adversarial prompts which have limited utility given jailbreak attacks’ evolving and composable nature. We propose a novel dynamic benchmark of composable jailbreak attacks to move beyond static datasets and taxonomies of attacks and harms. Our approach consists of three components collectively called h4rm3l: (1) a domain-specific language that formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives, (2) bandit-based few-shot program synthesis algorithms that generate novel attacks optimized to penetrate the safety filters of a target black box LLM, and (3) open-source automated red-teaming software employing the previous two components. We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs. Several of our synthesized attacks are more effective than previously reported ones, with Attack Success Rates exceeding 90% on SOTA closed language models such as claude-3-haiku and GPT4-o. By generating datasets of jailbreak attacks in a unified formal representation, h4rm3l enables reproducible benchmarking and automated red-teaming, contributes to understanding LLM safety limitations, and supports the development of robust defenses in an increasingly LLM-integrated world. Warning: This paper and related research artifacts contain offensive and potentially disturbing prompts and model-generated content. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG) MSC classes: 68 ACMclasses: I.2; I.2.0; I.2.1; I.2.5; I.2.7; K.6.5; K.4.2 Cite as: arXiv:2408.04811 [cs.CR] (or arXiv:2408.04811v2 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2408.04811 Focus to learn more arXiv-issued DOI via DataCite

[LG-103] Improved Unet brain tumor image segmentation based on GSConv module and ECA attention mechanism

链接: https://arxiv.org/abs/2409.13626
作者: Qiyuan Tian,Zhuoyue Wang,Xiaoling Cui
关键词-EN: medical image segmentation, deep learning algorithm, learning algorithm based, improved U-Net model, medical image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages; Accepted by CONF-CDS 2024 conference already

点击查看摘要

Abstract:An improved model of medical image segmentation for brain tumor is discussed, which is a deep learning algorithm based on U-Net architecture. Based on the traditional U-Net, we introduce GSConv module and ECA attention mechanism to improve the performance of the model in medical image segmentation tasks. With these improvements, the new U-Net model is able to extract and utilize multi-scale features more efficiently while flexibly focusing on important channels, resulting in significantly improved segmentation results. During the experiment, the improved U-Net model is trained and evaluated systematically. By looking at the loss curves of the training set and the test set, we find that the loss values of both rapidly decline to the lowest point after the eighth epoch, and then gradually converge and stabilize. This shows that our model has good learning ability and generalization ability. In addition, by monitoring the change in the mean intersection ratio (mIoU), we can see that after the 35th epoch, the mIoU gradually approaches 0.8 and remains stable, which further validates the model. Compared with the traditional U-Net, the improved version based on GSConv module and ECA attention mechanism shows obvious advantages in segmentation effect. Especially in the processing of brain tumor image edges, the improved model can provide more accurate segmentation results. This achievement not only improves the accuracy of medical image analysis, but also provides more reliable technical support for clinical diagnosis.

[LG-104] pAE: An Efficient Autoencoder Architecture for Modeling the Lateral Geniculate Nucleus by Integrating Feedforward and Feedback Streams in Human Visual System

链接: https://arxiv.org/abs/2409.13622
作者: Moslem Gorji,Amin Ranjbar,Mohammad Bagher Menhaj
关键词-EN: hierarchically identifying objects, visual cortex, visual, responsible for hierarchically, identifying objects
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 14 pages, 14 figures, and 1 table

点击查看摘要

Abstract:The visual cortex is a vital part of the brain, responsible for hierarchically identifying objects. Understanding the role of the lateral geniculate nucleus (LGN) as a prior region of the visual cortex is crucial when processing visual information in both bottom-up and top-down pathways. When visual stimuli reach the retina, they are transmitted to the LGN area for initial processing before being sent to the visual cortex for further processing. In this study, we introduce a deep convolutional model that closely approximates human visual information processing. We aim to approximate the function for the LGN area using a trained shallow convolutional model which is designed based on a pruned autoencoder (pAE) architecture. The pAE model attempts to integrate feed forward and feedback streams from/to the V1 area into the problem. This modeling framework encompasses both temporal and non-temporal data feeding modes of the visual stimuli dataset containing natural images captured by a fixed camera in consecutive frames, featuring two categories: images with animals (in motion), and images without animals. Subsequently, we compare the results of our proposed deep-tuned model with wavelet filter bank methods employing Gabor and biorthogonal wavelet functions. Our experiments reveal that the proposed method based on the deep-tuned model not only achieves results with high similarity in comparison with human benchmarks but also performs significantly better than other models. The pAE model achieves the final 99.26% prediction performance and demonstrates a notable improvement of around 28% over human results in the temporal mode.

[LG-105] Stimulus-to-Stimulus Learning in RNNs with Cortical Inductive Biases

链接: https://arxiv.org/abs/2409.13471
作者: Pantelis Vafidis,Antonio Rangel
关键词-EN: predict external contingencies, external contingencies, contingencies from experience, stimulus substitution, inductive bias
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 37 pages, 13 figures

点击查看摘要

Abstract:Animals learn to predict external contingencies from experience through a process of conditioning. A natural mechanism for conditioning is stimulus substitution, whereby the neuronal response to a stimulus with no prior behavioral significance becomes increasingly identical to that generated by a behaviorally significant stimulus it reliably predicts. We propose a recurrent neural network model of stimulus substitution which leverages two forms of inductive bias pervasive in the cortex: representational inductive bias in the form of mixed stimulus representations, and architectural inductive bias in the form of two-compartment pyramidal neurons that have been shown to serve as a fundamental unit of cortical associative learning. The properties of these neurons allow for a biologically plausible learning rule that implements stimulus substitution, utilizing only information available locally at the synapses. We show that the model generates a wide array of conditioning phenomena, and can learn large numbers of associations with an amount of training commensurate with animal experiments, without relying on parameter fine-tuning for each individual experimental task. In contrast, we show that commonly used Hebbian rules fail to learn generic stimulus-stimulus associations with mixed selectivity, and require task-specific parameter fine-tuning. Our framework highlights the importance of multi-compartment neuronal processing in the cortex, and showcases how it might confer cortical animals the evolutionary edge.

[LG-106] Classification of 4 types of White blood cell images

链接: https://arxiv.org/abs/2409.13442
作者: Rabia Asghar,Arslan Shaukat,Usman Akram,Rimsha Tariq
关键词-EN: white blood cells, blood cells, white blood, bacterial infections, blood
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human immune system contains white blood cells (WBC) that are good indicator of many diseases like bacterial infections, AIDS, cancer, spleen, etc. White blood cells have been sub classified into four types: monocytes, lymphocytes, eosinophils and neutrophils on the basis of their nucleus, shape and cytoplasm. Traditionally in laboratories, pathologists and hematologists analyze these blood cells through microscope and then classify them manually. This manual process takes more time and increases the chance of human error. Hence, there is a need to automate this process. In this paper, first we have used different CNN pre-train models such as ResNet-50, InceptionV3, VGG16 and MobileNetV2 to automatically classify the white blood cells. These pre-train models are applied on Kaggle dataset of microscopic images. Although we achieved reasonable accuracy ranging between 92 to 95%, still there is need to enhance the performance. Hence, inspired by these architectures, a framework has been proposed to automatically categorize the four kinds of white blood cells with increased accuracy. The aim is to develop a convolution neural network (CNN) based classification system with decent generalization ability. The proposed CNN model has been tested on white blood cells images from Kaggle and LISC datasets. Accuracy achieved is 99.57% and 98.67% for both datasets respectively. Our proposed convolutional neural network-based model provides competitive performance as compared to previous results reported in literature.

[LG-107] Differentially Private Multimodal Laplacian Dropout (DP-MLD) for EEG Representative Learning

链接: https://arxiv.org/abs/2409.13440
作者: Xiaowen Fu,Bingxin Wang,Xinzhou Guo,Guoqing Liu,Yang Xiang
关键词-EN: shown great promise, multimodal EEG, EEG, shown great, great promise
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, multimodal electroencephalogram (EEG) learning has shown great promise in disease detection. At the same time, ensuring privacy in clinical studies has become increasingly crucial due to legal and ethical concerns. One widely adopted scheme for privacy protection is differential privacy (DP) because of its clear interpretation and ease of implementation. Although numerous methods have been proposed under DP, it has not been extensively studied for multimodal EEG data due to the complexities of models and signal data considered there. In this paper, we propose a novel Differentially Private Multimodal Laplacian Dropout (DP-MLD) scheme for multimodal EEG learning. Our approach proposes a novel multimodal representative learning model that processes EEG data by language models as text and other modal data by vision transformers as images, incorporating well-designed cross-attention mechanisms to effectively extract and integrate cross-modal features. To achieve DP, we design a novel adaptive feature-level Laplacian dropout scheme, where randomness allocation and performance are dynamically optimized within given privacy budgets. In the experiment on an open-source multimodal dataset of Freezing of Gait (FoG) in Parkinson’s Disease (PD), our proposed method demonstrates an approximate 4% improvement in classification accuracy, and achieves state-of-the-art performance in multimodal EEG learning under DP.

[LG-108] Longitudinal Segmentation of MS Lesions via Temporal Difference Weighting MICCAI2024

链接: https://arxiv.org/abs/2409.13416
作者: Maximilian Rokuss,Yannick Kirchhoff,Saikat Roy,Balint Kovacs,Constantin Ulrich,Tassilo Wald,Maximilian Zenk,Stefan Denner,Fabian Isensee,Philipp Vollmuth,Jens Kleesiek,Klaus Maier-Hein
关键词-EN: Multiple Sclerosis, monitoring disease progression, treatment efficacy, Accurate segmentation, longitudinal MRI scans
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MICCAI 2024 LDTM

点击查看摘要

Abstract:Accurate segmentation of Multiple Sclerosis (MS) lesions in longitudinal MRI scans is crucial for monitoring disease progression and treatment efficacy. Although changes across time are taken into account when assessing images in clinical practice, most existing deep learning methods treat scans from different timepoints separately. Among studies utilizing longitudinal images, a simple channel-wise concatenation is the primary albeit suboptimal method employed to integrate timepoints. We introduce a novel approach that explicitly incorporates temporal differences between baseline and follow-up scans through a unique architectural inductive bias called Difference Weighting Block. It merges features from two timepoints, emphasizing changes between scans. We achieve superior scores in lesion segmentation (Dice Score, Hausdorff distance) as well as lesion detection (lesion-level F_1 score) as compared to state-of-the-art longitudinal and single timepoint models across two datasets. Our code is made publicly available at this http URL.

[LG-109] Hydrogen under Pressure as a Benchmark for Machine-Learning Interatomic Potentials

链接: https://arxiv.org/abs/2409.13390
作者: Thomas Bischoff,Bastian Jäckl,Matthias Rupp
关键词-EN: Machine-learning interatomic potentials, systems’ potential energy, potential energy surfaces, atomistic systems’ potential, data-driven surrogate models
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine-learning interatomic potentials (MLPs) are fast, data-driven surrogate models of atomistic systems’ potential energy surfaces that can accelerate ab-initio molecular dynamics (MD) simulations by several orders of magnitude. The performance of MLPs is commonly measured as the prediction error in energies and forces on data not used in their training. While low prediction errors on a test set are necessary, they do not guarantee good performance in MD simulations. The latter requires physically motivated performance measures obtained from running accelerated simulations. However, the adoption of such measures has been limited by the effort and domain knowledge required to calculate and interpret them. To overcome this limitation, we present a benchmark that automatically quantifies the performance of MLPs in MD simulations of a liquid-liquid phase transition in hydrogen under pressure, a challenging benchmark system. The benchmark’s h-llpt-24 dataset provides reference geometries, energies, forces, and stresses from density functional theory MD simulations at different temperatures and mass densities. The benchmark’s Python code automatically runs MLP-accelerated MD simulations and calculates, quantitatively compares and visualizes pressures, stable molecular fractions, diffusion coefficients, and radial distribution functions. Employing this benchmark, we show that several state-of-the-art MLPs fail to reproduce the liquid-liquid phase transition. Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG) Cite as: arXiv:2409.13390 [cond-mat.mtrl-sci] (or arXiv:2409.13390v1 [cond-mat.mtrl-sci] for this version) https://doi.org/10.48550/arXiv.2409.13390 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-110] Validity of Feature Importance in Low-Performing Machine Learning for Tabular Biomedical Data

链接: https://arxiv.org/abs/2409.13342
作者: Youngro Lee,Giacomo Baruzzo,Jeonghwan Kim,Jongmo Seo,Barbara Di Camillo
关键词-EN: feature, feature importance, feature cutting, cutting, data cutting
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In tabular biomedical data analysis, tuning models to high accuracy is considered a prerequisite for discussing feature importance, as medical practitioners expect the validity of feature importance to correlate with performance. In this work, we challenge the prevailing belief, showing that low-performing models may also be used for feature importance. We propose experiments to observe changes in feature rank as performance degrades sequentially. Using three synthetic datasets and six real biomedical datasets, we compare the rank of features from full datasets to those with reduced sample sizes (data cutting) or fewer features (feature cutting). In synthetic datasets, feature cutting does not change feature rank, while data cutting shows higher discrepancies with lower performance. In real datasets, feature cutting shows similar or smaller changes than data cutting, though some datasets exhibit the opposite. When feature interactions are controlled by removing correlations, feature cutting consistently shows better stability. By analyzing the distribution of feature importance values and theoretically examining the probability that the model cannot distinguish feature importance between features, we reveal that models can still distinguish feature importance despite performance degradation through feature cutting, but not through data cutting. We conclude that the validity of feature importance can be maintained even at low performance levels if the data size is adequate, which is a significant factor contributing to suboptimal performance in tabular medical data analysis. This paper demonstrates the potential for utilizing feature importance analysis alongside statistical analysis to compare features relatively, even when classifier performance is not satisfactory.

[LG-111] Deep Learning based Optical Image Super-Resolution via Generative Diffusion Models for Layerwise in-situ LPBF Monitoring

链接: https://arxiv.org/abs/2409.13171
作者: Francis Ogoke,Sumesh Kalambettu Suresh,Jesse Adamczyk,Dan Bolintineanu,Anthony Garland,Michael Heiden,Amir Barati Farimani
关键词-EN: Powder Bed Fusion, Laser Powder Bed, Bed Fusion, Laser Powder, Powder Bed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stochastic formation of defects during Laser Powder Bed Fusion (L-PBF) negatively impacts its adoption for high-precision use cases. Optical monitoring techniques can be used to identify defects based on layer-wise imaging, but these methods are difficult to scale to high resolutions due to cost and memory constraints. Therefore, we implement generative deep learning models to link low-cost, low-resolution images of the build plate to detailed high-resolution optical images of the build plate, enabling cost-efficient process monitoring. To do so, a conditional latent probabilistic diffusion model is trained to produce realistic high-resolution images of the build plate from low-resolution webcam images, recovering the distribution of small-scale features and surface roughness. We first evaluate the performance of the model by analyzing the reconstruction quality of the generated images using peak-signal-to-noise-ratio (PSNR), structural similarity index measure (SSIM) and wavelet covariance metrics that describe the preservation of high-frequency information. Additionally, we design a framework based upon the Segment Anything foundation model to recreate the 3D morphology of the printed part and analyze the surface roughness of the reconstructed samples. Finally, we explore the zero-shot generalization capabilities of the implemented framework to other part geometries by creating synthetic low-resolution data.

[LG-112] FaFeSort: A Fast and Few-shot End-to-end Neural Network for Multi-channel Spike Sorting

链接: https://arxiv.org/abs/2409.13067
作者: Yuntao Han,Shiwei Wang
关键词-EN: Decoding extracellular recordings, Decoding extracellular, brain-computer interfaces, crucial task, task in electrophysiology
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoding extracellular recordings is a crucial task in electrophysiology and brain-computer interfaces. Spike sorting, which distinguishes spikes and their putative neurons from extracellular recordings, becomes computationally demanding with the increasing number of channels in modern neural probes. To address the intensive workload and complex neuron interactions, we propose FaFeSort, an end-to-end neural network-based spike sorter with few-shot learning and parallelizable post-processing. Our framework reduces the required number of annotated spikes for training by 44% compared to training from scratch, achieving up to 25.68% higher accuracy. Additionally, our novel post-processing algorithm is compatible to the deep learning frameworks, making FaFeSort significantly faster than state-of-the-art spike sorters. On synthesized Neuropixels recordings, FaFeSort achieves comparable accuracy with Kilosort4 sorting 50 seconds of data in only 1.32 seconds. Our method demonstrates robustness across various probe geometries, noise levels, and drift conditions, offering a substantial improvement in both accuracy and runtime efficiency comparing to existing spike sorters.

[LG-113] Semi-overcomplete convolutional auto-encoder embedding as shape priors for deep vessel segmentation

链接: https://arxiv.org/abs/2409.13001
作者: Amine Sadikine,Bogdan Badic,Jean-Pierre Tasu,Vincent Noblet,Dimitris Visvikis,Pierre-Henri Conze
关键词-EN: medical image analysis, image analysis, recently experienced, experienced a widespread, widespread interest
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures, conference

点击查看摘要

Abstract:The extraction of blood vessels has recently experienced a widespread interest in medical image analysis. Automatic vessel segmentation is highly desirable to guide clinicians in computer-assisted diagnosis, therapy or surgical planning. Despite a good ability to extract large anatomical structures, the capacity of U-Net inspired architectures to automatically delineate vascular systems remains a major issue, especially given the scarcity of existing datasets. In this paper, we present a novel approach that integrates into deep segmentation shape priors from a Semi-Overcomplete Convolutional Auto-Encoder (S-OCAE) embedding. Compared to standard Convolutional Auto-Encoders (CAE), it exploits an over-complete branch that projects data onto higher dimensions to better characterize tiny structures. Experiments on retinal and liver vessel extraction, respectively performed on publicly-available DRIVE and 3D-IRCADb datasets, highlight the effectiveness of our method compared to U-Net trained without and with shape priors from a traditional CAE.

[LG-114] CMINNs: Compartment Model Informed Neural Networks – Unlocking Drug Dynamics

链接: https://arxiv.org/abs/2409.12998
作者: Nazanin Ahmadi Daryakenari,Shupeng Wang,George Karniadakis
关键词-EN: frequently encounter difficulties, drug development process, traditional models frequently, development process, impact on targets
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the field of pharmacokinetics and pharmacodynamics (PKPD) modeling, which plays a pivotal role in the drug development process, traditional models frequently encounter difficulties in fully encapsulating the complexities of drug absorption, distribution, and their impact on targets. Although multi-compartment models are frequently utilized to elucidate intricate drug dynamics, they can also be overly complex. To generalize modeling while maintaining simplicity, we propose an innovative approach that enhances PK and integrated PK-PD modeling by incorporating fractional calculus or time-varying parameter(s), combined with constant or piecewise constant parameters. These approaches effectively model anomalous diffusion, thereby capturing drug trapping and escape rates in heterogeneous tissues, which is a prevalent phenomenon in drug dynamics. Furthermore, this method provides insight into the dynamics of drug in cancer in multi-dose administrations. Our methodology employs a Physics-Informed Neural Network (PINN) and fractional Physics-Informed Neural Networks (fPINNs), integrating ordinary differential equations (ODEs) with integer/fractional derivative order from compartmental modeling with neural networks. This integration optimizes parameter estimation for variables that are time-variant, constant, piecewise constant, or related to the fractional derivative order. The results demonstrate that this methodology offers a robust framework that not only markedly enhances the model’s depiction of drug absorption rates and distributed delayed responses but also unlocks different drug-effect dynamics, providing new insights into absorption rates, anomalous diffusion, drug resistance, peristance and pharmacokinetic tolerance, all within a system of just two (fractional) ODEs with explainable results.

信息检索

[IR-0] Beauty Beyond Words: Explainable Beauty Product Recommendations Using Ingredient-Based Product Attributes

链接: https://arxiv.org/abs/2409.13628
作者: Siliang Liu,Rahul Suresh,Amin Banitalebi-Dehkordi
关键词-EN: Accurate attribute, trust with customers, Accurate attribute extraction, building trust, Accurate
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 18th ACM Conference on Recommender Systems, Workshop on Strategic and Utility-aware REcommendation

点击查看摘要

Abstract:Accurate attribute extraction is critical for beauty product recommendations and building trust with customers. This remains an open problem, as existing solutions are often unreliable and incomplete. We present a system to extract beauty-specific attributes using end-to-end supervised learning based on beauty product ingredients. A key insight to our system is a novel energy-based implicit model architecture. We show that this implicit model architecture offers significant benefits in terms of accuracy, explainability, robustness, and flexibility. Furthermore, our implicit model can be easily fine-tuned to incorporate additional attributes as they become available, making it more useful in real-world applications. We validate our model on a major e-commerce skincare product catalog dataset and demonstrate its effectiveness. Finally, we showcase how ingredient-based attribute extraction contributes to enhancing the explainability of beauty recommendations.

[IR-1] Advancing Event Causality Identification via Heuristic Semantic Dependency Inquiry Network

链接: https://arxiv.org/abs/2409.13621
作者: Haoran Li,Qiang Gao,Hongmei Wu,Li Huang
关键词-EN: Event Causality Identification, Causality Identification, Event Causality, focuses on extracting, Dependency Inquiry Network
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Event Causality Identification (ECI) focuses on extracting causal relations between events in texts. Existing methods for ECI primarily rely on causal features and external knowledge. However, these approaches fall short in two dimensions: (1) causal features between events in a text often lack explicit clues, and (2) external knowledge may introduce bias, while specific problems require tailored analyses. To address these issues, we propose SemDI - a simple and effective Semantic Dependency Inquiry Network for ECI. SemDI captures semantic dependencies within the context using a unified encoder. Then, it utilizes a Cloze Analyzer to generate a fill-in token based on comprehensive context understanding. Finally, this fill-in token is used to inquire about the causal relation between two events. Extensive experiments demonstrate the effectiveness of SemDI, surpassing state-of-the-art methods on three widely used benchmarks. Code is available at this https URL.

[IR-2] Data Augmentation for Sequential Recommendation: A Survey

链接: https://arxiv.org/abs/2409.13545
作者: Yizhou Dang,Enneng Yang,Yuting Liu,Guibing Guo,Linying Jiang,Jianzhe Zhao,Xingwei Wang
关键词-EN: sequential recommendation, recommender systems, real-world situations, essential branch, branch of recommender
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:As an essential branch of recommender systems, sequential recommendation (SR) has received much attention due to its well-consistency with real-world situations. However, the widespread data sparsity issue limits the SR model’s performance. Therefore, researchers have proposed many data augmentation (DA) methods to mitigate this phenomenon and have achieved impressive progress. In this survey, we provide a comprehensive review of DA methods for SR. We start by introducing the research background and motivation. Then, we categorize existing methodologies regarding their augmentation principles, objects, and purposes. Next, we present a comparative discussion of their advantages and disadvantages, followed by the exhibition and analysis of representative experimental results. Finally, we outline directions for future research and summarize this survey. We also maintain a repository with a paper list at \urlthis https URL.

[IR-3] A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering

链接: https://arxiv.org/abs/2409.13483
作者: Georgios Sidiropoulos,Evangelos Kanoulas
关键词-EN: Speech-based open-domain, open-domain question answering, Speech-based open-domain question, large corpus, increasing number
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Speech-based open-domain question answering (QA over a large corpus of text passages with spoken questions) has emerged as an important task due to the increasing number of users interacting with QA systems via speech interfaces. Passage retrieval is a key task in speech-based open-domain QA. So far, previous works adopted pipelines consisting of an automatic speech recognition (ASR) model that transcribes the spoken question before feeding it to a dense text retriever. Such pipelines have several limitations. The need for an ASR model limits the applicability to low-resource languages and specialized domains with no annotated speech data. Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. Our experimental results showed that, on shorter questions, our retriever is a promising alternative to the \textitASR and Retriever pipeline, achieving better retrieval performance in cases where ASR would have mistranscribed important words in the question or have produced a transcription with a high word error rate.

[IR-4] Procedure Model for Building Knowledge Graphs for Industry Applications

链接: https://arxiv.org/abs/2409.13425
作者: Sascha Meckler
关键词-EN: Enterprise knowledge graphs, Enterprise knowledge, graphs combine business, individuals and relationships, combine business data
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Enterprise knowledge graphs combine business data and organizational knowledge by means of a semantic network of concepts, properties, individuals and relationships. The graph-based integration of previously unconnected information with domain knowledge provides new insights and enables intelligent business applications. However, knowledge graph construction is a large investment which requires a joint effort of domain and technical experts. This paper presents a practical step-by-step procedure model for building an RDF knowledge graph that interconnects heterogeneous data and expert knowledge for an industry use case. The self-contained process adapts the “Cross Industry Standard Process for Data Mining” and uses competency questions throughout the entire development cycle. The procedure model starts with business and data understanding, describes tasks for ontology modeling and the graph setup, and ends with process steps for evaluation and deployment.

[IR-5] Contextual Compression in Retrieval-Augmented Generation for Large Language Models : A Survey

链接: https://arxiv.org/abs/2409.13385
作者: Sourav Verma
关键词-EN: Large Language Models, showcase remarkable abilities, Large Language, Language Models, showcase remarkable
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Ongoing Work

点击查看摘要

Abstract:Large Language Models (LLMs) showcase remarkable abilities, yet they struggle with limitations such as hallucinations, outdated knowledge, opacity, and inexplicable reasoning. To address these challenges, Retrieval-Augmented Generation (RAG) has proven to be a viable solution, leveraging external databases to improve the consistency and coherence of generated content, especially valuable for complex, knowledge-rich tasks, and facilitates continuous improvement by leveraging domain-specific insights. By combining the intrinsic knowledge of LLMs with the vast, dynamic repositories of external databases, RAG achieves a synergistic effect. However, RAG is not without its limitations, including a limited context window, irrelevant information, and the high processing overhead for extensive contextual data. In this comprehensive work, we explore the evolution of Contextual Compression paradigms, providing an in-depth examination of the field. Finally, we outline the current challenges and suggest potential research and development directions, paving the way for future advancements in this area.

[IR-6] More Clustering Quality Metrics for ABCDE

链接: https://arxiv.org/abs/2409.13376
作者: Stephan van Staden
关键词-EN: populations of items, ABCDE, quality metrics, quality, clustering
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:ABCDE is a technique for evaluating clusterings of very large populations of items. Given two clusterings, namely a Baseline clustering and an Experiment clustering, ABCDE can characterize their differences with impact and quality metrics, and thus help to determine which clustering to prefer. We previously described the basic quality metrics of ABCDE, namely the GoodSplitRate, BadSplitRate, GoodMergeRate, BadMergeRate and DeltaPrecision, and how to estimate them on the basis of human judgements. This paper extends that treatment with more quality metrics. It describes a technique that aims to characterize the DeltaRecall of the clustering change. It introduces a new metric, called IQ, to characterize the degree to which the clustering diff translates into an improvement in the quality. Ideally, a large diff would improve the quality by a large amount. Finally, this paper mentions ways to characterize the absolute Precision and Recall of a single clustering with ABCDE.

[IR-7] A Unified Causal Framework for Auditing Recommender Systems for Ethical Concerns

链接: https://arxiv.org/abs/2409.13210
作者: Vibhhu Sharma,Shantanu Gupta,Nil-Jana Akpinar,Zachary C. Lipton,Liu Leqi
关键词-EN: recommender systems, beliefs and preferences, widely deployed, recommender system auditing, Auditing recommender systems
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 28 pages

点击查看摘要

Abstract:As recommender systems become widely deployed in different domains, they increasingly influence their users’ beliefs and preferences. Auditing recommender systems is crucial as it not only ensures the continuous improvement of recommendation algorithms but also safeguards against potential issues like biases and ethical concerns. In this paper, we view recommender system auditing from a causal lens and provide a general recipe for defining auditing metrics. Under this general causal auditing framework, we categorize existing auditing metrics and identify gaps in them – notably, the lack of metrics for auditing user agency while accounting for the multi-step dynamics of the recommendation process. We leverage our framework and propose two classes of such metrics:future- and past-reacheability and stability, that measure the ability of a user to influence their own and other users’ recommendations, respectively. We provide both a gradient-based and a black-box approach for computing these metrics, allowing the auditor to compute them under different levels of access to the recommender system. In our experiments, we demonstrate the efficacy of methods for computing the proposed metrics and inspect the design of recommender systems through these proposed metrics.

[IR-8] RPAF: A Reinforcement Prediction-Allocation Framework for Cache Allocation in Large-Scale Recommender Systems

链接: https://arxiv.org/abs/2409.13175
作者: Shuo Su,Xiaoshuang Chen,Yao Wang,Yulin Wu,Ziqiang Zhang,Kaiqiao Zhan,Ben Wang,Kun Gai
关键词-EN: Modern recommender systems, perform real-time computation, Modern recommender, limited computational resources, computation-intensive infrastructure
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modern recommender systems are built upon computation-intensive infrastructure, and it is challenging to perform real-time computation for each request, especially in peak periods, due to the limited computational resources. Recommending by user-wise result caches is widely used when the system cannot afford a real-time recommendation. However, it is challenging to allocate real-time and cached recommendations to maximize the users’ overall engagement. This paper shows two key challenges to cache allocation, i.e., the value-strategy dependency and the streaming allocation. Then, we propose a reinforcement prediction-allocation framework (RPAF) to address these issues. RPAF is a reinforcement-learning-based two-stage framework containing prediction and allocation stages. The prediction stage estimates the values of the cache choices considering the value-strategy dependency, and the allocation stage determines the cache choices for each individual request while satisfying the global budget constraint. We show that the challenge of training RPAF includes globality and the strictness of budget constraints, and a relaxed local allocator (RLA) is proposed to address this issue. Moreover, a PoolRank algorithm is used in the allocation stage to deal with the streaming allocation problem. Experiments show that RPAF significantly improves users’ engagement under computational budget constraints.

[IR-9] RACE: Transformer-based user Representations from Attributed Clickstream Event sequences RECSYS

链接: https://arxiv.org/abs/2409.12972
作者: William Black,Alexander Manlove,Jack Pennington,Andrea Marchini,Ercument Ilhan,Vilda Markeviciute
关键词-EN: intricate browsing patterns, span numerous sessions, period of time, process of researching, making a purchase
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: RecSys Workshop on Recommenders in Tourism (RecTour 2024), October 14th-18th, 2024, co-located with the 18th ACM Conference on Recommender Systems, Bari, Italy

点击查看摘要

Abstract:For users navigating travel e-commerce websites, the process of researching products and making a purchase often results in intricate browsing patterns that span numerous sessions over an extended period of time. The resulting clickstream data chronicle these user journeys and present valuable opportunities to derive insights that can significantly enhance personalized recommendations. We introduce TRACE, a novel transformer-based approach tailored to generate rich user embeddings from live multi-session clickstreams for real-time recommendation applications. Prior works largely focus on single-session product sequences, whereas TRACE leverages site-wide page view sequences spanning multiple user sessions to model long-term engagement. Employing a multi-task learning framework, TRACE captures comprehensive user preferences and intents distilled into low-dimensional representations. We demonstrate TRACE’s superior performance over vanilla transformer and LLM-style architectures through extensive experiments on a large-scale travel e-commerce dataset of real user journeys, where the challenges of long page-histories and sparse targets are particularly prevalent. Visualizations of the learned embeddings reveal meaningful clusters corresponding to latent user states and behaviors, highlighting TRACE’s potential to enhance recommendation systems by capturing nuanced user interactions and preferences

附件下载

点击下载今日全部论文列表