This post presents the latest paper list retrieved from Arxiv.org on 2024-09-05, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically every morning at around 10:30.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically at around 10:30 every day.


Overview (2024-09-05)

A total of 372 papers were updated today, including:

  • Natural Language Processing: 53 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 95 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 99 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 118 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)

Link: https://arxiv.org/abs/2409.02920
Authors: Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo
Keywords: increasingly important areas, Effective collaboration, capabilities are increasingly, increasingly important, important areas
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL


Abstract: Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots’ ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present an innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at this https URL

[NLP-1] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling

Link: https://arxiv.org/abs/2409.02908
Authors: Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Keywords: language modeling tasks, popular research topic, discrete diffusion models, Masked diffusion models, diffusion models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 40 pages


Abstract: Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs’ original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20× speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs’ generation results in the previous literature.
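
The "inaccurate categorical sampling" above concerns how tokens are drawn from a softmax distribution during generation. As a rough, self-contained illustration (not the paper's code), the sketch below implements the standard Gumbel-max categorical sampler; the dtype argument mimics running the computation in 32-bit precision, the setting in which the authors report a lowered effective temperature.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator,
                      dtype=np.float32) -> int:
    """Draw one categorical sample via the Gumbel-max trick:
    argmax(logits + Gumbel noise) is distributed as softmax(logits)."""
    logits = logits.astype(dtype)
    u = rng.random(logits.shape)                  # U ~ Uniform(0, 1)
    gumbel = (-np.log(-np.log(u))).astype(dtype)  # Gumbel(0, 1) noise
    return int(np.argmax(logits + gumbel))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, 0.0])
samples = [gumbel_max_sample(logits, rng) for _ in range(10_000)]
print(np.bincount(samples, minlength=4) / 10_000)  # ~ softmax(logits)
```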

[NLP-2] LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Link: https://arxiv.org/abs/2409.02897
Authors: Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li
Keywords: user verification difficult, makes user verification, demonstrated impressive capacities, user questions based, responses makes user
Subjects: Computation and Language (cs.CL)


Abstract: Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs’ performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.

[NLP-3] LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Link: https://arxiv.org/abs/2409.02889
Authors: Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
Keywords: Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large, high-resolution image understanding
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 19 pages, 7 figures, 6 tables


Abstract: Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.

[NLP-4] Configurable Foundation Models: Building LLMs from a Modular Perspective

Link: https://arxiv.org/abs/2409.02877
Authors: Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun
Keywords: abilities increasingly cumbersome, recently unveiled challenges, unveiled challenges tied, continual scalability due, limited computation resources
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)


Abstract: Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.

[NLP-5] Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

Link: https://arxiv.org/abs/2409.02865
Authors: Leanne Nortje
Keywords: unlabelled speech paired, learn from unlabelled, VGS, visually grounded speech, VGS models
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD Dissertation


Abstract: This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model similarly to what is observed in children.

[NLP-6] Historical German Text Normalization Using Type- and Token-Based Language Modeling

Link: https://arxiv.org/abs/2409.02841
Authors: Anton Ehrmanntraut
Keywords: natural language processing, Historic variations, full-text search, search or natural, historical digitized texts
Subjects: Computation and Language (cs.CL)
Comments: 27 pages, 3 figures


Abstract: Historic variations of spelling pose a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.
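
To make the type-based half of the approach concrete, here is a minimal sketch of learning a type-level normalization lexicon from word-aligned parallel data; the (historical, modern) pairs are illustrative, and the encoder-decoder for unseen types and the in-context language-model adjustment described above are omitted.

```python
from collections import Counter, defaultdict

def learn_type_lexicon(aligned_pairs):
    """Map each historical word type to its most frequent modern form,
    learned from word-aligned (historical, modern) token pairs."""
    counts = defaultdict(Counter)
    for hist, modern in aligned_pairs:
        counts[hist][modern] += 1
    return {h: c.most_common(1)[0][0] for h, c in counts.items()}

lexicon = learn_type_lexicon([("vnd", "und"), ("vnd", "und"),
                              ("seyn", "sein"), ("theyl", "Teil")])
tokens = "vnd seyn theyl".split()
print([lexicon.get(t, t) for t in tokens])  # ['und', 'sein', 'Teil']
```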

[NLP-7] R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education

Link: https://arxiv.org/abs/2409.02840
Authors: Phuc-Tinh Pham Do, Duy-Ngoc Dinh Cao, Khanh Quoc Tran, Kiet Van Nguyen
Keywords: Question Answering system, Machine Reader module, Machine Reader, Question Answering, Retriever module employs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


Abstract: In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to questions of students regarding legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs with a rigorous construction process. This is the first Vietnamese dataset in the higher education regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting the comprehension of students of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.
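
The three-stage architecture can be pictured as a simple pipeline. The skeleton below is an assumed sketch with placeholder callables for the three modules, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RetrieverReaderGenerator:
    # question, top-k -> retrieved contexts
    retrieve: Callable[[str, int], List[str]]
    # question, context -> extracted answer span
    read: Callable[[str, str], str]
    # question, extracted spans -> abstractive answer
    generate: Callable[[str, List[str]], str]

    def answer(self, question: str, k: int = 5) -> str:
        contexts = self.retrieve(question, k)
        spans = [self.read(question, c) for c in contexts]
        return self.generate(question, spans)
```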

[NLP-8] Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models

Link: https://arxiv.org/abs/2409.02836
Authors: Moein Shahiki Tash, Zahra Ahani, Mohim Tash, Olga Kolesnikova, Grigori Sidorov
Keywords: leveraging advanced natural, language processing techniques, Regret Detection behaviors, advanced natural language, natural language processing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)


Abstract: This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named “Prediction statements,” categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.
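
A few-shot classification call of the kind described might look like the sketch below, using the OpenAI chat completions API; the four labels come from the abstract, while the prompt wording and the few-shot examples are assumptions.

```python
from openai import OpenAI

LABELS = ["Predictive Incremental", "Predictive Decremental",
          "Predictive Neutral", "Non-Predictive"]

# Hand-written few-shot examples (illustrative, not from the paper).
FEW_SHOT = [
    ("ADA will double by December, mark my words.", "Predictive Incremental"),
    ("I just use Ripple for cheap transfers.", "Non-Predictive"),
]

def classify_comment(comment: str, model: str = "gpt-4o") -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    shots = "\n\n".join(f"Comment: {c}\nLabel: {l}" for c, l in FEW_SHOT)
    prompt = ("Classify the cryptocurrency comment into one of: "
              + ", ".join(LABELS) + ".\n\n" + shots
              + f"\n\nComment: {comment}\nLabel:")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```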

[NLP-9] CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Link: https://arxiv.org/abs/2409.02834
Authors: Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He
Keywords: obtained promising results, Large language models, human intelligence, obtained promising, promising results
Subjects: Computation and Language (cs.CL)


Abstract: Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.

[NLP-10] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Link: https://arxiv.org/abs/2409.02813
Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Ming Yin, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
Keywords: Massive Multi-discipline Multimodal, Multi-discipline Multimodal Understanding, Massive Multi-discipline, paper introduces MMMU-Pro, Multi-discipline Multimodal
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)


Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.

[NLP-11] Towards a Unified View of Preference Learning for Large Language Models: A Survey

Link: https://arxiv.org/abs/2409.02795
Authors: Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng, Wen Xiao, Ge Zhang, Daoguang Zan, Keming Lu, Bowen Yu, Dayiheng Liu, Zeyu Cui, Jian Yang, Lei Sha, Houfeng Wang, Zhifang Sui, Peiyi Wang, Tianyu Liu, Baobao Chang
Keywords: exhibit remarkably powerful, remarkably powerful capabilities, exhibit remarkably, powerful capabilities, Large Language Models
Subjects: Computation and Language (cs.CL)
Comments: Initial Commit, 21 pages


Abstract: Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM’s output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM’s performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

[NLP-12] A Comparative Study of Pre-training and Self-training

Link: https://arxiv.org/abs/2409.02751
Authors: Yiheng Wang, Jiayu Lin, Zuoquan Lin
Keywords: Pre-training, self-training, semi-supervised learning, data augmentation, Abstract
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 2 figures, 9 tables


Abstract: Pre-training and self-training are two approaches to semi-supervised learning. The comparison between pre-training and self-training has been explored. However, the previous works led to confusing findings: under incomparable settings, self-training outperforms pre-training on some tasks in computer vision, while, contrarily, pre-training outperforms self-training on some tasks in natural language processing. We propose, comparatively and exhaustively, an ensemble method to empirically study all feasible training paradigms combining pre-training, self-training, and fine-tuning within consistent foundational settings comparable to data augmentation. We conduct experiments on six datasets, four data augmentation, and imbalanced data for sentiment analysis and natural language inference tasks. Our findings confirm that the pre-training and fine-tuning paradigm yields the best overall performances. Moreover, self-training offers no additional benefits when combined with semi-supervised pre-training.

[NLP-13] Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Link: https://arxiv.org/abs/2409.02727
Authors: Yixuan Tang, Yi Yang
Keywords: Large Language Models, Large Language, LLM-based embedding models, advancements of Large, LLM-based embedding
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: this https URL


Abstract: The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
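
Two of the simpler pooling designs compared above fit in a few lines. The sketch below shows EOS-last-token pooling and mean pooling over an LLM's final hidden states (assuming right-padded batches); the proposed Multi-Layers Trainable Pooling, which cross-attends over all hidden layers, is more involved and is not reproduced here.

```python
import torch

def eos_last_token_pooling(hidden, attention_mask):
    """Use the hidden state of each sequence's last non-padding token
    (assumes right-padding)."""
    last = attention_mask.sum(dim=1) - 1             # index of last real token
    return hidden[torch.arange(hidden.size(0)), last]

def mean_pooling(hidden, attention_mask):
    """Average hidden states over non-padding positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

hidden = torch.randn(2, 5, 8)                        # (batch, seq_len, dim)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])
print(eos_last_token_pooling(hidden, attention_mask).shape)  # (2, 8)
print(mean_pooling(hidden, attention_mask).shape)            # (2, 8)
```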

[NLP-14] Pre-training data selection for biomedical domain adaptation using journal impact metrics

Link: https://arxiv.org/abs/2409.02725
Authors: Mathieu Laï-king, Patrick Paroubek
Keywords: natural language processing, biomedical domain, Domain adaptation, NLP, Domain
Subjects: Computation and Language (cs.CL)


Abstract: Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set; we then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact metrics is not efficient. But we also show that pre-training using fewer abstracts (but with the same number of training steps) does not necessarily decrease the resulting model’s performance.
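
The selection step itself reduces to filtering the pre-training pool on a score. A minimal sketch, assuming each record carries its journal's impact score under a hypothetical field name (the paper's exact metrics and thresholds may differ):

```python
def select_pretraining_abstracts(records, min_impact: float):
    """Keep abstracts whose journal clears an impact threshold.

    `records` is assumed to be an iterable of dicts with 'abstract' and
    'journal_impact' fields; both field names are hypothetical.
    """
    return [r["abstract"] for r in records
            if r.get("journal_impact", 0.0) >= min_impact]

corpus = [
    {"abstract": "We study protein folding ...", "journal_impact": 12.3},
    {"abstract": "A case report on ...", "journal_impact": 0.8},
]
subset = select_pretraining_abstracts(corpus, min_impact=5.0)  # keeps the first
```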

[NLP-15] Alignment-Aware Model Extraction Attacks on Large Language Models

Link: https://arxiv.org/abs/2409.02718
Authors: Zi Liang, Qingqing Ye, Yanyun Wang, Sen Zhang, Yaxin Xiao, Ronghua Li, Jianliang Xu, Haibo Hu
Keywords: received increasing research, increasing research attention, large language models, large language, received increasing
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Source code: this https URL


Abstract: Model extraction attacks (MEAs) on large language models (LLMs) have received increasing research attention lately. Existing attack methods on LLMs inherit the extraction strategies from those designed for deep neural networks (DNNs) yet neglect the inconsistency of training tasks between MEA and LLMs’ alignments. As such, they result in poor attack performance. To tackle this issue, we present Locality Reinforced Distillation (LoRD), a novel model extraction attack algorithm specifically for LLMs. In particular, we design a policy-gradient-style training task, which utilizes victim models’ responses as a signal to guide the crafting of preference for the local model. Theoretical analysis has shown that i) LoRD’s convergence procedure in MEAs is consistent with the alignments of LLMs, and ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments on domain-specific extractions demonstrate the superiority of our method by examining the extraction of various state-of-the-art commercial LLMs.

[NLP-16] A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations

Link: https://arxiv.org/abs/2409.02712
Authors: Nidhi Kowtal, Tejas Deshpande, Raviraj Joshi
Keywords: language pairs faces, language pairs, English-Marathi language pairs, pairs faces significant, Machine translation
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at I2CT 2024


Abstract: Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
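
A sketch of the filtering idea, written with the sentence-transformers library; the multilingual checkpoint below stands in for the IndicSBERT model, and the 0.75 threshold is an assumption, not the paper's setting:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder for an IndicSBERT-style multilingual similarity checkpoint.
model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def filter_parallel_corpus(pairs, threshold=0.75):
    """Keep (source, translation) pairs whose cross-lingual embeddings
    are close enough in cosine similarity; drop noisy pairs."""
    src = model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt = model.encode([t for _, t in pairs], convert_to_tensor=True)
    sims = util.cos_sim(src, tgt).diagonal()   # pairwise similarity per pair
    return [p for p, sim in zip(pairs, sims) if sim.item() >= threshold]
```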

[NLP-17] Detecting Calls to Action in Multimodal Content: Analysis of the 2021 German Federal Election Campaign on Instagram

Link: https://arxiv.org/abs/2409.02690
Authors: Michael Achmann-Denkler, Jakob Fehle, Mario Haim, Christian Wolff
Keywords: German Instagram election, Calls to Action, social media contexts, Instagram election campaign, German Instagram
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments: Accepted Archival Paper for the CPSS Workshop at KONVENS 2024. Camera Ready Submission


Abstract: This study investigates the automated classification of Calls to Action (CTAs) within the 2021 German Instagram election campaign to advance the understanding of mobilization in social media contexts. We analyzed over 2,208 Instagram stories and 712 posts using fine-tuned BERT models and OpenAI’s GPT-4 models. The fine-tuned BERT model incorporating synthetic training data achieved a macro F1 score of 0.93, demonstrating a robust classification performance. Our analysis revealed that 49.58% of Instagram posts and 10.64% of stories contained CTAs, highlighting significant differences in mobilization strategies between these content types. Additionally, we found that FDP and the Greens had the highest prevalence of CTAs in posts, whereas CDU and CSU led in story CTAs.

[NLP-18] Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs

Link: https://arxiv.org/abs/2409.02686
Authors: Ruoyu Wang, Xiaoxuan Li, Lina Yao
Keywords: Large Language Models, Large Language, recent studies reveal, Language Models, demonstrated remarkable efficiency
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


Abstract: Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model’s reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.
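
The abstract does not spell out DCA itself, but the parameter-efficient fine-tuning setting it builds on can be sketched with a generic LoRA configuration using Hugging Face's peft library; the base checkpoint, rank, and target modules below are placeholder choices, not the paper's:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint and target modules are placeholders; LoRA target module
# names depend on the architecture (e.g. "q_proj"/"v_proj" for Llama-style
# attention layers).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                 # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a few million tunable parameters
```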

[NLP-19] Creating Domain-Specific Translation Memories for Machine Translation Fine-tuning: The TRENCARD Bilingual Cardiology Corpus

Link: https://arxiv.org/abs/2409.02667
Authors: Gokhan Dogru
Keywords: language model fine-tuning, domain-specific parallel corpora, large language model, compile domain-specific parallel, machine translation training
Subjects: Computation and Language (cs.CL)


Abstract: This article investigates how translation memories (TMs) can be created by translators or other language professionals in order to compile domain-specific parallel corpora, which can then be used in different scenarios, such as machine translation training and fine-tuning, TM leveraging, and/or large language model fine-tuning. The article introduces a semi-automatic TM preparation methodology leveraging primarily translation tools used by translators in favor of data quality and control by the translators. This semi-automatic methodology is then used to build a cardiology-based Turkish-English corpus from bilingual abstracts of Turkish cardiology journals. The resulting corpus called TRENCARD Corpus has approximately 800,000 source words and 50,000 sentences. Using this methodology, translators can build their custom TMs in a reasonable time and use them in their tasks requiring bilingual data.
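
Once sentence pairs are aligned, packaging them as a TM is mostly serialization. Below is a minimal sketch that writes pairs to TMX 1.4, the interchange format most TM tools accept; the header metadata is kept to a bare minimum and the example pair is invented:

```python
import xml.etree.ElementTree as ET

def pairs_to_tmx(pairs, src_lang="tr", tgt_lang="en", path="tm.tmx"):
    """Serialize aligned (source, target) sentence pairs as a minimal
    TMX 1.4 translation memory; real TM tools usually expect richer
    header attributes."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-script", "creationtoolversion": "0.1",
        "datatype": "plaintext", "segtype": "sentence",
        "adminlang": "en", "srclang": src_lang, "o-tmf": "none",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)

pairs_to_tmx([("Kalp yetmezliği kronik bir hastalıktır.",
               "Heart failure is a chronic disease.")])
```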

[NLP-20] OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation

Link: https://arxiv.org/abs/2409.02649
Authors: Włodzimierz Lewoniewski, Piotr Stolarski, Milena Stróżyna, Elzbieta Lewańska, Aleksandra Wojewoda, Ewelina Księżniak, Marcin Sawiński
Keywords: paper presents, presents the experiments, Credibility Assessment, credibility assessment issues, Adversarial
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: CLEF 2024 - Conference and Labs of the Evaluation Forum


Abstract: This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modification and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.
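
At its simplest, combining attack methods can mean cascading them until one fools the classifier. The sketch below shows that generic scheme with placeholder callables; the modified BERT-Attack variants and hybrids described above go further than this:

```python
from typing import Callable, List, Optional

def ensemble_attack(text: str,
                    attacks: List[Callable[[str], str]],
                    fools_classifier: Callable[[str], bool]) -> Optional[str]:
    """Try each attack in turn; return the first perturbed text that
    flips the target classifier, or None if all attacks fail."""
    for attack in attacks:
        candidate = attack(text)
        if fools_classifier(candidate):
            return candidate
    return None
```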

[NLP-21] A Survey on Emergent Language

Link: https://arxiv.org/abs/2409.02645
Authors: Jannik Peters, Constantin Waubert de Puiseau, Hasan Tercan, Arya Gopikrishnan, Gustavo Adolpho Lucas De Carvalho, Christian Bitter, Tobias Meisen
Keywords: multi-agent reinforcement learning, emergent language represents, context of multi-agent, language, reinforcement learning
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)


Abstract: The field of emergent language represents a novel area of research within the domain of artificial intelligence, particularly within the context of multi-agent reinforcement learning. Although the concept of studying language emergence is not new, early approaches were primarily concerned with explaining human language formation, with little consideration given to its potential utility for artificial agents. In contrast, studies based on reinforcement learning aim to develop communicative capabilities in agents that are comparable to or even superior to human language. Thus, they extend beyond the learned statistical representations that are common in natural language processing research. This gives rise to a number of fundamental questions, from the prerequisites for language emergence to the criteria for measuring its success. This paper addresses these questions by providing a comprehensive review of 181 scientific publications on emergent language in artificial intelligence. Its objective is to serve as a reference for researchers interested in or proficient in the field. Consequently, the main contributions are the definition and overview of the prevailing terminology, the analysis of existing evaluation methods and metrics, and the description of the identified research gaps.

[NLP-22] PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation

Link: https://arxiv.org/abs/2409.02617
Authors: Aneta Pawelec, Victoria Sara Wesołowska, Zuzanna Bączek, Piotr Sankowski
Keywords: decision-making processes, crucial for advancing, large language models, data, interpret visual representations
Subjects: Computation and Language (cs.CL)


Abstract: The ability of large language models (LLMs) to interpret visual representations of data is crucial for advancing their application in data analysis and decision-making processes. This paper presents a novel synthetic dataset designed to evaluate the proficiency of LLMs in interpreting various forms of data visualizations, including plots like time series, histograms, violins, boxplots, and clusters. Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios. We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models like ChatGPT or Gemini, assessing their understanding and interpretative accuracy. To ensure data integrity, our benchmark dataset is generated automatically, making it entirely new and free from prior exposure to the models being tested. This strategy allows us to evaluate the models’ ability to truly interpret and understand the data, eliminating possibility of pre-learned responses, and allowing for an unbiased evaluation of the models’ capabilities. We also introduce quantitative metrics to assess the performance of the models, providing a robust and comprehensive evaluation tool. Benchmarking several state-of-the-art LLMs with this dataset reveals varying degrees of success, highlighting specific strengths and weaknesses in interpreting diverse types of visual data. The results provide valuable insights into the current capabilities of LLMs and identify key areas for improvement. This work establishes a foundational benchmark for future research and development aimed at enhancing the visual interpretative abilities of language models. In the future, improved LLMs with robust visual interpretation skills can significantly aid in automated data analysis, scientific research, educational tools, and business intelligence applications.
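
Generating a single benchmark item under controlled parameters might look like the sketch below; the question wording, the trend parameters, and returning the figure as base64 are illustrative assumptions:

```python
import base64
import io

import matplotlib.pyplot as plt
import numpy as np

def make_time_series_item(rng: np.random.Generator):
    """Build one (plot image, question, ground-truth answer) item, with
    the answer known by construction from the generated series."""
    trend = rng.uniform(-0.5, 0.5)                  # controlled parameter
    series = np.cumsum(rng.normal(trend, 1.0, 50))  # random walk with drift
    fig, ax = plt.subplots()
    ax.plot(series)
    ax.set_xlabel("t")
    ax.set_ylabel("value")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    question = "Is the overall trend of the series increasing or decreasing?"
    answer = "increasing" if series[-1] > series[0] else "decreasing"
    return base64.b64encode(buf.getvalue()).decode(), question, answer

image_b64, question, answer = make_time_series_item(np.random.default_rng(7))
```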

[NLP-23] An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

Link: https://arxiv.org/abs/2409.02596
Authors: Ryan Whetten, Titouan Parcollet, Adel Moumen, Marco Dinarelli, Yannick Estève
Keywords: Self-Supervised Learning, including speech processing, Learning, SSL, including speech
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in the IEEE Spoken Language Technology Workshop 2024


Abstract: Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
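
To make the complexity argument concrete, here is a generic kernelized linear attention in the style of Katharopoulos et al.; it is not the exact formulation of HyperMixing, Fastformer, SummaryMixing, or Mamba, but it shows how replacing the softmax turns the O(n^2) attention matrix into O(n) accumulations:

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear-time attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), so cost is O(n d^2) instead of O(n^2 d).
    q, k, v: (batch, seq_len, dim); non-causal for simplicity."""
    phi_q = torch.nn.functional.elu(q) + 1          # positive feature map
    phi_k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)     # accumulate keys * values
    z = 1.0 / (torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", phi_q, kv, z)

q = k = v = torch.randn(2, 1000, 64)
out = linear_attention(q, k, v)  # memory and time linear in seq_len
```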

[NLP-24] More is More: Addition Bias in Large Language Models

Link: https://arxiv.org/abs/2409.02569
Authors: Luca Santagata, Cristiano De Nobili
Keywords: Large Language Models, Language Models, cognitive bias observed, Large Language, drawing a parallel
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 25 pages, 8 figures


Abstract: In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, MathΣtral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others’ writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
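
The bias measurement ultimately reduces to classifying each model edit as additive or subtractive. A sketch of that bookkeeping for the palindrome task (the authors' exact scoring may differ):

```python
def edit_direction(original: str, revised: str) -> str:
    """Label a model's edit as additive, subtractive, or neutral by
    comparing character counts - the simple signal behind tasks like
    the palindrome experiment."""
    if len(revised) > len(original):
        return "additive"
    if len(revised) < len(original):
        return "subtractive"
    return "neutral"

def is_palindrome(s: str) -> bool:
    s = "".join(ch.lower() for ch in s if ch.isalnum())
    return s == s[::-1]

# Turning "abca" into a palindrome by adding vs. removing a letter:
assert is_palindrome("abcba") and edit_direction("abca", "abcba") == "additive"
assert is_palindrome("aba") and edit_direction("abca", "aba") == "subtractive"
```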

[NLP-25] Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts
[NLP-25] 过度分析时,语言是可怕的:用论证理论驱动的提示来解开隐含的厌女推理

链接: https://arxiv.org/abs/2409.02519
作者: Arianna Muti,Federico Ruggeri,Khalid Al-Khatib,Alberto Barrón-Cedeño,Tommaso Caselli
关键词-EN: Italian and English, large language models, Argumentative Reasoning task, propose misogyny detection, language models
关键词-ZH: 意大利语和英语、大型语言模型、论证推理任务、提出厌女症检测、语言模型
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:We propose misogyny detection as an Argumentative Reasoning task and we investigate the capacity of large language models (LLMs) to understand the implicit reasoning used to convey misogyny in both Italian and English. The central aim is to generate the missing reasoning link between a message and the implied meanings encoding the misogyny. Our study uses argumentation theory as a foundation to form a collection of prompts in both zero-shot and few-shot settings. These prompts integrate different techniques, including chain-of-thought reasoning and augmented knowledge. Our findings show that LLMs fall short on reasoning capabilities about misogynistic comments and that they mostly rely on their implicit knowledge derived from internalized common stereotypes about women to generate implied assumptions, rather than on inductive reasoning.
摘要:我们提出将厌女症检测作为一项论证推理任务,并研究大型语言模型(LLM)理解用于在意大利语和英语中表达厌女症的隐性推理的能力。其核心目的是在信息和编码厌女症的隐含含义之间产生缺失的推理联系。我们的研究以论证理论为基础,在零镜头和少镜头设置下形成提示集。这些提示集成了不同的技术,包括思想链推理和增强知识。我们的研究结果表明,LLM缺乏对厌恶女性的评论的推理能力,并且他们主要依赖于从内化的对女性的常见刻板印象中获得的隐性知识来产生隐含的假设,而不是归纳推理。

[NLP-26] Word and Phrase Features in Graph Convolutional Network for Automatic Question Classification
[NLP-26] 用于自动问题分类的图卷积网络中的单词和短语特征

链接: https://arxiv.org/abs/2409.02481
作者: Junyoung Lee,Ninad Dixit,Kaustav Chakrabarti,S. Supraja
关键词-EN: enabling adaptive learning, adaptive learning systems, AI-driven educational tools, Effective question classification, difficulty level
关键词-ZH: 实现自适应学习、自适应学习系统、人工智能驱动的教育工具、有效的问题分类、难度水平
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Effective question classification is crucial for AI-driven educational tools, enabling adaptive learning systems to categorize questions by skill area, difficulty level, and competence. This classification not only supports educational diagnostics and analytics but also enhances complex tasks like information retrieval and question answering by associating questions with relevant categories. Traditional methods, often based on word embeddings and conventional classifiers, struggle to capture the nuanced relationships in natural language, leading to suboptimal performance. To address this, we propose a novel approach leveraging graph convolutional networks (GCNs), named Phrase Question-Graph Convolutional Network (PQ-GCN) to better model the inherent structure of questions. By representing questions as graphs – where nodes signify words or phrases and edges denote syntactic or semantic relationships – our method allows GCNs to learn from the interconnected nature of language more effectively. Additionally, we explore the incorporation of phrase-based features to enhance classification accuracy, especially in low-resource settings. Our findings demonstrate that GCNs, augmented with these features, offer a promising solution for more accurate and context-aware question classification, bridging the gap between graph neural network research and practical educational applications.
摘要:有效的问题分类对于人工智能驱动的教育工具至关重要,它使自适应学习系统能够根据技能领域、难度水平和能力对问题进行分类。这种分类不仅支持教育诊断和分析,还通过将问题与相关类别相关联来增强信息检索和问题回答等复杂任务。传统的方法通常基于单词嵌入和传统分类器,难以捕捉自然语言中细微的关联关系,导致性能不佳。为了解决这一问题,我们提出了一种利用图卷积网络(GCN)的新方法,称为短语问题-图卷积网络(PQ-GCN),以更好地建模问题的内在结构。通过将问题表示为图(其中节点表示单词或短语,边表示句法或语义关系),我们的方法允许GCN更有效地从语言相互关联的本质中学习。此外,我们探索了结合基于短语的特征来提高分类精度,特别是在低资源环境下。我们的发现表明,结合这些特征的GCN为更准确、具备上下文感知能力的问题分类提供了一种有前途的解决方案,弥合了图神经网络研究和实际教育应用之间的差距。
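
To make the graph-over-question idea concrete, here is a minimal sketch of a two-layer GCN that pools word/phrase node features into question-level logits. The dimensions, the mean-pooling readout, and the random toy graph are assumptions for illustration; PQ-GCN's actual architecture and phrase features are not specified in the abstract.

```python
import torch
import torch.nn as nn

class QuestionGCN(nn.Module):
    def __init__(self, in_dim=64, hid_dim=32, n_classes=4):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, hid_dim)
        self.cls = nn.Linear(hid_dim, n_classes)

    @staticmethod
    def normalize(adj):
        # Symmetric normalization D^{-1/2}(A + I)D^{-1/2}, as in Kipf & Welling.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(1).clamp(min=1e-6).pow(-0.5)
        return d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]

    def forward(self, x, adj):
        a = self.normalize(adj)
        h = torch.relu(a @ self.w1(x))   # message passing, layer 1
        h = torch.relu(a @ self.w2(h))   # message passing, layer 2
        return self.cls(h.mean(dim=0))   # mean-pool nodes -> question logits

# Toy question graph: 5 word/phrase nodes with random features and edges.
x = torch.randn(5, 64)
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.T) > 0).float()        # make the graph undirected
logits = QuestionGCN()(x, adj)
print(logits.shape)                      # torch.Size([4])
```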

[NLP-27] A Comparative Study on Large Language Models for Log Parsing
[NLP-27] 用于日志解析的大型语言模型比较研究

链接: https://arxiv.org/abs/2409.02474
作者: Merve Astekin,Max Hort,Leon Moonen
关键词-EN: log parsing, Toggle, Log, Code, language models
关键词-ZH: 日志解析、切换、日志、代码、语言模型
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted for publication in the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '24)

点击查看摘要

Abstract:Background: Log messages provide valuable information about the status of software systems. This information is provided in an unstructured fashion and automated approaches are applied to extract relevant parameters. To ease this process, log parsing can be applied, which transforms log messages into structured log templates. Recent advances in language models have led to several studies that apply ChatGPT to the task of log parsing with promising results. However, the performance of other state-of-the-art large language models (LLMs) on the log parsing task remains unclear. Aims: In this study, we investigate the current capability of state-of-the-art LLMs to perform log parsing. Method: We select six recent LLMs, including both paid proprietary (GPT-3.5, Claude 2.1) and four free-to-use open models, and compare their performance on system logs obtained from a selection of mature open-source projects. We design two different prompting approaches and apply the LLMs on 1,354 log templates across 16 different projects. We evaluate their effectiveness, in the number of correctly identified templates, and the syntactic similarity between the generated templates and the ground truth. Results: We found that free-to-use models are able to compete with paid models, with CodeLlama extracting 10% more log templates correctly than GPT-3.5. Moreover, we provide qualitative insights into the usability of language models (e.g., how easy it is to use their responses). Conclusions: Our results reveal that some of the smaller, free-to-use LLMs can considerably assist log parsing compared to their paid proprietary competitors, especially code-specialized models.
摘要:背景:日志消息提供有关软件系统状态的有价值的信息。这些信息是以非结构化方式提供的,并应用自动化方法来提取相关参数。为了简化这一过程,可以应用日志解析,将日志消息转换为结构化日志模板。语言模型的最新进展导致了将ChatGPT应用于日志解析任务的几项研究,并取得了令人振奋的结果。然而,其他最先进的大型语言模型(LLM)在日志解析任务上的性能仍然不清楚。目的:在这项研究中,我们调查了目前最先进的LLM执行日志解析的能力。方法:我们选择了最近的6个LLM,包括付费专有(GPT-3.5,Claude 2.1)和4个免费使用的开放模型,并在从一些成熟的开源项目中获得的系统日志上比较了它们的性能。我们设计了两种不同的提示方法,并将LLM应用于16个不同项目的1,354个日志模板。我们通过正确识别模板的数量以及生成的模板与基本事实之间的句法相似性来评估它们的有效性。结果:我们发现免费使用的模型能够与付费模型竞争,CodeLlama比GPT-3.5多提取10%的日志模板。此外,我们对语言模型的可用性提供了定性的见解(例如,使用它们的响应有多容易)。结论:我们的结果显示,与付费的专有竞争对手相比,一些较小的、免费使用的LLM可以相当大程度上帮助进行日志解析,特别是代码专门化的模型。
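
The two evaluation angles mentioned in the abstract (count of exactly matched templates, plus syntactic similarity to ground truth) can be sketched as below; using `difflib.SequenceMatcher` as the similarity measure is our assumption, not necessarily the paper's metric.

```python
# Rough sketch: exact template matches and mean syntactic similarity.
from difflib import SequenceMatcher

def evaluate(generated, reference):
    exact = sum(g == r for g, r in zip(generated, reference))
    sims = [SequenceMatcher(None, g, r).ratio()
            for g, r in zip(generated, reference)]
    return exact / len(reference), sum(sims) / len(sims)

# Invented toy templates; "<*>" marks a parsed-out parameter.
gen = ["connected to <*>", "user <*> logged in", "error code <*> raised"]
ref = ["connected to <*>", "user <*> logged in from <*>", "error code <*>"]
acc, sim = evaluate(gen, ref)
print(f"exact match: {acc:.2f}, mean similarity: {sim:.2f}")
```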

[NLP-28] DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
[NLP-28] DetectiveQA:评估侦探小说的长上下文推理

链接: https://arxiv.org/abs/2409.02465
作者: Zhe Xu,Jiasheng Ye,Xiangyang Liu,Tianxiang Sun,Xiaoran Liu,Qipeng Guo,Linlin Li,Qun Liu,Xuanjing Huang,Xipeng Qiu
关键词-EN: Large Language Models, Large Language, Language Models, academia and industry, hot topic
关键词-ZH: 大型语言模型,大型语言,语言模型,学术界和工业界,热门话题
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid advancement of Large Language Models (LLMs), long-context information understanding and processing have become a hot topic in academia and industry. However, benchmarks for evaluating the ability of LLMs to handle long-context information do not seem to have kept pace with the development of LLMs. Despite the emergence of various long-context evaluation benchmarks, the types of capability assessed are still limited, without new capability dimensions. In this paper, we introduce DetectiveQA, a narrative reasoning benchmark featured with an average context length of over 100K tokens. DetectiveQA focuses on evaluating the long-context reasoning ability of LLMs, which not only requires a full understanding of context but also requires extracting important evidences from the context and reasoning according to extracted evidences to answer the given questions. This is a new dimension of capability evaluation, which is more in line with the current intelligence level of LLMs. We use detective novels as data sources, which naturally have various reasoning elements. Finally, we manually annotated 600 questions in Chinese and then also provided an English edition of the context information and questions. We evaluate many long-context LLMs on DetectiveQA, including commercial and open-sourced models, and the results indicate that existing long-context LLMs still require significant advancements to effectively process true long-context dependency questions.
摘要:随着大语言模型的迅速发展,长上下文信息的理解和处理已成为学术界和工业界的研究热点。然而,评估LLMS处理长背景信息的能力的基准似乎没有跟上LLMS的发展步伐。尽管出现了各种长期评价基准,但评估的能力类型仍然有限,没有新的能力层面。在本文中,我们介绍了一个叙事推理基准DetectiveQA,它的平均上下文长度超过10万个令牌。DetectiveQA的重点是评估LLMS的长上下文推理能力,这不仅要求对上下文有充分的理解,而且需要从上下文中提取重要证据,并根据提取的证据进行推理来回答给定的问题。这是能力评估的一个新维度,更符合当前LLMS的智能水平。我们使用侦探小说作为数据来源,自然有各种各样的推理元素。最后,我们用中文对600个问题进行了人工标注,并提供了上下文信息和问题的英文版本。我们在DetectiveQA上对许多长上下文LLM进行了评估,包括商业模型和开源模型,结果表明现有的长上下文LLM仍然需要显著的改进才能有效地处理真正的长上下文依赖问题。

[NLP-29] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP2024
[NLP-29] 规范化会失去什么?探索多语言ASR模型评估中的陷阱

链接: https://arxiv.org/abs/2409.02449
作者: Kavya Manohar,Leena G Pillai
关键词-EN: automatic speech recognition, evaluating multilingual automatic, multilingual automatic speech, ASR models, leading ASR models
关键词-ZH: 自动语音识别、评估多语言自动、多语言自动语音、ASR模型、领先的ASR模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to EMNLP 2024

点击查看摘要

Abstract:This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta’s MMS, Seamless, and Assembly AI’s Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards developing normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
摘要:本文探讨了评估多语言自动语音识别(ASR)模型的陷阱,特别关注印度语系文字。我们调查了领先的ASR模型所使用的文本规范化例程,包括OpenAI Whisper、Meta的MMS、Seamless和Assembly AI的Conformer,以及它们对性能度量的意外后果。我们的研究表明,目前的文本规范化做法虽然旨在通过消除拼写、标点符号和特殊字符等不一致的情况来标准化ASR输出以进行公平比较,但在应用于印度文字时存在根本缺陷。通过使用文本相似度得分的实证分析和深入的语言检查,我们证明了这些缺陷导致了印度语言的性能度量被人为夸大。最后,我们建议转向开发利用母语专业知识的规范化例程,确保对多语言ASR模型进行更稳健和准确的评估。
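
The failure mode is easy to demonstrate: a normalizer that strips anything Unicode classifies as a mark is harmless for English diacritics but deletes Devanagari vowel signs, so distinct words collapse into one and error rates are deflated. The `naive_normalize` below is a stand-in for the kind of routine criticized in the paper, not any specific model's actual code.

```python
import unicodedata

def naive_normalize(text: str) -> str:
    # Drop all Unicode marks (categories Mn/Mc/Me) after decomposition.
    # Harmless for English, destructive for Indic scripts.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))

hyp = "कम"    # "kam"
ref = "काम"   # "kaam": differs only by the vowel sign U+093E
print(hyp == ref)                                     # False: a real ASR error
print(naive_normalize(hyp) == naive_normalize(ref))   # True: the error is masked
```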

[NLP-30] Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
[NLP-30] 大型语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

链接: https://arxiv.org/abs/2409.02428
作者: Guanwen Xie,Jingzehua Xu,Yiyuan Yang,Shuai Zhang
关键词-EN: Leveraging large language, large language models, demonstrates significant potential, Leveraging large, functions demonstrates significant
关键词-ZH: 利用大型语言、大型语言模型展示了巨大的潜力,利用大型功能展示了巨大的潜力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one round of feedback per requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
摘要:利用大型语言模型(LLM)来设计奖励函数显示出巨大的潜力。然而,在具有复杂定制环境和多需求的强化学习(RL)任务中实现奖励函数的有效设计和改进具有相当大的挑战性。在本文中,我们使LLM成为有效的白盒搜索器,突出了它们高级的语义理解能力。具体地说,我们为每个明确的用户需求生成奖励组件,并利用奖励评价器来识别正确的代码形式。然后,LLM将权重分配给奖励分量以平衡它们的值,并基于训练日志分析器提供的上下文迭代地搜索和优化这些权重,同时自适应地确定搜索步长。我们将该框架应用于一个没有直接人类反馈或奖励示例(零镜头)的水下信息收集RL任务。奖励评价器仅凭每项需求一轮反馈就成功修正了奖励代码,有效地防止了在以汇总方式提供奖励函数反馈时可能发生的不可弥补的错误。权重的有效初始化使得在不进行权重搜索的情况下即可获得Pareto解集内的不同奖励函数。即使在权重偏离100倍的情况下,也只需要不到四次迭代就可以获得满足用户要求的解。在使用GPT-3.5 Turbo时,该框架对大多数提示也能很好地工作,因为它不需要高级的数值理解或计算能力。
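
Structurally, the searched reward is just a weighted sum of per-requirement components, with the weights revised between training runs. A schematic sketch follows; all component functions and values are invented stand-ins for an underwater task, not the paper's code.

```python
# One reward component per explicit user requirement (illustrative).
def r_goal_distance(state):   # requirement 1: approach the target
    return -state["dist_to_goal"]

def r_energy(state):          # requirement 2: conserve energy
    return -state["energy_used"]

def r_coverage(state):        # requirement 3: collect information
    return state["cells_scanned"]

COMPONENTS = [r_goal_distance, r_energy, r_coverage]

def reward(state, weights):
    # Final reward = weighted sum of components; the searcher's job is
    # to revise `weights` after reading a training-log summary.
    return sum(w * f(state) for w, f in zip(weights, COMPONENTS))

weights = [1.0, 0.1, 0.5]     # one "search step" proposal (toy values)
state = {"dist_to_goal": 3.2, "energy_used": 1.4, "cells_scanned": 7}
print(reward(state, weights))
```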

[NLP-31] Abstractive Text Summarization: State of the Art Challenges and Improvements
[NLP-31] 抽象文本总结:最新技术水平的挑战和改进

链接: https://arxiv.org/abs/2409.02413
作者: Hassan Shakil,Ahmad Farooq,Jugal Kalita
关键词-EN: Specifically focusing, prospective research directions, opposed to extractive, survey presents, summarization
关键词-ZH: 具体关注、前瞻性研究方向,而不是提取、调查呈现、总结
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 Tables, 7 Figures

点击查看摘要

Abstract:Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
摘要:这项调查特别关注摘要文本摘要的前景,而不是摘要技术,提供了一个全面的概述,深入探讨了最新的技术、普遍存在的挑战和未来的研究方向。我们将这些技术分为传统的序列到序列模型、预训练的大型语言模型、强化学习、分层方法和多模式摘要。与以前没有详细审查复杂性、可扩展性和技术比较的工作不同,这篇综述采取了一种全面的方法,包括最新的方法、挑战、解决方案、比较、限制和未来的改进-为研究人员提供了一个广泛的概述,以推进抽象总结研究。我们提供了各种分类技术的重要比较表–提供对模型复杂性、可扩展性和适当应用的洞察。文章强调了诸如意义表达不充分、事实一致性、文本摘要可控、跨语言摘要和评价指标等方面的挑战。为应对这些挑战,提出了利用知识融合和其他创新战略的解决方案。文章最后强调了新兴的研究领域,如事实不一致、特定领域、跨语言、多语言和长文档摘要,以及处理噪声数据。我们的目标是为研究人员和从业者提供该领域的结构化概述,使他们能够更好地了解当前的情况,并确定进一步研究和改进的潜在领域。

[NLP-32] Determination of language families using deep learning
[NLP-32] 使用深度学习确定语系

链接: https://arxiv.org/abs/2409.02393
作者: Peter B. Lerner
关键词-EN: convolutional generative adversarial, establish linguistic affinities, analyze transliterated text, transliterated text fragments, convolutional generative
关键词-ZH: 卷积生成对抗,建立语言亲和力,分析音译文本,音译文本片段,卷积生成
类目: Computation and Language (cs.CL)
备注: First draft. Comments are welcome

点击查看摘要

Abstract:We use a c-GAN (convolutional generative adversarial) neural network to analyze transliterated text fragments of extant, dead comprehensible, and one dead non-deciphered (Cypro-Minoan) language to establish linguistic affinities. The paper is agnostic with respect to translation and/or deciphering. However, there is hope that the proposed approach can be useful for decipherment with more sophisticated neural network techniques.
摘要:我们使用c-GAN(卷积生成对抗)神经网络来分析现存语言、已消亡但可理解的语言以及一种已消亡且尚未破译的语言(Cypro-Minoan)的音译文本片段,以建立语言间的亲缘关系。该论文对翻译和/或破译持不可知态度。不过,我们希望所提出的方法结合更复杂的神经网络技术能够对破译工作有所帮助。

[NLP-33] Large Language Models and Cognitive Science: A Comprehensive Review of Similarities Differences and Challenges
[NLP-33] 大型语言模型和认知科学:相似性、差异和挑战的全面回顾

链接: https://arxiv.org/abs/2409.02387
作者: Qian Niu,Junyu Liu,Ziqian Bi,Pohsun Feng,Benji Peng,Keyu Chen
关键词-EN: Large Language Models, Large Language, Language Models, intersection of Large, comprehensive review explores
关键词-ZH: 大型语言模型,大型语言,语言模型,大型交叉,全面评论探索
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 1 figure

点击查看摘要

Abstract:This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.
摘要:本综述探讨了大语言模型与认知科学的交叉,考察了大语言模型与人类认知过程的异同。我们分析了评估LLMS认知能力的方法,并讨论了它们作为认知模型的潜力。该综述涵盖了LLMS在不同认知领域的应用,强调了为认知科学研究所获得的见解。我们评估了LLMS的认知偏差和局限性,并提出了改进其性能的方法。研究了LLMS与认知体系结构的集成,揭示了增强人工智能(AI)能力的有希望的途径。确定了主要挑战和未来研究方向,强调需要继续改进LLMS,以更好地与人类认知保持一致。这篇综述为LLMS在促进我们对人工智能和人类智能的理解方面的现状和未来潜力提供了一个平衡的视角。

[NLP-34] STAB: Speech Tokenizer Assessment Benchmark
[NLP-34] STAB:语音令牌器评估基准

链接: https://arxiv.org/abs/2409.02384
作者: Shikhar Vashishth,Harman Singh,Shikhar Bharadwaj,Sriram Ganapathy,Chulayuth Asawaroengchai,Kartik Audhkhasi,Andrew Rosenberg,Ankur Bapna,Bhuvana Ramabhadran
关键词-EN: closely resembles text, widely successful large, successful large language, large language models, Representing speech
关键词-ZH: 与文本非常相似,广泛成功的大型,成功的大型语言,大型语言模型,代表语音
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages

点击查看摘要

Abstract:Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text, thus enabling the use of speech as an input to the widely successful large language models (LLMs). Currently, while several speech tokenizers have been proposed, there is ambiguity regarding the properties that are desired from a tokenizer for specific downstream tasks and its overall generalizability. Evaluating the performance of tokenizers across different downstream tasks is a computationally intensive effort that poses challenges for scalability. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, thereby offering a valuable resource for expediting the advancement of future tokenizer models and enabling comparative analysis using a standardized benchmark. We evaluate the STAB metrics and correlate this with downstream task performance across a range of speech tasks and tokenizer choices.
摘要:将语音表示为离散标记提供了一个框架,用于将语音转换为与文本非常相似的格式,从而使语音能够用作广泛成功的大型语言模型(LLMS)的输入。目前,虽然已经提出了几个语音标记器,但是关于特定下游任务的标记器所需的属性及其总体概括性存在歧义。评估令牌化器在不同下游任务中的性能是一项计算密集型工作,对可伸缩性构成了挑战。为了规避这一要求,我们提出了STAB(语音令牌化器评估基准),这是一个系统的评估框架,旨在全面评估语音令牌化器并揭示其内在特征。这一框架对语音标记化的基本机制有了更深入的理解,从而为加快未来标记器模型的发展和使用标准化基准进行比较分析提供了宝贵的资源。我们评估STAB指标,并将其与一系列语音任务和标记器选择的下游任务性能相关联。

[NLP-35] How Privacy-Savvy Are Large Language Models ? A Case Study on Compliance and Privacy Technical Review
[NLP-35] 大型语言模型的隐私程度如何?合规与隐私技术审查案例研究

链接: https://arxiv.org/abs/2409.02375
作者: Xichou Zhu,Yang Liu,Zhou Shen,Yi Liu,Min Li,Yujun Chen,Benzi John,Zhenzhen Ma,Tao Hu,Bolong Yang,Manman Wang,Zongxing Xie,Peng Liu,Dan Cai,Junhui Wang
关键词-EN: large language models, complex question answering, technical privacy reviews, language generation, privacy
关键词-ZH: 大型语言模型、复杂问题回答、技术隐私审查、语言生成、隐私
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:The recent advances in large language models (LLMs) have significantly expanded their applications across various fields such as language generation, summarization, and complex question answering. However, their application to privacy compliance and technical privacy reviews remains under-explored, raising critical concerns about their ability to adhere to global privacy standards and protect sensitive user data. This paper seeks to address this gap by providing a comprehensive case study evaluating LLMs’ performance in privacy-related tasks such as privacy information extraction (PIE), legal and regulatory key point detection (KPD), and question answering (QA) with respect to privacy policies and data protection regulations. We introduce a Privacy Technical Review (PTR) framework, highlighting its role in mitigating privacy risks during the software development life-cycle. Through an empirical assessment, we investigate the capacity of several prominent LLMs, including BERT, GPT-3.5, GPT-4, and custom models, in executing privacy compliance checks and technical privacy reviews. Our experiments benchmark the models across multiple dimensions, focusing on their precision, recall, and F1-scores in extracting privacy-sensitive information and detecting key regulatory compliance points. While LLMs show promise in automating privacy reviews and identifying regulatory discrepancies, significant gaps persist in their ability to fully comply with evolving legal standards. We provide actionable recommendations for enhancing LLMs’ capabilities in privacy compliance, emphasizing the need for robust model improvements and better integration with legal and regulatory requirements. This study underscores the growing importance of developing privacy-aware LLMs that can both support businesses in compliance efforts and safeguard user privacy rights.
摘要:大语言模型的最新进展使其在语言生成、摘要、复杂问题回答等领域的应用得到了极大的扩展。然而,它们在隐私合规和技术隐私审查方面的应用仍未得到充分探索,这引发了人们对它们遵守全球隐私标准和保护敏感用户数据的能力的严重关切。为了解决这一差距,本文提供了一个全面的案例研究,评估LLMS在隐私相关任务中的表现,如隐私信息提取(PIE)、法律和监管关键点检测(KPD)以及与隐私政策和数据保护法规相关的问题回答(QA)。我们引入了隐私技术审查(PTR)框架,强调其在降低软件开发生命周期中的隐私风险方面的作用。通过实证评估,我们调查了几个重要的LLM,包括BERT、GPT-3.5、GPT-4和自定义模型,在执行隐私合规性检查和技术隐私审查方面的能力。我们的实验在多个维度上对模型进行了基准测试,重点考察了它们在提取隐私敏感信息和检测关键法规遵从点方面的精确度、召回率和F1得分。虽然LLM在自动化隐私审查和识别监管差异方面表现出了希望,但它们在完全遵守不断演变的法律标准的能力方面仍然存在重大差距。我们为增强LLMS在隐私合规方面的能力提供了可行的建议,强调了健全的模型改进以及与法律和监管要求更好地整合的必要性。这项研究强调了开发具有隐私意识的LLM的日益重要性,这些LLM既能支持企业的合规努力,又能保护用户的隐私权。

[NLP-36] Do Large Language Models Possess Sensitive to Sentiment?
[NLP-36] 大型语言模型对情绪敏感吗?

链接: https://arxiv.org/abs/2409.02370
作者: Yang Liu,Xichou Zhu,Zhou Shen,Yi Liu,Min Li,Yujun Chen,Benzi John,Zhenzhen Ma,Tao Hu,Zhiyang Xu,Wei Luo,Junhui Wang
关键词-EN: Large Language Models, Large Language, language understanding, Language Models, recently displayed
关键词-ZH: 大型语言模型,大型语言,语言理解,语言模型,最近显示
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in text modal. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models’ outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. Take an example in our findings, in some cases, the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
摘要:近年来,大型语言模型在语言理解方面表现出了非凡的能力。然而,如何全面评估LLM的情感能力仍然是一个挑战。本文研究了LLM对文本模态中情感的检测和反应能力。随着LLM在不同应用中的整合程度越来越高,理解它们对情绪基调的敏感性变得非常关键,因为这会影响用户体验和情绪驱动任务的有效性。我们进行了一系列实验来评估几个著名的LLM在识别和恰当地应对积极、消极和中性情绪方面的表现。我们在多个情绪基准上分析了这些模型的输出,并将它们的反应与人类评估进行比较。我们的发现表明,尽管LLM对情绪表现出基本的敏感性,但它们的准确性和一致性存在很大差异,这强调了需要在其训练过程中进一步增强,以更好地捕捉微妙的情绪线索。以我们的发现为例,在某些情况下,模型可能会错误地将强烈积极的情绪归类为中性,或者无法识别文本中的讽刺或反语。这种错误分类突显了情绪分析的复杂性,以及模型需要改进的领域。另一方面,不同的LLM可能对相同的数据集表现不同,这取决于它们的体系结构和训练数据集。这种差异要求更深入地研究造成性能差异的因素以及如何优化这些因素。

[NLP-37] Diversify-verify-adapt: Efficient and Robust Retrieval-Augmented Ambiguous Question Answering
[NLP-37] 多样化-验证-适应:高效且稳健的检索-增强模糊问题解答

链接: https://arxiv.org/abs/2409.02361
作者: Yeonjun In,Sungchul Kim,Ryan A. Rossi,Md Mehrab Tanjim,Tong Yu,Ritwik Sinha,Chanyoung Park
关键词-EN: generating comprehensive responses, comprehensive responses based, retrieval augmented generation, augmented generation, plausible interpretations
关键词-ZH: 生成综合响应、基于综合响应、检索增强生成、增强生成、合理解释
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The retrieval augmented generation (RAG) framework addresses an ambiguity in user queries in QA systems by retrieving passages that cover all plausible interpretations and generating comprehensive responses based on the passages. However, our preliminary studies reveal that a single retrieval process often suffers from low quality results, as the retrieved passages frequently fail to capture all plausible interpretations. Although the iterative RAG approach has been proposed to address this problem, it comes at the cost of significantly reduced efficiency. To address these issues, we propose the diversify-verify-adapt (DIVA) framework. DIVA first diversifies the retrieved passages to encompass diverse interpretations. Subsequently, DIVA verifies the quality of the passages and adapts the most suitable approach tailored to their quality. This approach improves the QA systems accuracy and robustness by handling low quality retrieval issue in ambiguous questions, while enhancing efficiency.
摘要:检索增强生成(RAG)框架通过检索覆盖所有合理解释的段落并基于这些段落生成综合回答来解决QA系统中用户查询的歧义性。然而,我们的初步研究表明,单一的检索过程经常受到低质量结果的影响,因为检索的段落经常无法捕捉到所有似乎合理的解释。虽然已经提出了迭代RAG方法来解决这个问题,但它是以显著降低效率为代价的。为了解决这些问题,我们提出了多样化-核实-适应(DIVA)框架。DIVA首先将检索到的段落多样化,以包含不同的解释。随后,DIVA验证段落的质量,并采用最适合其质量的方法。这种方法通过处理歧义问题中的低质量检索问题,在提高效率的同时,提高了问答系统的准确性和稳健性。
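
The diversify-verify-adapt control flow can be summarized in a few lines. The three stage functions below are placeholders for a real retriever, verifier, and generator; only the branching structure follows the description above.

```python
def diversify(passages, k=5):
    # Placeholder: pick passages covering distinct interpretations,
    # e.g. via maximal marginal relevance. Here: simply the top-k.
    return passages[:k]

def verify(passages, question):
    # Placeholder quality check; a real verifier might be an LLM judge.
    return any(question.lower() in p.lower() for p in passages)

def answer(question, retrieved):
    passages = diversify(retrieved)
    if verify(passages, question):
        return f"grounded answer from {len(passages)} passages"
    # Adapt: fall back to a strategy suited to low-quality retrieval,
    # e.g. answering from the model's parametric knowledge alone.
    return "fallback answer without retrieval"

print(answer("apple", ["the apple fruit ...", "Apple Inc. ..."]))
```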

[NLP-38] NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval
[NLP-38] NUDGE:用于检索的嵌入的轻量级非参数微调

链接: https://arxiv.org/abs/2409.02343
作者: Sepanta Zeighami,Zac Wellmer,Aditya Parameswaran
关键词-EN: Nearest Neighbor search, Nearest Neighbor, Retrieval-Augmented Generation, Neighbor search, dense vector embeddings
关键词-ZH: 最近邻居搜索、最近邻居、检索增强生成、邻居搜索、密集载体嵌入
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:k-Nearest Neighbor search on dense vector embeddings (k-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of k-NN retrieval. We present a thorough theoretical and experimental study of NUDGE’s non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
摘要:基于预训练嵌入模型的密集向量嵌入上的k-近邻搜索(k-NN检索)是文本和图像以及检索增强生成(RAG)流水线的主要检索方法。在实践中,应用程序开发人员经常微调嵌入,以提高它们在手头的数据集和查询工作负载上的准确性。现有的方法要么微调预训练模型本身,要么更高效地(但以准确性为代价)训练适配器模型来转换预训练模型的输出。我们提出了NUDGE,这是一系列新颖的非参数嵌入微调方法,比上述两类现有方法都更准确、更高效。NUDGE直接修改数据记录的嵌入,以最大限度地提高k-NN检索的准确性。我们对NUDGE的非参数方法进行了深入的理论和实验研究。我们证明了即使潜在的问题是NP-难的,其受约束的变体也可以被有效地求解。这些约束还确保了对嵌入的更改是适度的,从而避免了对预训练期间学习到的语义的严重扭曲。在对5个预训练模型和9个标准文本和图像检索数据集进行的实验中,NUDGE在几分钟内运行完毕,通常比现有的微调方法将NDCG@10提高超过10%。平均而言,与微调预训练模型和训练适配器相比,NUDGE带来的准确率提升分别高出3.3倍和4.3倍,运行速度分别快200倍和3倍。
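
A caricature of the core move (ascend query-document similarity by editing document embeddings directly, while projecting each per-record change onto a small ball) is sketched below with PyTorch. The projection radius, learning rate, and loss are illustrative assumptions, not NUDGE's exact constrained formulation.

```python
import torch

def nudge(doc_emb, queries, pos_idx, eps=0.1, lr=0.5, steps=50):
    base = doc_emb.clone()
    delta = torch.zeros_like(doc_emb, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        sims = queries @ (base + delta).T            # similarity scores
        loss = -sims[torch.arange(len(queries)), pos_idx].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                        # project rows: ||delta_i|| <= eps
            norms = delta.norm(dim=1, keepdim=True).clamp(min=1e-9)
            delta.mul_(norms.clamp(max=eps) / norms)
    return (base + delta).detach()

docs = torch.nn.functional.normalize(torch.randn(100, 32), dim=1)
qs = torch.nn.functional.normalize(torch.randn(10, 32), dim=1)
pos = torch.randint(0, 100, (10,))
tuned = nudge(docs, qs, pos)
print((qs @ tuned.T).argmax(dim=1).eq(pos).float().mean())  # top-1 accuracy
```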

[NLP-39] Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
[NLP-39] Arctic-SnowCoder:在代码预训练中揭开高质量数据的神秘面纱

链接: https://arxiv.org/abs/2409.02326
作者: Yuxiang Wei,Hojae Han,Rajhans Samdani
关键词-EN: Recent studies, increasingly demonstrating, crucial for effective, data, pretraining
关键词-ZH: 最近的研究越来越多地表明,数据、预训练对于有效的关键
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of “high-quality” remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
摘要:最近的研究越来越多地表明,高质量的数据对于语言模型的有效预训练至关重要。然而,对“高质量”的准确定义仍未得到充分探索。聚焦代码领域,我们引入了Arctic-SnowCoder-1.3B,这是一种数据高效的基础代码模型,通过三个阶段逐步精化的数据对555B令牌进行预训练:(1)使用500B标准质量的代码令牌进行一般预训练,通过基本过滤、去重和去污进行预处理;(2)使用50B高质量令牌继续预训练,这些令牌由经过训练以区分优质代码和随机数据的BERT风格质量标注器从第一阶段中选出,正例取自高质量代码文件,并辅以来自Magicoder和StarCoder2-Instruct的指令数据;(3)使用Llama-3.1-70B以第二阶段数据为种子创建的5B合成数据进行增强预训练,并采用Magicoder方法进行预训练。尽管在有限的数据集上训练,Arctic-SnowCoder在BigCodeBench(一种专注于实际且具有挑战性的编程任务的编码基准)上取得了最先进的性能,与使用不超过1T令牌训练的类似规模模型相比,性能高出Phi-1.5-1.3B 36%。在所有评估的基准中,Arctic-SnowCoder-1.3B击败了在1T令牌上预训练的StarCoderBase-3B。此外,它的性能与基于数万亿令牌训练的领先小型基础代码模型相当。例如,Arctic-SnowCoder-1.3B在HumanEval+(一种评估函数级代码生成的基准)上超过了在超过3.3T令牌上预训练的StarCoder2-3B,并且在BigCodeBench上仍然具有竞争力。我们的评估提供了一个全面的分析,论证了Arctic-SnowCoder各种设计选择的合理性。最重要的是,我们发现高质量数据的关键是它与下游应用程序分布的一致性。

[NLP-40] Optimal L-Systems for Stochastic L-system Inference Problems
[NLP-40] 随机L-系统推理问题的最优L-系统

链接: https://arxiv.org/abs/2409.02259
作者: Ali Lotfi,Ian McQuillan
关键词-EN: optimal stochastic L-system, stochastic L-system capable, stochastic L-system, stochastic Lindenmayer-system, L-system
关键词-ZH: 最佳随机L-系统,有能力的随机L-系统,随机L-系统,随机林登迈尔系统,L-系统
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-systems with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring production of a stochastically optimal stochastic L-system suitable for generating the given sequence. This allows for the use of using stochastic L-systems as model for machine learning using only positive data for training.
文摘:本文提出了两个新的定理来解决随机林登迈尔系统(L系统)推理中的两个公开问题,特别是着重于构造一个能够产生给定串序列的最优随机L系统。第一个定理描述了一种构造随机L系统的方法,该系统通过奇异导数最大化产生给定单词序列的可能性。此外,第二个定理确定了具有最高概率产生具有多个可能派生的给定单词序列的随机L系统。在此基础上,给出了一种从给定序列推导出最优随机L系统的算法。该算法结合了复杂的优化技术,如内点法,确保产生适合于产生给定序列的随机最优随机L系统。这允许使用随机L系统作为机器学习的模型,仅使用正数据进行训练。
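
The flavor of the single-derivation result can be checked numerically: for a fixed derivation, the likelihood-maximizing production probabilities are simply the normalized usage counts per symbol. The toy grammar below is our own illustration, not an example from the paper.

```python
from collections import Counter
from math import prod

# Productions observed along one derivation: (symbol, replacement).
applications = [("A", "AB"), ("A", "AB"), ("A", "B"), ("B", "A")]

counts = Counter(applications)
per_symbol = Counter(sym for sym, _ in applications)

# MLE: P(symbol -> rhs) = count(symbol -> rhs) / count(symbol rewritten)
probs = {rule: c / per_symbol[rule[0]] for rule, c in counts.items()}
likelihood = prod(probs[rule] for rule in applications)

print(probs)       # {('A','AB'): 0.667, ('A','B'): 0.333, ('B','A'): 1.0}
print(likelihood)  # (2/3)^2 * (1/3) * 1.0 = 4/27
```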

[NLP-41] MMLU-Pro: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
[NLP-41] MMLU-Pro:评估LLM中的高级推理和RST学习

链接: https://arxiv.org/abs/2409.02257
作者: Saeid Asgari Taghanaki,Aliasgahr Khani,Amir Khasahmadi
关键词-EN: Existing benchmarks, challenging evaluation frameworks, large language models, increasingly struggle, large language
关键词-ZH: 现有的基准、具有挑战性的评估框架、大型语言模型、日益困难、大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs’ ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro’s difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of five state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation codes at this https URL.
摘要:大型语言模型(LLM)的现有基准越来越难以区分表现最好的模型,这突显了需要更具挑战性的评估框架。我们引入了MMLU-Pro+,这是一个建立在MMLU-Pro基础上的增强基准,用于评估LLMS中的快捷学习和高阶推理。通过将不同领域的问题与多个正确答案结合起来,MMLU-Pro+测试了LLMS从事复杂推理和抵制简单化问题解决策略的能力。我们的结果表明,MMLU-Pro+保持了MMLU-Pro的难度,同时提供了更严格的模型区分测试,特别是在多个正确答案的情况下。我们引入了新的度量,如快捷选择率和正确配对识别率,为模型行为和锚定偏差提供了更深入的见解。对五个最先进的LLM的评估显示出显著的性能差距,突出了推理能力和偏差敏感性的差异。我们在此HTTPS URL上发布数据集和评估代码。

[NLP-42] Therapy as an NLP Task: Psychologists' Comparison of LLMs and Human Peers in CBT
[NLP-42] 作为NLP任务的治疗:心理学家对CBT中LLM与人类同伴的比较

链接: https://arxiv.org/abs/2409.02244
作者: Zainab Iftikhar,Sean Ransom,Amy Xiao,Jeff Huang
关键词-EN: Wider access, mental health treatment, mental health, Cognitive Behavioral Therapy, biggest challenges
关键词-ZH: 更广泛的接触、心理健康治疗、心理健康、认知行为疗法、最大的挑战
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Wider access to therapeutic care is one of the biggest challenges in mental health treatment. Due to institutional barriers, some people seeking mental health support have turned to large language models (LLMs) for personalized therapy, even though these models are largely unsanctioned and untested. We investigate the potential and limitations of using LLMs as providers of evidence-based therapy by using mixed methods clinical metrics. Using HELPERT, a prompt run on a large language model using the same process and training as a comparative group of peer counselors, we replicated publicly accessible mental health conversations rooted in Cognitive Behavioral Therapy (CBT) to compare session dynamics and counselor’s CBT-based behaviors between original peer support sessions and their reconstructed HELPERT sessions. Two licensed, CBT-trained clinical psychologists evaluated the sessions using the Cognitive Therapy Rating Scale and provided qualitative feedback. Our findings show that the peer sessions are characterized by empathy, small talk, therapeutic alliance, and shared experiences but often exhibit therapist drift. Conversely, HELPERT reconstructed sessions exhibit minimal therapist drift and higher adherence to CBT methods but display a lack of collaboration, empathy, and cultural understanding. Through CTRS ratings and psychologists’ feedback, we highlight the importance of human-AI collaboration for scalable mental health. Our work outlines the ethical implication of imparting human-like subjective qualities to LLMs in therapeutic settings, particularly the risk of deceptive empathy, which may lead to unrealistic patient expectations and potential harm.
摘要:更广泛地获得治疗护理是心理健康治疗中最大的挑战之一。由于体制障碍,一些寻求心理健康支持的人转向大型语言模型(LLM)进行个性化治疗,尽管这些模型在很大程度上是未经批准和测试的。我们调查了使用LLMS作为循证治疗提供者的潜力和局限性,通过使用混合方法的临床指标。使用HELPERT,这是一个大型语言模型上的快速运行,使用与同伴辅导员对照组相同的过程和培训,我们复制了植根于认知行为疗法(CBT)的可公开访问的心理健康对话,以比较原始同伴支持会话和重建的HELPERT会话之间的会话动力学和咨询师基于CBT的行为。两名有执照的、接受过CBT培训的临床心理学家使用认知治疗评定量表对会议进行评估,并提供定性反馈。我们的发现表明,同伴会议的特点是移情、闲聊、治疗联盟和分享经验,但经常表现出治疗师的漂移。相反,HELPERT重建的治疗过程显示出最小的治疗师漂移和对CBT方法的更高依从性,但显示出缺乏合作、同理心和文化理解。通过CTRS评级和心理学家的反馈,我们强调了人类-人工智能协作对可扩展心理健康的重要性。我们的工作概述了在治疗环境中赋予LLM类似人类的主观品质的伦理含义,特别是欺骗性同理心的风险,这可能导致不切实际的患者期望和潜在的伤害。

[NLP-43] Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
[NLP-43] 用于ASR的保持时序的基于最优传输的跨模态知识迁移学习

链接: https://arxiv.org/abs/2409.02239
作者: Xugang Lu,Peng Shen,Yu Tsao,Hisashi Kawai
关键词-EN: automatic speech recognition, knowledge transfer, speech recognition, linguistic knowledge transfer, knowledge
关键词-ZH: 自动语音识别,知识转移,语音识别,语言知识转移,知识
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
摘要:将语言知识从预先训练的语言模型(PLM)转移到声学模型可以极大地提高自动语音识别(ASR)的性能。然而,由于跨模态中特征分布的异质性,设计一种有效的语言序列和声学序列之间的特征对齐和知识传递模型仍然是一个具有挑战性的任务。最优传输(OT)有效地度量了概率分布的差异,具有在声学和语言模态之间对齐和传递知识的巨大潜力。然而,原OT算法将声学和语言特征序列视为两个无序集进行对齐,在OT耦合估计过程中忽略了时序信息。因此,需要一个耗时的预训练阶段来学习声学和语言表征之间的良好匹配。本文提出了一种基于保持时序的最优传输(TOT)的跨模态对齐与知识迁移方法(TOT-CAKT),用于ASR。在TOT-CAKT中,声学序列的局部相邻帧被平滑地映射到语言序列的相邻区域,在特征对齐和匹配中保持它们之间的时序关系。在TOT-CAKT模型框架下,我们用一个预先训练好的汉语PLM进行了汉语ASR的语言知识迁移实验。实验结果表明,与现有的几种基于语言知识传递的ASR模型相比,TOT-CAKT显著提高了ASR的性能,并弥补了原有基于OT的ASR序列特征对齐方法的不足。
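
One way to picture "temporal order preserved" coupling is to add a monotonicity-encouraging term to the transport cost before running entropic OT. The Sinkhorn sketch below uses a quadratic diagonal prior as that term; the prior's form and its weight are assumptions for illustration, not the TOT-CAKT construction itself.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, iters=200):
    K = np.exp(-cost / reg)
    a = np.full(cost.shape[0], 1 / cost.shape[0])   # uniform frame mass
    b = np.full(cost.shape[1], 1 / cost.shape[1])   # uniform token mass
    v = np.ones(cost.shape[1])
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]              # transport plan

T, L = 12, 5                                        # acoustic frames, tokens
acoustic = np.random.randn(T, 8)
linguistic = np.random.randn(L, 8)
feat_cost = np.linalg.norm(acoustic[:, None] - linguistic[None, :], axis=-1)

i = np.arange(T)[:, None] / (T - 1)
j = np.arange(L)[None, :] / (L - 1)
order_cost = (i - j) ** 2                           # penalize non-monotonic couplings

plan = sinkhorn(feat_cost + 5.0 * order_cost)
print(plan.shape, round(plan.sum(), 4))             # (12, 5), ~1.0
```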

[NLP-44] Unforgettable Generalization in Language Models
[NLP-44] 语言模型中不可想象的概括

链接: https://arxiv.org/abs/2409.02228
作者: Eric Zhang,Leshem Chosen,Jacob Andreas
关键词-EN: training set, training, forgetting, tasks, LMs
关键词-ZH: 训练集、训练、遗忘、任务、LM
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 18 pages, 9 figures, published in First Conference on Language Modeling 2024

点击查看摘要

Abstract:When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs’ initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs’ representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
摘要:当语言模型(LM)被训练来忘记(或“去学习”)一项技能时,它们的行为究竟如何变化?我们研究了通过在随机标签上微调而“忘记”任务的Transformer语言模型的行为。这种LM学会对用于遗忘的“训练”集中的单个样本产生近乎随机的预测。然而,在不同任务之间,LM的预测是否会在训练集之外的样本上发生变化表现出极大的差异。在一些任务(如蕴涵分类)中,遗忘会很好地泛化,并导致模型对新任务实例产生没有信息量的预测;在其他任务(如物理常识推理和科学问题回答)中,遗忘只影响训练样本,并且即使对于与训练集中出现的例子非常相似的例子,模型也会继续准确地执行“被遗忘”的任务。数据集难度并不能预测一种行为是否会被遗忘;相反,遗忘中的泛化是由LM初始任务预测的置信度和训练数据的LM表征的可变性(弱)预测的,其中低置信度和低可变性都与更大的泛化相关。也许最令人惊讶的是,随机标签遗忘似乎对训练集的内容不太敏感:例如,对带有随机标签的科学问题进行训练的模型继续准确地回答其他科学问题,但开始在蕴涵分类任务中产生随机标签。最后,我们证明了即使是可泛化的遗忘也是浅层的:在LM表征上训练的线性探测器在遗忘后仍然可以可靠地执行任务。我们的结果突显了通过微调从模型中进行有针对性的技能移除的难度和不可预测性。

[NLP-45] Efficient and Scalable Estimation of Tool Representations in Vector Space
[NLP-45] 载体空间中工具表示的有效且可扩展的估计

链接: https://arxiv.org/abs/2409.02141
作者: Suhong Moon,Siddharth Jha,Lutfi Eren Erdogan,Sehoon Kim,Woosang Lim,Kurt Keutzer,Amir Gholami
关键词-EN: execute complex tasks, external information sources, Recent advancements, tool retrieval, complex tasks
关键词-ZH: 执行复杂任务、外部信息源、最新进展、工具检索、复杂任务
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address those challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at this https URL
摘要:函数调用和工具使用方面的最新进展使大型语言模型(LLM)能够与外部信息源交互并执行复杂任务,从而显著增强了它们的能力。然而,当有大量工具可用时,LLM有限的上下文窗口带来了挑战,这就需要有效的方法来管理提示长度并保持准确性。现有的方法,如微调LLM或利用它们的推理能力,要么需要频繁的再训练,要么会产生显著的延迟开销。更有效的解决方案是训练较小的模型来检索与给定查询最相关的工具,尽管这需要高质量的特定领域数据。为了应对这些挑战,我们提出了一个为工具检索应用生成合成数据的新框架,以及一种使用小型编码器模型的高效数据驱动工具检索策略。在LLM的支持下,我们创建了ToolBank,这是一个反映真实用户使用情况的新工具检索数据集。在工具检索方法上,我们提出了几种新方法:(1)Tool2Vec:由使用情况驱动的工具嵌入生成方法;(2)ToolRefiner:迭代提高所检索工具质量的分阶段检索方法;(3)MLC:将工具检索建模为多标签分类问题。使用这些新方法,我们在ToolBench数据集上将Recall@K提高了多达27.28,在ToolBank上提高了30.5。此外,我们还给出了进一步的实验结果来严格验证我们的方法。我们的代码可在此HTTPS URL中找到
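
In the spirit of the usage-driven idea, a tool can be represented by the mean embedding of requests that invoked it, and retrieval reduces to cosine search. The `embed()` function below is a hash-seeded toy encoder standing in for a real sentence model; the tool names and example requests are invented.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy encoder (deterministic within a run); swap in a real model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

usages = {
    "weather_api":  ["what's the weather in Paris", "will it rain tomorrow"],
    "calendar_api": ["book a meeting on Friday", "move my 3pm call"],
}
# Tool vector = mean embedding of the requests that used the tool.
tool_vecs = {t: np.mean([embed(u) for u in ex], axis=0)
             for t, ex in usages.items()}

def retrieve(query: str, k: int = 1):
    q = embed(query)
    return sorted(tool_vecs, key=lambda t: -q @ tool_vecs[t])[:k]

print(retrieve("schedule lunch with Dana"))
```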

[NLP-46] Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data
[NLP-46] 大型语言模型与经典机器学习:使用多维表格数据预测COVID-19死亡率的性能

链接: https://arxiv.org/abs/2409.02136
作者: Mohammadreza Ghaffarzadeh-Esfahani,Mahdi Ghaffarzadeh-Esfahani,Arian Salahi-Niri,Hossein Toreyhi,Zahra Atf,Amirali Mohsenzadeh-Kermani,Mahshad Sarikhani,Zohreh Tajabadi,Fatemeh Shojaeian,Mohammad Hassan Bagheri,Aydin Feyzi,Mohammadamin Tarighatpayma,Narges Gazmeh,Fateme Heydari,Hossein Afshar,Amirreza Allahgholipour,Farid Alimardani,Ameneh Salehi,Naghmeh Asadimanesh,Mohammad Amin Khalafi,Hadis Shabanipour,Ali Moradi,Sajjad Hossein Zadeh,Omid Yazdani,Romina Esbati,Moozhan Maleki,Danial Samiei Nasr,Amirali Soheili,Hossein Majlesi,Saba Shahsavan,Alireza Soheilipour,Nooshin Goudarzi,Erfan Taherifard,Hamidreza Hatamabadi,Jamil S Samaan,Thomas Savage,Ankit Sakhuja,Ali Soroush,Girish Nadkarni,Ilad Alavi Darazam,Mohamad Amin Pourhoseingholi,Seyed Amir Ahmad Safavi-Naini
关键词-EN: large language models, CML models, study aimed, aimed to evaluate, evaluate and compare
关键词-ZH: 大型语言模型、CML模型、旨在评估、评估和比较的研究
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code is available at: this https URL and this https URL . The datasets are available from the corresponding author on reasonable request (sdamirsa@ymail.com)

点击查看摘要

Abstract:Background: This study aimed to evaluate and compare the performance of classical machine learning models (CMLs) and large language models (LLMs) in predicting mortality associated with COVID-19 by utilizing a high-dimensional tabular dataset. Materials and Methods: We analyzed data from 9,134 COVID-19 patients collected across four hospitals. Seven CML models, including XGBoost and random forest (RF), were trained and evaluated. The structured data was converted into text for zero-shot classification by eight LLMs, including GPT-4 and Mistral-7b. Additionally, Mistral-7b was fine-tuned using the QLoRA approach to enhance its predictive capabilities. Results: Among the CML models, XGBoost and RF achieved the highest accuracy, with F1 scores of 0.87 for internal validation and 0.83 for external validation. In the LLM category, GPT-4 was the top performer with an F1 score of 0.43. Fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, resulting in an F1 score of 0.74, which was stable during external validation. Conclusion: While LLMs show moderate performance in zero-shot classification, fine-tuning can significantly enhance their effectiveness, potentially aligning them closer to CML models. However, CMLs still outperform LLMs in high-dimensional tabular data tasks.
摘要:背景:本研究旨在利用高维表格数据集评估和比较经典机器学习模型(CML)和大语言模型(LLM)在预测新冠肺炎相关死亡率方面的性能。材料和方法:我们分析了从四家医院收集的9,134名新冠肺炎患者的数据。对包括XGBoost和随机森林(RF)在内的7个CML模型进行了训练和评估。包括GPT-4和Mistral-7b在内的8个LLM将结构化数据转换为文本进行零镜头分类。此外,Mistral-7b还使用QLoRA方法进行了微调,以增强其预测能力。结果:在CML模型中,XGBoost和RF的准确率最高,内部验证的F1分数为0.87,外部验证的F1分数为0.83。在LLM类别中,GPT-4表现最好,F1得分为0.43。微调后的Mistral-7b显著提高了召回率,从1%提高到79%,F1得分为0.74,在外部验证期间保持稳定。结论:虽然LLM在零镜头分类中表现出中等的性能,但微调可以显著提高它们的有效性,潜在地使它们更接近CML模型。然而,在高维表格数据任务中,CML仍然优于LLM。
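
The "structured data converted into text" step amounts to serializing each row into a prompt. A minimal illustration follows; the feature names and prompt wording are invented for illustration, not the study's actual template.

```python
def row_to_prompt(row: dict) -> str:
    # Serialize one tabular record into a zero-shot classification prompt.
    facts = "; ".join(f"{k.replace('_', ' ')}: {v}" for k, v in row.items())
    return (
        "Patient record: " + facts + ".\n"
        "Question: will this COVID-19 patient survive hospitalization? "
        "Answer strictly 'yes' or 'no'."
    )

row = {"age": 71, "oxygen_saturation": 88, "crp_mg_l": 96, "diabetes": "yes"}
print(row_to_prompt(row))
```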

[NLP-47] Deep Knowledge-Infusion For Explainable Depression Detection
[NLP-47] 深入知识注入可解释抑郁症检测

链接: https://arxiv.org/abs/2409.02122
作者: Sumit Dalal,Sarika Jain,Mayank Dave
关键词-EN: Discovering individuals depression, Discovering individuals, increasingly important, social media, depression
关键词-ZH: 发现个人抑郁症,发现个人,越来越重要,社交媒体,抑郁症
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Discovering individuals’ depression on social media has become increasingly important. Researchers employed ML/DL or lexicon-based methods for automated depression detection. Lexicon-based methods, explainable and easy to implement, match words from user posts in a depression dictionary without considering contexts. While the DL models can leverage contextual information, their black-box nature limits their adoption in the domain. Though surrogate models like LIME and SHAP can produce explanations for DL models, the explanations are suitable for the developer and of limited use to the end user. We propose a Knowledge-infused Neural Network (KiNN) incorporating domain-specific knowledge from DepressionFeature ontology (DFO) in a neural network to endow the model with user-level explainability regarding concepts and processes the clinician understands. Further, commonsense knowledge from the Commonsense Transformer (COMET) trained on ATOMIC is also infused to consider the generic emotional aspects of user posts in depression detection. The model is evaluated on three expertly curated datasets related to depression. We observed the model to have a statistically significant (p<0.1) boost in performance over the best domain-specific model, MentalBERT, across CLEF e-Risk (25% MCC increase, 12% F1 increase). A similar trend is observed across the PRIMATE dataset, where the proposed model performed better than MentalBERT (2.5% MCC increase, 19% F1 increase). The observations confirm the generated explanations to be informative for MHPs compared to post hoc model explanations. Results demonstrated that the user-level explainability of KiNN also surpasses the performance of baseline models and can provide explanations where other baselines fall short. Infusing the domain and commonsense knowledge in KiNN enhances the ability of models like GPT-3.5 to generate application-relevant explanations.
摘要:在社交媒体上发现个体抑郁变得越来越重要。研究人员使用基于ML/DL或基于词典的方法来自动检测抑郁症。基于词典的方法可解释且易于实现,但只在抑郁词典中匹配用户帖子中的单词,不考虑上下文。虽然DL模型可以利用上下文信息,但其黑箱性质限制了它们在该领域的采用。虽然像LIME和SHAP这样的代理模型可以为DL模型生成解释,但这些解释更适合开发人员,对最终用户的用处有限。我们提出了一种知识注入神经网络(KiNN),将抑郁特征本体(DFO)中的领域特定知识整合到神经网络中,使模型在临床医生所理解的概念和过程层面具备用户级可解释性。此外,还注入了在ATOMIC上训练的常识Transformer(COMET)的常识知识,以在抑郁检测中考虑用户帖子的一般情感方面。该模型在三个专业整理的抑郁症相关数据集上进行了评估。我们观察到,在CLEF e-Risk上,该模型相比最好的领域特定模型MentalBERT有统计显著(p<0.1)的性能提升(MCC提高25%,F1提高12%)。在PRIMATE数据集上也观察到类似趋势,所提模型优于MentalBERT(MCC提高2.5%,F1提高19%)。观察结果证实,与事后模型解释相比,生成的解释对精神健康专业人员(MHP)更具信息量。结果表明,KiNN的用户级可解释性也超过了基线模型的性能,并能在其他基线不足之处提供解释。在KiNN中注入领域和常识知识,增强了GPT-3.5等模型生成与应用相关解释的能力。

[NLP-48] CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models
[NLP-48] CoRA:利用大型语言模型的公共子空间优化低等级适应

链接: https://arxiv.org/abs/2409.02119
作者: Xiaojun Xiao,Sen Shen,Qiming Bao,Hongfei Rong,Kairui Liu,Zhongsheng Wang,Jiamou Liu
关键词-EN: large language models, fine-tuning large language, conserving computational resources, fine-tuning large models, constraints is crucial
关键词-ZH: 大型语言模型、微调大型语言、节省计算资源、微调大型模型、约束至关重要
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In fine-tuning large language models (LLMs), conserving computational resources while maintaining effectiveness and improving outcomes within the same computational constraints is crucial. The Low-Rank Adaptation (LoRA) strategy balances efficiency and performance in fine-tuning large models by reducing the number of trainable parameters and computational costs. However, current advancements in LoRA might be focused on its fine-tuning methodologies, with not as much exploration as might be expected into further compression of LoRA. Since most of LoRA’s parameters might still be superfluous, this may lead to unnecessary wastage of computational resources. In this paper, we propose CoRA: leveraging shared knowledge to optimize LoRA training by substituting its matrix B with a common subspace from large models. Our two-fold method includes (1) freezing the substitute matrix B to halve parameters while training matrix A for specific tasks and (2) using the substitute matrix B as an enhanced initial state for the original matrix B, achieving improved results with the same parameters. Our experiments show that the first approach achieves the same efficacy as the original LoRA fine-tuning while being more efficient than halving parameters. At the same time, the second approach has some improvements compared to LoRA’s original fine-tuning performance. They generally attest to the effectiveness of our work.
摘要:在对大型语言模型(LLM)进行微调时,在相同的计算约束下节约计算资源、保持有效性并改善结果至关重要。低秩适配(LoRA)策略通过减少可训练参数的数量和计算成本,在微调大型模型时平衡效率和性能。然而,LoRA目前的进展主要集中在其微调方法上,对LoRA的进一步压缩探索不足。由于LoRA的大多数参数可能仍然是多余的,这可能导致计算资源的不必要浪费。在本文中,我们提出CoRA:利用共享知识优化LoRA训练,将其矩阵B替换为来自大模型的公共子空间。我们的方法包含两方面:(1)冻结替换矩阵B以将可训练参数减半,同时针对特定任务训练矩阵A;(2)使用替换矩阵B作为原始矩阵B的增强初始状态,在相同参数量下获得更好的结果。我们的实验表明,第一种方法达到了与原始LoRA微调相同的效果,且比单纯将参数减半更高效;第二种方法则相比LoRA原有的微调性能有所改进。这些结果总体上证明了我们工作的有效性。
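
摘要的关键点是把LoRA的投影矩阵B换成从大模型中提取的公共子空间并将其冻结,只训练矩阵A。下面用PyTorch给出一个极简草图(基于摘要描述的假设性实现,非官方代码;共享子空间此处用随机矩阵代替):

```python
import torch
import torch.nn as nn

class CoRALinear(nn.Module):
    """LoRA 风格的低秩适配:base(x) + B A x,其中 B 为共享且冻结的公共子空间。"""

    def __init__(self, base: nn.Linear, shared_B: torch.Tensor, rank: int):
        super().__init__()
        self.base = base                                  # 冻结的预训练权重
        for p in self.base.parameters():
            p.requires_grad = False
        # B: (out_features, rank),来自大模型的公共子空间,冻结不训练
        self.B = nn.Parameter(shared_B, requires_grad=False)
        # A: (rank, in_features),按任务训练,可训练参数量约减半
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

base = nn.Linear(512, 512)
shared_B = 0.01 * torch.randn(512, 8)   # 演示用随机矩阵:真实方法中应从大模型权重中提取
layer = CoRALinear(base, shared_B, rank=8)
out = layer(torch.randn(2, 512))
```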

[NLP-49] TSO: Self-Training with Scaled Preference Optimization
[NLP-49] TSO:基于缩放偏好优化的自我训练

链接: https://arxiv.org/abs/2409.02118
作者: Kaihui Chen,Hao Yi,Qingyang Li,Tianyu Qi,Yulan Hu,Fuzheng Zhang,Yong Liu
关键词-EN: Enhancing the conformity, ongoing research challenge, Preference, large language models, Direct Preference Optimization
关键词-ZH: 增强一致性、持续的研究挑战、偏好、大型语言模型、直接偏好优化
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Enhancing the conformity of large language models (LLMs) to human preferences remains an ongoing research challenge. Recently, offline approaches such as Direct Preference Optimization (DPO) have gained prominence as attractive options, offering improvements that are simple, efficient, and stable, without requiring interaction with reward models. However, these offline preference optimization methods highly rely on the quality of pairwise preference samples. Meanwhile, numerous iterative methods require additional training of reward models to select positive and negative samples from the model’s own generated responses for preference learning. Furthermore, as LLMs’ capabilities advance, it is quite challenging to continuously construct high-quality positive and negative preference instances from the model’s outputs due to the lack of diversity. To tackle these challenges, we propose TSO, or Self-Training with Scaled Preference Optimization, a framework for preference optimization that conducts self-training preference learning without training an additional reward model. TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses. Furthermore, TSO introduces corrections for model preference errors through human and AI feedback. Finally, TSO adopts iterative and dual clip reward strategies to update the reference model and its responses, adaptively adjusting preference data and balancing the optimization process. Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks, providing practical insight into preference data construction and model training strategies in the alignment domain.
摘要:提高大语言模型(LLM)与人类偏好的一致性仍然是一个持续的研究挑战。最近,直接偏好优化(DPO)等离线方法因其简单、高效、稳定,且无需与奖励模型交互即可带来有效改进,成为有吸引力的选择。然而,这些离线偏好优化方法高度依赖成对偏好样本的质量。同时,许多迭代方法需要额外训练奖励模型,以便从模型自己生成的响应中选择正负样本进行偏好学习。此外,随着LLM能力的提高,由于缺乏多样性,从模型的输出中持续构造高质量的正负偏好实例相当具有挑战性。为了应对这些挑战,我们提出了TSO(基于缩放偏好优化的自我训练),这是一个无需训练额外奖励模型即可进行自我训练偏好学习的偏好优化框架。TSO通过构建模型矩阵并结合人类偏好响应来增强响应的多样性。此外,TSO通过人类和AI反馈引入对模型偏好错误的修正。最后,TSO采用迭代和双重裁剪奖励策略来更新参考模型及其响应,自适应地调整偏好数据并平衡优化过程。实验结果表明,TSO在各种对齐评估基准上优于现有主流方法,为对齐领域中的偏好数据构建和模型训练策略提供了实用的见解。
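
TSO建立在DPO这类离线偏好优化之上。作为背景,下面给出公开文献中标准DPO损失的一个极简PyTorch草图(并非TSO本身的实现;输入为整条回答的对数概率之和):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """标准 DPO 损失:让策略模型相对参考模型更偏好被选中的回答。"""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# 用法示例(随机数代替真实对数概率,形状均为 (batch,))
pc, pr, rc, rr = (torch.randn(4) for _ in range(4))
loss = dpo_loss(pc, pr, rc, rr)
```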

[NLP-50] Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection
[NLP-50] Tiny-Toxic-Detector:一个用于有毒内容检测的紧凑型Transformer模型

链接: https://arxiv.org/abs/2409.02114
作者: Michiel Kamphuis
关键词-EN: toxic content detection, compact transformer-based model, transformer-based model designed, compact transformer-based, designed for toxic
关键词-ZH: 有毒内容检测,基于紧凑型变压器的模型,设计的基于变压器的模型,基于紧凑型变压器的,专为有毒物质设计
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6 pages

点击查看摘要

Abstract:This paper presents Tiny-toxic-detector, a compact transformer-based model designed for toxic content detection. Despite having only 2.1 million parameters, Tiny-toxic-detector achieves competitive performance on benchmark datasets, with 90.97% accuracy on ToxiGen and 86.98% accuracy on the Jigsaw dataset, rivaling models over 50 times its size. This efficiency enables deployment in resource-constrained environments, addressing the need for effective content moderation tools that balance performance with computational efficiency. The model architecture features 4 transformer encoder layers, each with 2 attention heads, an embedding dimension of 64, and a feedforward dimension of 128. Trained on both public and private datasets, Tiny-toxic-detector demonstrates the potential of efficient, task-specific models for addressing online toxicity. The paper covers the model architecture, training process, performance benchmarks, and limitations, underscoring its suitability for applications such as social media monitoring and content moderation. By achieving results comparable to much larger models while significantly reducing computational demands, Tiny-toxic-detector represents progress toward more sustainable and scalable AI-driven content moderation solutions.
摘要:本文介绍了Tiny-toxic-detector,一种为有毒内容检测设计的紧凑型基于Transformer的模型。尽管只有210万个参数,Tiny-toxic-detector在基准数据集上取得了有竞争力的性能:在ToxiGen上的准确率为90.97%,在Jigsaw数据集上的准确率为86.98%,可与其50倍以上大小的模型相媲美。这种效率使其可以部署在资源受限的环境中,满足对兼顾性能与计算效率的有效内容审核工具的需求。该模型架构包含4个Transformer编码器层,每层有2个注意力头,嵌入维度为64,前馈维度为128。Tiny-toxic-detector在公共和私有数据集上进行训练,展示了高效、面向特定任务的模型在应对在线有毒内容方面的潜力。本文涵盖了模型架构、训练流程、性能基准和局限性,强调其适用于社交媒体监控和内容审核等应用。通过在显著降低计算需求的同时取得与更大模型相当的结果,Tiny-toxic-detector代表着朝着更可持续、可扩展的AI驱动内容审核方案迈出的一步。
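
摘要给出了完整的结构超参数(4层Transformer编码器、每层2个注意力头、嵌入维度64、前馈维度128),据此可以用PyTorch直接搭出一个结构一致的草图;词表大小、池化方式和输入长度为假设:

```python
import torch
import torch.nn as nn

class TinyToxicDetector(nn.Module):
    def __init__(self, vocab_size=30522):                  # 词表大小为假设
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)          # 嵌入维度 64
        layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=2, dim_feedforward=128,      # 2 个注意力头、前馈 128
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # 4 层编码器
        self.classifier = nn.Linear(64, 2)                 # 有毒 / 无毒

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))
        return self.classifier(h.mean(dim=1))              # 平均池化(假设)

model = TinyToxicDetector()
logits = model(torch.randint(0, 30522, (1, 128)))
print(sum(p.numel() for p in model.parameters()))  # 在该词表假设下约 2.1M,与论文同量级
```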

[NLP-51] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI
[NLP-51] GenAgent:通过自动化工作流生成构建协作人工智能系统–ComfyUI案例研究

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
关键词-ZH: 开发整体模型,之前的人工智能研究,研究已经专注于,专注于开发,最大化他们的智能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.
摘要:以前的许多人工智能研究都集中在开发单一模型,以最大限度地提高它们的智能和能力,主要目标是提高特定任务的性能。相反,本文探索了另一种方法:协作式人工智能系统,它使用工作流来集成模型、数据源和管道,以解决复杂和多样化的任务。我们引入了GenAgent,这是一个基于LLM的框架,可以自动生成复杂的工作流,与单一模型相比,提供了更大的灵活性和可伸缩性。GenAgent的核心创新在于用代码表示工作流,同时用协作代理循序渐进地构建工作流。我们在ComfyUI平台上实现了GenAgent,并提出了一个新的基准测试程序OpenComfy。结果表明,GenAgent在运行级和任务级的评估中都优于基准方法,显示了其生成具有卓越有效性和稳定性的复杂工作流的能力。
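
GenAgent的核心是"用代码表示工作流"并由智能体逐步拼装。下面用一个与ComfyUI真实节点格式无关的极简字典式工作流表示来示意这一思想(节点名与函数均为演示用假设):

```python
# 工作流 = {节点名: (函数, 依赖节点列表)},按依赖递归求值
workflow = {
    "load_prompt": (lambda: "a cat in space", []),
    "generate":    (lambda p: f"<image from '{p}'>", ["load_prompt"]),
    "upscale":     (lambda img: img.replace("image", "hi-res image"), ["generate"]),
}

def run(workflow: dict) -> dict:
    results = {}
    def execute(name):
        if name not in results:                 # 每个节点只执行一次
            fn, deps = workflow[name]
            results[name] = fn(*(execute(d) for d in deps))
        return results[name]
    for node in workflow:
        execute(node)
    return results

print(run(workflow)["upscale"])
```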

[NLP-52] Conversational Complexity for Assessing Risk in Large Language Models
[NLP-52] 评估大型语言模型中风险的对话复杂性

链接: https://arxiv.org/abs/2409.01247
作者: John Burden,Manuel Cebrian,Jose Hernandez-Orallo
关键词-EN: Large Language Models, Language Models, enable beneficial applications, Large Language, present a dual-use
关键词-ZH: 大型语言模型,语言模型,实现有益的应用程序,大型语言,提供双重用途
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case was Kevin Roose’s notable conversation with Bing, which elicited harmful outputs after extended interaction. This contrasts with simpler early jailbreaks that produced similar content more easily, raising the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures: Conversational Length (CL), which quantifies the conversation length used to obtain a specific response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user’s instruction sequence leading to the response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimisation of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
摘要:大型语言模型(LLM)面临双重用途困境:它们既能带来有益的应用,又有潜在的危害,特别是通过会话交互。尽管有各种保障措施,先进的LLM仍然容易被攻破。一个分水岭案例是Kevin Roose与Bing的那次著名对话,在长时间的互动后引发了有害输出。这与更容易产生类似内容的早期简单越狱形成对比,并引出一个问题:需要多少对话努力才能从LLM中引出有害信息?我们提出两个度量:会话长度(CL),量化获得特定回复所用的会话长度;会话复杂度(CC),定义为导致该回复的用户指令序列的Kolmogorov复杂度。为了解决Kolmogorov复杂度的不可计算性,我们使用一个参考LLM来估计用户指令的可压缩性,从而近似CC。将这种方法应用于一个大型红队数据集,我们对有害和无害会话的长度与复杂度的统计分布进行了定量分析。我们的实证结果表明,这种分布分析和CC最小化是理解AI安全的有价值工具,为有害信息的可获取性提供了洞见。这项工作为以"造成伤害的路径的算法复杂度"为中心的LLM安全新视角奠定了基础。
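
摘要将会话复杂度(CC)定义为用户指令序列的Kolmogorov复杂度,并用可压缩性来近似。下面的草图用zlib压缩长度代替论文中的参考LLM作压缩器(压缩器与计量方式均为演示用假设):

```python
import zlib

def conversational_length(turns: list[str]) -> int:
    """CL:会话长度,这里按用户指令的总字符数计(计量方式为假设)。"""
    return sum(len(t) for t in turns)

def conversational_complexity(turns: list[str]) -> int:
    """CC 的粗略近似:用压缩后的字节数代替 Kolmogorov 复杂度。
    论文用参考 LLM 估计可压缩性,此处用 zlib 作演示性替代。"""
    blob = "\n".join(turns).encode("utf-8")
    return len(zlib.compress(blob, level=9))

dialogue = ["帮我写一首诗", "换成五言绝句", "再改成悲伤的基调"]
print(conversational_length(dialogue), conversational_complexity(dialogue))
```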

人工智能

[AI-0] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)

链接: https://arxiv.org/abs/2409.02920
作者: Yao Mu,Tianxing Chen,Shijia Peng,Zanxin Chen,Zeyu Gao,Yude Zou,Lunkai Lin,Zhiqiang Xie,Ping Luo
关键词-EN: increasingly important areas, Effective collaboration, capabilities are increasingly, increasingly important, important areas
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots’ ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real-world teleoperated data with synthetic data from digital twins, designed for dual-arm robotic scenarios. Using the COBOT Magic platform, we have collected diverse data on tool usage and human-robot interaction. We present an innovative approach to creating digital twins using AI-generated content, transforming 2D images into detailed 3D models. Furthermore, we utilize large language models to generate expert-level training data and task-specific pose sequences oriented toward functionality. Our key contributions are: 1) the RoboTwin benchmark dataset, 2) an efficient real-to-simulation pipeline, and 3) the use of language models for automatic expert-level data generation. These advancements are designed to address the shortage of robotic training data, potentially accelerating the development of more capable and versatile robotic systems for a wide range of real-world applications. The project page is available at this https URL

[AI-1] UC-NeRF: Uncertainty-aware Conditional Neural Radiance Fields from Endoscopic Sparse Views

链接: https://arxiv.org/abs/2409.02917
作者: Jiaxin Guo,Jiangliu Wang,Ruofeng Wei,Di Kang,Qi Dou,Yun-hui Liu
关键词-EN: Visualizing surgical scenes, minimally invasive procedures, revealing internal anatomical, internal anatomical structures, Visualizing surgical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at this https URL.

[AI-2] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling

链接: https://arxiv.org/abs/2409.02908
作者: Kaiwen Zheng,Yongxin Chen,Hanzi Mao,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang
关键词-EN: language modeling tasks, popular research topic, discrete diffusion models, Masked diffusion models, diffusion models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 40 pages

点击查看摘要

Abstract:Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs’ original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20× speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs’ generation results in the previous literature.
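
摘要指出32位浮点精度下的分类采样存在数值问题,会变相降低采样温度。下面用NumPy演示常见的Gumbel-max分类采样,并对比float32与float64精度下的经验分布(演示性实验,非论文原始代码):

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, dtype) -> int:
    """Gumbel-max 技巧:argmax(logits + g),g 为标准 Gumbel 噪声。
    低精度下 uniform 样本在 0/1 附近分辨率不足,会使噪声分布失真。"""
    u = np.random.uniform(size=logits.shape).astype(dtype)
    g = -np.log(-np.log(u))
    return int(np.argmax(logits.astype(dtype) + g))

logits = np.array([2.0, 1.0, 0.5, 0.0])
for dtype in (np.float32, np.float64):
    draws = [gumbel_max_sample(logits, dtype) for _ in range(20000)]
    counts = np.bincount(draws, minlength=4)
    print(dtype.__name__, counts / counts.sum())
```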

[AI-3] LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

链接: https://arxiv.org/abs/2409.02889
作者: Xidong Wang,Dingjie Song,Shunian Chen,Chen Zhang,Benyou Wang
关键词-EN: Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large, high-resolution image understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 19 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.

[AI-4] Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test

链接: https://arxiv.org/abs/2409.02883
作者: Junyoung Park,Eun Hyun Seo,Sunjun Kim,SangHak Yi,Kun Ho Lee,Sungho Won
关键词-EN: Rey Complex Figure, Complex Figure Test, Rey Complex, Complex Figure, Drawing tests
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Drawing tests like the Rey Complex Figure Test (RCFT) are widely used to assess cognitive functions such as visuospatial skills and memory, making them valuable tools for detecting mild cognitive impairment (MCI). Despite their utility, existing predictive models based on these tests often suffer from limitations like small sample sizes and lack of external validation, which undermine their reliability. We developed a multi-stream deep learning framework that integrates two distinct processing streams: a multi-head self-attention based spatial stream using raw RCFT images and a scoring stream employing a previously developed automated scoring system. Our model was trained on data from 1,740 subjects in the Korean cohort and validated on an external hospital dataset of 222 subjects from Korea. The proposed multi-stream model demonstrated superior performance over baseline models (AUC = 0.872, Accuracy = 0.781) in external validation. The integration of both spatial and scoring streams enables the model to capture intricate visual details from the raw images while also incorporating structured scoring data, which together enhance its ability to detect subtle cognitive impairments. This dual approach not only improves predictive accuracy but also increases the robustness of the model, making it more reliable in diverse clinical settings. Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early MCI screening.

[AI-5] Configurable Foundation Models: Building LLMs from a Modular Perspective

链接: https://arxiv.org/abs/2409.02877
作者: Chaojun Xiao,Zhengyan Zhang,Chenyang Song,Dazhi Jiang,Feng Yao,Xu Han,Xiaozhi Wang,Shuo Wang,Yufei Huang,Guanyu Lin,Yingfa Chen,Weilin Zhao,Yuge Tu,Zexuan Zhong,Ao Zhang,Chenglei Si,Khai Hao Moo,Chenyang Zhao,Huimin Chen,Yankai Lin,Zhiyuan Liu,Jingbo Shang,Maosong Sun
关键词-EN: abilities increasingly cumbersome, recently unveiled challenges, unveiled challenges tied, continual scalability due, limited computation resources
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
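
摘要提出的四种面向"砖块"的操作中,检索与路由是核心:按输入动态选取少量功能模块参与推理。下面给出一个最小的top-k路由草图(与mixture-of-experts的门控类似,结构细节为假设):

```python
import torch
import torch.nn as nn

class BrickRouter(nn.Module):
    """对每个输入检索 top-k 个功能砖块(此处以线性层代表)并加权组合其输出。"""

    def __init__(self, dim=64, num_bricks=8, k=2):
        super().__init__()
        self.bricks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_bricks))
        self.gate = nn.Linear(dim, num_bricks)       # 路由打分
        self.k = k

    def forward(self, x):                            # x: (batch, dim)
        topv, topi = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(topv, dim=-1)        # 仅在被选中的砖块间归一化
        out = torch.zeros_like(x)
        for j in range(self.k):                      # 只执行被路由到的砖块
            idx = topi[:, j]
            for b in idx.unique():
                mask = idx == b
                out[mask] += weights[mask, j:j+1] * self.bricks[int(b)](x[mask])
        return out

y = BrickRouter()(torch.randn(4, 64))
```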

[AI-6] Hybrid Imitation-Learning Motion Planner for Urban Driving

链接: https://arxiv.org/abs/2409.02871
作者: Cristian Gariboldi,Matteo Corno,Beng Jin
关键词-EN: open source datasets, nuPlan and Argoverse, release of open, open source, source datasets
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the release of open source datasets such as nuPlan and Argoverse, the research around learning-based planners has spread a lot in the last years. Existing systems have shown excellent capabilities in imitating the human driver behaviour, but they struggle to guarantee safe closed-loop driving. Conversely, optimization-based planners offer greater security in short-term planning scenarios. To confront this challenge, in this paper we propose a novel hybrid motion planner that integrates both learning-based and optimization-based techniques. Initially, a multilayer perceptron (MLP) generates a human-like trajectory, which is then refined by an optimization-based component. This component not only minimizes tracking errors but also computes a trajectory that is both kinematically feasible and collision-free with obstacles and road boundaries. Our model effectively balances safety and human-likeness, mitigating the trade-off inherent in these objectives. We validate our approach through simulation experiments and further demonstrate its efficacy by deploying it in real-world self-driving vehicles.

[AI-7] Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant

链接: https://arxiv.org/abs/2409.02864
作者: Joshua Pickard,Marc Andrew Choi,Natalie Oliven,Cooper Stansbury,Jillian Cwycyshyn,Nicholas Galioto,Alex Gorodetsky,Alvaro Velasquez,Indika Rajapakse
关键词-EN: Retrieval Augmentation Data, Augmentation Data, Bioinformatics Retrieval Augmentation, Retrieval Augmentation, BRAD
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:We present a prototype for a Bioinformatics Retrieval Augmentation Data (BRAD) digital assistant. BRAD integrates a suite of tools to handle a wide range of bioinformatics tasks, from code execution to online search. We demonstrate BRAD’s capabilities through (1) improved question-and-answering with retrieval augmented generation (RAG), (2) BRAD’s ability to run and write complex software pipelines, and (3) BRAD’s ability to organize and distribute tasks across individual and teams of agents. We use BRAD for automation of bioinformatics workflows, performing tasks ranging from gene enrichment and searching the archive to automatic code generation and running biomarker identification pipelines. BRAD is a step toward the ultimate goal to develop a digital twin of laboratories driven by self-contained loops for hypothesis generation and testing of digital biology experiments.

[AI-8] Oops I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

链接: https://arxiv.org/abs/2409.02850
作者: Raphael Lafargue,Luke Smith,Franck Vermet,Mathias Löwe,Ian Reid,Vincent Gripon,Jack Valmadre
关键词-EN: few-shot learning, computing confidence intervals, predominant method, computing confidence, multiple tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e., allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at this https URL
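
摘要的核心论点是:按"有放回"方式抽任务计算的置信区间只反映采样器的随机性,会被系统性低估:同一批任务反复重抽可以把区间压到任意小,却没有带来新信息。下面用一个合成实验演示这一点(演示性草图,任务准确率为随机生成):

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(0.80, 0.05, size=600)    # 假设:600 个任务的"真实"准确率全集

def ci_halfwidth(samples):
    """均值的 95% 置信区间半宽。"""
    return 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))

# 有放回:同一任务可重复出现,n 可以任意大,CI 被人为压小
with_rep = pool[rng.integers(0, len(pool), size=10_000)]
# 无放回:每个任务最多出现一次,n 受任务总数限制
without_rep = rng.choice(pool, size=600, replace=False)

print("with replacement    CI ±", round(ci_halfwidth(with_rep), 4))
print("without replacement CI ±", round(ci_halfwidth(without_rep), 4))
```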

[AI-9] R2GQA: Retriever-Reader-Generator Question Answering System to Support Students Understanding Legal Regulations in Higher Education

链接: https://arxiv.org/abs/2409.02840
作者: Phuc-Tinh Pham Do,Duy-Ngoc Dinh Cao,Khanh Quoc Tran,Kiet Van Nguyen
关键词-EN: Question Answering system, Machine Reader module, Machine Reader, Question Answering, Retriever module employs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this article, we propose the R2GQA system, a Retriever-Reader-Generator Question Answering system, consisting of three main components: Document Retriever, Machine Reader, and Answer Generator. The Retriever module employs advanced information retrieval techniques to extract the context of articles from a dataset of legal regulation documents. The Machine Reader module utilizes state-of-the-art natural language understanding algorithms to comprehend the retrieved documents and extract answers. Finally, the Generator module synthesizes the extracted answers into concise and informative responses to questions of students regarding legal regulations. Furthermore, we built the ViRHE4QA dataset in the domain of university training regulations, comprising 9,758 question-answer pairs with a rigorous construction process. This is the first Vietnamese dataset in the higher regulations domain with various types of answers, both extractive and abstractive. In addition, the R2GQA system is the first system to offer abstractive answers in Vietnamese. This paper discusses the design and implementation of each module within the R2GQA system on the ViRHE4QA dataset, highlighting their functionalities and interactions. Furthermore, we present experimental results demonstrating the effectiveness and utility of the proposed system in supporting the comprehension of students of legal regulations in higher education settings. In general, the R2GQA system and the ViRHE4QA dataset promise to contribute significantly to related research and help students navigate complex legal documents and regulations, empowering them to make informed decisions and adhere to institutional policies effectively. Our dataset is available for research purposes.

[AI-10] Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models

链接: https://arxiv.org/abs/2409.02836
作者: Moein Shahiki Tash,Zahra Ahani,Mohim Tash,Olga Kolesnikova,Grigori Sidorov
关键词-EN: leveraging advanced natural, language processing techniques, Regret Detection behaviors, advanced natural language, natural language processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named “Prediction statements,” categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.

[AI-11] A hybrid FEM-PINN method for time-dependent partial differential equations

链接: https://arxiv.org/abs/2409.02810
作者: Xiaodong Feng,Haojiong Shangguan,Tao Tang,Xiaoliang Wan,Tao Zhou
关键词-EN: partial differential equations, solving evolution partial, evolution partial differential, deep neural networks, time finite element
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
备注: 25 pages

点击查看摘要

Abstract:In this work, we present a hybrid numerical method for solving evolution partial differential equations (PDEs) by merging the time finite element method with deep neural networks. In contrast to the conventional deep learning-based formulation where the neural network is defined on a spatiotemporal domain, our methodology utilizes finite element basis functions in the time direction where the space-dependent coefficients are defined as the output of a neural network. We then apply the Galerkin or collocation projection in the time direction to obtain a system of PDEs for the space-dependent coefficients which is approximated in the framework of PINN. The advantages of such a hybrid formulation are twofold: statistical errors are avoided for the integral in the time direction, and the neural network’s output can be regarded as a set of reduced spatial basis functions. To further alleviate the difficulties from high dimensionality and low regularity, we have developed an adaptive sampling strategy that refines the training set. More specifically, we use an explicit density model to approximate the distribution induced by the PDE residual and then augment the training set with new time-dependent random samples given by the learned density model. The effectiveness and efficiency of our proposed method have been demonstrated through a series of numerical experiments.
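
摘要的表示思路是:时间方向用有限元基函数展开,空间相关系数由神经网络输出,即 u(x,t) ≈ Σ_i φ_i(t)·c_i(x;θ)。下面是这一表示的极简PyTorch草图(网络结构与时间剖分为假设,不含Galerkin投影与训练部分):

```python
import torch
import torch.nn as nn

class FEMPINN(nn.Module):
    """u(x, t) ≈ Σ_i φ_i(t) · c_i(x; θ):时间用分段线性帽函数,空间系数由 MLP 输出。"""

    def __init__(self, n_time_nodes=11, hidden=64):
        super().__init__()
        self.t_nodes = torch.linspace(0.0, 1.0, n_time_nodes)   # 均匀时间剖分(假设)
        self.coeff_net = nn.Sequential(       # 输入 x,输出每个时间结点上的系数
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, n_time_nodes))

    def hat_basis(self, t):                   # φ_i(t):一维线性有限元帽函数
        h = self.t_nodes[1] - self.t_nodes[0]
        return torch.clamp(1 - (t[:, None] - self.t_nodes[None, :]).abs() / h, min=0)

    def forward(self, x, t):                  # x: (batch, 1),t: (batch,)
        c = self.coeff_net(x)                 # (batch, n_time_nodes)
        phi = self.hat_basis(t)               # (batch, n_time_nodes)
        return (c * phi).sum(dim=1, keepdim=True)

u = FEMPINN()(torch.rand(8, 1), torch.rand(8))
```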

[AI-12] Towards Edge-Based Data Lake Architecture for Intelligent Transportation System

链接: https://arxiv.org/abs/2409.02808
作者: Danilo Fernandes,Douglas L. L. Moura,Gean Santos,Geymerson S. Ramos,Fabiane Queiroz,Andre L. L. Aquino
关键词-EN: rapid urbanization growth, enhance transportation efficiency, efficiency and safety, Intelligent Transportation Systems, rapid urbanization
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:The rapid urbanization growth has underscored the need for innovative solutions to enhance transportation efficiency and safety. Intelligent Transportation Systems (ITS) have emerged as a promising solution in this context. However, analyzing and processing the massive and intricate data generated by ITS presents significant challenges for traditional data processing systems. This work proposes an Edge-based Data Lake Architecture to integrate and analyze the complex data from ITS efficiently. The architecture offers scalability, fault tolerance, and performance, improving decision-making and enhancing innovative services for a more intelligent transportation ecosystem. We demonstrate the effectiveness of the architecture through an analysis of three different use cases: (i) Vehicular Sensor Network, (ii) Mobile Network, and (iii) Driver Identification applications.

[AI-13] Governing dual-use technologies: Case studies of international security agreements and lessons for AI governance

链接: https://arxiv.org/abs/2409.02779
作者: Akash R. Wasil,Peter Barnett,Michael Gerovitch,Roman Hauksson,Tom Reed,Jack William Miller
关键词-EN: reducing global security, global security risks, play an important, important role, role in reducing
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:International AI governance agreements and institutions may play an important role in reducing global security risks from advanced AI. To inform the design of such agreements and institutions, we conducted case studies of historical and contemporary international security agreements. We focused specifically on those arrangements around dual-use technologies, examining agreements in nuclear security, chemical weapons, biosecurity, and export controls. For each agreement, we examined four key areas: (a) purpose, (b) core powers, (c) governance structure, and (d) instances of non-compliance. From these case studies, we extracted lessons for the design of international AI agreements and governance institutions. We discuss the importance of robust verification methods, strategies for balancing power between nations, mechanisms for adapting to rapid technological change, approaches to managing trade-offs between transparency and security, incentives for participation, and effective enforcement mechanisms.

[AI-14] An incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting

链接: https://arxiv.org/abs/2409.02760
作者: Zhuolin Li,Zhen Zhang,Witold Pedrycz
关键词-EN: potentially non-monotonic preferences, enabling decision makers, progressively provide assignment, max-margin optimization-based model, incremental preference elicitation-based
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 22 figures

点击查看摘要

Abstract:This paper introduces a novel incremental preference elicitation-based approach to learning potentially non-monotonic preferences in multi-criteria sorting (MCS) problems, enabling decision makers to progressively provide assignment example preference information. Specifically, we first construct a max-margin optimization-based model to model potentially non-monotonic preferences and inconsistent assignment example preference information in each iteration of the incremental preference elicitation process. Using the optimal objective function value of the max-margin optimization-based model, we devise information amount measurement methods and question selection strategies to pinpoint the most informative alternative in each iteration within the framework of uncertainty sampling in active learning. Once the termination criterion is satisfied, the sorting result for non-reference alternatives can be determined through the use of two optimization models, i.e., the max-margin optimization-based model and the complexity controlling optimization model. Subsequently, two incremental preference elicitation-based algorithms are developed to learn potentially non-monotonic preferences, considering different termination criteria. Ultimately, we apply the proposed approach to a credit rating problem to elucidate the detailed implementation steps, and perform computational experiments on both artificial and real-world data sets to compare the proposed question selection strategies with several benchmark strategies.

[AI-15] Tractable Offline Learning of Regular Decision Processes

链接: https://arxiv.org/abs/2409.02747
作者: Ahana Deb,Roberto Cipollone,Anders Jonsson,Alessandro Ronca,Mohammad Sadegh Talebi
关键词-EN: Regular Decision Processes, called Regular Decision, Decision Processes, studies offline Reinforcement, Regular Decision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: To appear in EWRL 2024

点击查看摘要

Abstract:This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards from the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on L_\infty^p-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS), instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated to each of these techniques, and we validate the approach experimentally.
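
摘要提到用Count-Min Sketch(CMS)代替朴素计数,以固定内存统计长交互历史中的事件频次。下面是CMS这一通用数据结构的标准极简实现(通用示意,与论文算法本身无关):

```python
import hashlib

class CountMinSketch:
    """宽 w、深 d 的计数矩阵:更新 d 个哈希桶,查询取最小值(只会高估,不会低估)。"""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item: str) -> int:
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for obs in ["s1", "s2", "s1", "s1"]:
    cms.add(obs)
print(cms.query("s1"))  # ≥ 3:以固定内存换取可控的高估误差
```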

[AI-16] GET-UP: GEomeTric-aware Depth Estimation with Radar Points UPsampling WACV2025

链接: https://arxiv.org/abs/2409.02720
作者: Huawei Sun,Zixu Wang,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille
关键词-EN: Depth estimation plays, Depth estimation, radar-camera depth estimation, autonomous driving, facilitating a comprehensive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted by WACV 2025

点击查看摘要

Abstract:Depth estimation plays a pivotal role in autonomous driving, facilitating a comprehensive understanding of the vehicle’s 3D surroundings. Radar, with its robustness to adverse weather conditions and capability to measure distances, has drawn significant interest for radar-camera depth estimation. However, existing algorithms process the inherently noisy and sparse radar data by projecting 3D points onto the image plane for pixel-level feature extraction, overlooking the valuable geometric information contained within the radar point cloud. To address this gap, we propose GET-UP, leveraging attention-enhanced Graph Neural Networks (GNN) to exchange and aggregate both 2D and 3D information from radar data. This approach effectively enriches the feature representation by incorporating spatial relationships compared to traditional methods that rely only on 2D feature extraction. Furthermore, we incorporate a point cloud upsampling task to densify the radar point cloud, rectify point positions, and derive additional 3D features under the guidance of lidar data. Finally, we fuse radar and camera features during the decoding phase for depth estimation. We benchmark our proposed GET-UP on the nuScenes dataset, achieving state-of-the-art performance with a 15.3% and 14.7% improvement in MAE and RMSE over the previously best-performing model.

[AI-17] Creating a Gen-AI based Track and Trace Assistant MVP (SuperTracy) for PostNL

链接: https://arxiv.org/abs/2409.02711
作者: Mohammad Reshadati
关键词-EN: Minimal Viable Product, brought a lot, instance to improve, customer service, service and automating
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The developments in the field of generative AI have brought a lot of opportunities for companies, for instance to improve efficiency in customer service and automating tasks. PostNL, the biggest parcel and E-commerce corporation of the Netherlands wants to use generative AI to enhance the communication around track and trace of parcels. During the internship a Minimal Viable Product (MVP) is created to showcase the value of using generative AI technologies, to enhance parcel tracking, analyzing the parcel’s journey and being able to communicate about it in an easy to understand manner. The primary goal was to develop an in-house LLM-based system, reducing dependency on external platforms and establishing the feasibility of a dedicated generative AI team within the company. This multi-agent LLM based system aimed to construct parcel journey stories and identify logistical disruptions with heightened efficiency and accuracy. The research involved deploying a sophisticated AI-driven communication system, employing Retrieval-Augmented Generation (RAG) for enhanced response precision, and optimizing large language models (LLMs) tailored to domain specific tasks. The MVP successfully implemented a multi-agent open-source LLM system, called SuperTracy. SuperTracy is capable of autonomously managing a broad spectrum of user inquiries and improving internal knowledge handling. Results and evaluation demonstrated technological innovation and feasibility, notably in communication about the track and trace of a parcel, which exceeded initial expectations. These advancements highlight the potential of AI-driven solutions in logistics, suggesting many opportunities for further refinement and broader implementation within PostNL operational framework.

[AI-18] Incorporating Like-Minded Peers to Overcome Friend Data Sparsity in Session-Based Social Recommendations

链接: https://arxiv.org/abs/2409.02702
作者: Chunyan An,Yunhan Li,Qiang Yang,Winston K.G. Seah,Zhixu Li,Conghao Yanga
关键词-EN: Session-based Social Recommendation, Session-based Recommendation, leverages social relationships, Session-based Social, target user
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Session-based Social Recommendation (SSR) leverages social relationships within online networks to enhance the performance of Session-based Recommendation (SR). However, existing SSR algorithms often encounter the challenge of “friend data sparsity”. Moreover, significant discrepancies can exist between the purchase preferences of social network friends and those of the target user, reducing the influence of friends relative to the target user’s own preferences. To address these challenges, this paper introduces the concept of “Like-minded Peers” (LMP), representing users whose preferences align with the target user’s current session based on their historical sessions. This is the first work, to our knowledge, that uses LMP to enhance the modeling of social influence in SSR. This approach not only alleviates the problem of friend data sparsity but also effectively incorporates users with similar preferences to the target user. We propose a novel model named Transformer Encoder with Graph Attention Aggregator Recommendation (TEGAARec), which includes the TEGAA module and the GAT-based social aggregation module. The TEGAA module captures and merges both long-term and short-term interests for target users and LMP users. Concurrently, the GAT-based social aggregation module is designed to aggregate the target users’ dynamic interests and social influence in a weighted manner. Extensive experiments on four real-world datasets demonstrate the efficacy and superiority of our proposed model and ablation studies are done to illustrate the contributions of each component in TEGAARec.

[AI-19] Decision Transformer for Enhancing Neural Local Search on the Job Shop Scheduling Problem

链接: https://arxiv.org/abs/2409.02697
作者: Constantin Waubert de Puiseau,Fabian Wolz,Merlin Montag,Jannik Peters,Hasan Tercan,Tobias Meisen
关键词-EN: shop scheduling problem, job shop scheduling, scheduling problem, industry for decades, job shop
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: currently under review for IEEE Transactions on Cybernetics

点击查看摘要

Abstract:The job shop scheduling problem (JSSP) and its solution algorithms have been of enduring interest in both academia and industry for decades. In recent years, machine learning (ML) is playing an increasingly important role in advancing existing and building new heuristic solutions for the JSSP, aiming to find better solutions in shorter computation times. In this paper we build on top of a state-of-the-art deep reinforcement learning (DRL) agent, called Neural Local Search (NLS), which can efficiently and effectively control a large local neighborhood search on the JSSP. In particular, we develop a method for training the decision transformer (DT) algorithm on search trajectories taken by a trained NLS agent to further improve upon the learned decision-making sequences. Our experiments show that the DT successfully learns local search strategies that are different and, in many cases, more effective than those of the NLS agent itself. In terms of the tradeoff between solution quality and acceptable computational time needed for the search, the DT is particularly superior in application scenarios where longer computational times are acceptable. In this case, it makes up for the longer inference times required per search step, which are caused by the larger neural network architecture, through better quality decisions per step. Thereby, the DT achieves state-of-the-art results for solving the JSSP with ML-enhanced search.

[AI-20] The Role of Artificial Intelligence and Machine Learning in Software Testing

链接: https://arxiv.org/abs/2409.02693
作者: Ahmed Ramadan,Husam Yasin,Burhan Pektas
关键词-EN: Artificial Intelligence, Machine Learning, including software development, Software testing, software
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) and Machine Learning (ML) have significantly impacted various industries, including software development. Software testing, a crucial part of the software development lifecycle (SDLC), ensures the quality and reliability of software products. Traditionally, software testing has been a labor-intensive process requiring significant manual effort. However, the advent of AI and ML has transformed this landscape by introducing automation and intelligent decision-making capabilities. AI and ML technologies enhance the efficiency and effectiveness of software testing by automating complex tasks such as test case generation, test execution, and result analysis. These technologies reduce the time required for testing and improve the accuracy of defect detection, ultimately leading to higher quality software. AI can predict potential areas of failure by analyzing historical data and identifying patterns, which allows for more targeted and efficient testing. This paper explores the role of AI and ML in software testing by reviewing existing literature, analyzing current tools and techniques, and presenting case studies that demonstrate the practical benefits of these technologies. The literature review provides a comprehensive overview of the advancements in AI and ML applications in software testing, highlighting key methodologies and findings from various studies. The analysis of current tools showcases the capabilities of popular AI-driven testing tools such as Eggplant AI, this http URL, Selenium, Appvance, Applitools Eyes, Katalon Studio, and Tricentis Tosca, each offering unique features and advantages. Case studies included in this paper illustrate real-world applications of AI and ML in software testing, showing significant improvements in testing efficiency, accuracy, and overall software quality.

[AI-21] LLM-Assisted Visual Analytics: Opportunities and Challenges

链接: https://arxiv.org/abs/2409.02691
作者: Maeve Hutchinson,Radu Jianu,Aidan Slingsby,Pranava Madhyastha
关键词-EN: intuitive natural language, natural language interactions, large language models, visual analytics, explore the integration
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at EG UK Computer Graphics Visual Computing 2024

点击查看摘要

Abstract:We explore the integration of large language models (LLMs) into visual analytics (VA) systems to transform their capabilities through intuitive natural language interactions. We survey current research directions in this emerging field, examining how LLMs are integrated into data management, language interaction, visualisation generation, and language generation processes. We highlight the new possibilities that LLMs bring to VA, especially how they can change VA processes beyond the usual use cases. We especially highlight building new visualisation-language models, allowing access of a breadth of domain knowledge, multimodal interaction, and opportunities with guidance. Finally, we carefully consider the prominent challenges of using current LLMs in VA tasks. Our discussions in this paper aim to guide future researchers working on LLM-assisted VA systems and help them navigate common obstacles when developing these systems.

[AI-22] Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs

链接: https://arxiv.org/abs/2409.02686
作者: Ruoyu Wang,Xiaoxuan Li,Lina Yao
关键词-EN: Large Language Models, Large Language, recent studies reveal, Language Models, demonstrated remarkable efficiency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model’s reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.

[AI-23] RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models

链接: https://arxiv.org/abs/2409.02685
作者: Hyunji Lee,Luca Soldaini,Arman Cohan,Minjoon Seo,Kyle Lo
关键词-EN: Information retrieval methods, methods often rely, MSMARCO, Information retrieval, multiple domain-specific
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
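
The routing idea is easy to sketch: embed the query once, compare it against a per-expert reference embedding, and dispatch to the closest expert. The snippet below is a minimal illustration under the assumption that each expert is summarized by a unit-normalised centroid embedding; the paper's actual routing mechanism may differ in detail, and all names here are hypothetical.

```python
import numpy as np

def route_query(query_emb: np.ndarray, expert_centroids: dict) -> str:
    """Pick the expert whose centroid embedding is most similar to the query.
    `expert_centroids` maps an expert name to a mean embedding of
    representative queries from that expert's domain (unit-normalised)."""
    query_emb = query_emb / np.linalg.norm(query_emb)
    scores = {name: float(query_emb @ c) for name, c in expert_centroids.items()}
    return max(scores, key=scores.get)

# Toy usage: three domain experts with 4-dim embeddings.
rng = np.random.default_rng(0)
centroids = {d: rng.normal(size=4) for d in ("biomed", "finance", "law")}
centroids = {d: c / np.linalg.norm(c) for d, c in centroids.items()}
print(route_query(rng.normal(size=4), centroids))
```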

[AI-24] Neural Networks with LSTM and GRU in Modeling Active Fires in the Amazon

链接: https://arxiv.org/abs/2409.02681
作者: Ramon Tavares
关键词-EN: Gated Recurrent Unit, fire spots detected, detected fire spots, historical time series, time series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 16 pages, in Portuguese language, 24 figures

点击查看摘要

Abstract:This study presents a comprehensive methodology for modeling and forecasting the historical time series of fire spots detected by the AQUA_M-T satellite in the Amazon, Brazil. The approach utilizes a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict monthly accumulations of daily detected fire spots. A summary of the data revealed a consistent seasonality over time, with annual maximum and minimum fire spot values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through rigorous statistical analysis. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the data generalizes well to the test and validation sets, and confirming the convergence of the model parameters. The results indicate that the mixed LSTM and GRU model offers improved accuracy in forecasting 12 months ahead, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in fire spot forecasting. In addition to improving forecast accuracy, the proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new avenues for research and development in machine learning and natural phenomenon prediction. Keywords: Time Series Forecasting, Recurrent Neural Networks, Deep Learning.
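
As a rough illustration of the hybrid architecture described above, the PyTorch sketch below stacks an LSTM layer and a GRU layer before a linear forecasting head. The layer sizes, the single-feature input, and the 24-month window are assumptions for demonstration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MixedLSTMGRU(nn.Module):
    """Illustrative hybrid recurrent model: an LSTM layer feeding a GRU layer,
    ending in a linear head that predicts the next monthly fire-spot count."""
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (batch, months, n_features)
        h, _ = self.lstm(x)
        h, _ = self.gru(h)
        return self.head(h[:, -1])   # forecast for the next month

model = MixedLSTMGRU()
dummy = torch.randn(8, 24, 1)        # 8 series, 24 months of history each
print(model(dummy).shape)            # torch.Size([8, 1])
```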

[AI-25] Independence Constrained Disentangled Representation Learning from Epistemological Perspective

链接: https://arxiv.org/abs/2409.02672
作者: Ruoyu Wang,Lina Yao
关键词-EN: Disentangled Representation Learning, identifies semantically meaningful, Representation Learning aims, Disentangled Representation, Representation Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process. Nevertheless, there is no consensus regarding a universally accepted definition for the objective of disentangled representation learning. In particular, there is a considerable amount of discourse regarding whether the latent variables should be mutually independent. In this paper, we first investigate these arguments on the interrelationships between latent variables by establishing a conceptual bridge between Epistemology and Disentangled Representation Learning. Then, inspired by these interdisciplinary concepts, we introduce a two-level latent space framework to provide a general solution to the prior arguments on this issue. Finally, we propose a novel method for disentangled representation learning by employing an integration of mutual information constraint and independence constraint within the Generative Adversarial Network (GAN) framework. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches in both quantitative and qualitative evaluations. The method exhibits strong performance across multiple commonly used metrics and demonstrates a great capability in disentangling various semantic factors, leading to an improved quality of controllable generation, which consequently benefits the explainability of the algorithm.

[AI-26] Causality-Aware Transformer Networks for Robotic Navigation

链接: https://arxiv.org/abs/2409.02669
作者: Ruoyu Wang,Yao Liu,Yuanjiang Cao,Lina Yao
关键词-EN: developing versatile Embodied, garnered growing interest, Recent advances, machine learning algorithms, Causal Understanding Module
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling, potentially limiting its performance in Embodied AI tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Embodied AI tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Embodied AI. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the model’s Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an End-to-End manner, which enhances the method’s generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks and simulation environments. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings.

[AI-27] PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

链接: https://arxiv.org/abs/2409.02657
作者: Jun Ling,Yiwen Wang,Han Xue,Rong Xie,Li Song
关键词-EN: previous audio-driven talking, head, text prompts, talking head generation, audio
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 7+5 pages, 15 figures

点击查看摘要

Abstract:While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latent from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet, and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low-to-high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate our pose prediction strategy achieves better pose diversity and realness compared to text-only or audio-only, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: this https URL.

[AI-28] OpenFact at CheckThat! 2024: Combining Multiple Attack Methods for Effective Adversarial Text Generation

链接: https://arxiv.org/abs/2409.02649
作者: Włodzimierz Lewoniewski,Piotr Stolarski,Milena Stróżyna,Elzbieta Lewańska,Aleksandra Wojewoda,Ewelina Księżniak,Marcin Sawiński
关键词-EN: paper presents, presents the experiments, Credibility Assessment, credibility assessment issues, Adversarial
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: CLEF 2024 - Conference and Labs of the Evaluation Forum

点击查看摘要

Abstract:This paper presents the experiments and results for the CheckThat! Lab at CLEF 2024 Task 6: Robustness of Credibility Assessment with Adversarial Examples (InCrediblAE). The primary objective of this task was to generate adversarial examples in five problem domains in order to evaluate the robustness of widely used text classification methods (fine-tuned BERT, BiLSTM, and RoBERTa) when applied to credibility assessment issues. This study explores the application of ensemble learning to enhance adversarial attacks on natural language processing (NLP) models. We systematically tested and refined several adversarial attack methods, including BERT-Attack, Genetic algorithms, TextFooler, and CLARE, on five datasets across various misinformation tasks. By developing modified versions of BERT-Attack and hybrid methods, we achieved significant improvements in attack effectiveness. Our results demonstrate the potential of modifying and combining multiple methods to create more sophisticated and effective adversarial attack strategies, contributing to the development of more robust and secure systems.

[AI-29] Evaluating Environments Using Exploratory Agents

链接: https://arxiv.org/abs/2409.02632
作者: Bobby Khaleque,Mike Cook,Jeremy Gow
关键词-EN: key part, Exploration, levels, video games, Abstract
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 9 Pages, 9 figures, 2 tables, work in progress

点击查看摘要

Abstract:Exploration is a key part of many video games. We investigate using an exploratory agent to provide feedback on the design of procedurally generated game levels: 5 engaging levels and 5 unengaging levels. We expand upon a framework introduced in previous research which models motivations for exploration and introduce a fitness function for evaluating an environment’s potential for exploration. Our study showed that our exploratory agent can clearly distinguish between engaging and unengaging levels. The findings suggest that our agent has the potential to serve as an effective tool for assessing procedurally generated levels, in terms of exploration. This work contributes to the growing field of AI-driven game design by offering new insights into how game environments can be evaluated and optimised for player exploration.
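
A fitness function of this kind can be very simple. The toy version below scores a level by the agent's map coverage plus a novelty-per-step bonus; the 0.7/0.3 weighting and the inputs are illustrative assumptions, not the paper's formulation.

```python
def exploration_fitness(visited_cells, level_cells, novelty_events, steps):
    """Toy fitness for a level's exploration potential: how much of the level
    the agent covered, plus a bonus for novel discoveries per step."""
    coverage = len(visited_cells) / max(len(level_cells), 1)
    novelty_rate = novelty_events / max(steps, 1)
    return 0.7 * coverage + 0.3 * novelty_rate

# An engaging level: the agent reaches most cells and keeps finding new things.
print(exploration_fitness(visited_cells=set(range(80)),
                          level_cells=set(range(100)),
                          novelty_events=25, steps=200))
```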

[AI-30] AdvSecureNet: A Python Toolkit for Adversarial Machine Learning

链接: https://arxiv.org/abs/2409.02629
作者: Melih Catal,Manuel Günther
关键词-EN: Machine learning models, adversarial machine learning, models are vulnerable, Machine learning, learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at this https URL and installable via PyPI.

[AI-31] SurgTrack: CAD-Free 3D Tracking of Real-world Surgical Instruments

链接: https://arxiv.org/abs/2409.02598
作者: Wenwu Guo,Jinlin Wu,Zhen Chen,Qingxiang Zhao,Miao Xu,Zhen Lei,Hongbin Liu
关键词-EN: received increasing attention, Vision-based surgical navigation, increasing attention due, vision-based navigation system, tracking surgical instruments
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-based surgical navigation has received increasing attention due to its non-invasive, cost-effective, and flexible advantages. In particular, a critical element of the vision-based navigation system is tracking surgical instruments. Compared with 2D instrument tracking methods, 3D instrument tracking has broader value in clinical practice, but is also more challenging due to weak texture, occlusion, and lack of Computer-Aided Design (CAD) models for 3D registration. To solve these challenges, we propose the SurgTrack, a two-stage 3D instrument tracking method for CAD-free and robust real-world applications. In the first registration stage, we incorporate an Instrument Signed Distance Field (SDF) modeling the 3D representation of instruments, achieving CAD-freed 3D registration. Due to this, we can obtain the location and orientation of instruments in the 3D space by matching the video stream with the registered SDF model. In the second tracking stage, we devise a posture graph optimization module, leveraging the historical tracking results of the posture memory pool to optimize the tracking results and improve the occlusion robustness. Furthermore, we collect the Instrument3D dataset to comprehensively evaluate the 3D tracking of surgical instruments. The extensive experiments validate the superiority and scalability of our SurgTrack, by outperforming the state-of-the-arts with a remarkable improvement. The code and dataset are available at this https URL.

[AI-32] AlignGroup: Learning and Aligning Group Consensus with Member Preferences for Group Recommendation CIKM2024

链接: https://arxiv.org/abs/2409.02580
作者: Jinfeng Xu,Zheyu Chen,Jinze Li,Shuo Yang,Hewei Wang,Edith C.-H. Ngai
关键词-EN: Group, providing personalized recommendations, group consensus, human society, activities are important
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 10 pages, accepted by CIKM 2024

点击查看摘要

Abstract:Group activities are important behaviors in human society; providing personalized recommendations for groups is referred to as the group recommendation task. Existing methods can usually be categorized into two strategies to infer group preferences: 1) determining group preferences by aggregating members’ personalized preferences, and 2) inferring group consensus by capturing group members’ coherent decisions after common compromises. However, the former would suffer from the lack of group-level considerations, and the latter overlooks the fine-grained preferences of individual users. To this end, we propose a novel group recommendation method AlignGroup, which focuses on both group consensus and individual preferences of group members to infer the group decision-making. Specifically, AlignGroup explores group consensus through a well-designed hypergraph neural network that efficiently learns intra- and inter-group relationships. Moreover, AlignGroup innovatively utilizes a self-supervised alignment task to capture fine-grained group decision-making by aligning the group consensus with members’ common preferences. Extensive experiments on two real-world datasets validate that our AlignGroup outperforms the state-of-the-art on both the group recommendation task and the user recommendation task, and also surpasses most baselines in efficiency.

[AI-33] Solving Video Inverse Problems Using Image Diffusion Models

链接: https://arxiv.org/abs/2409.02574
作者: Taesung Kwon,Jong Chul Ye
关键词-EN: including image super-resolution, video inverse problems, image diffusion models, diffusion model-based inverse, inverse problems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 22 pages, 16 figures

点击查看摘要

Abstract:Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: this https URL
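
The batch-consistent sampling trick, synchronizing the stochastic noise across the frames stacked along the batch dimension, can be sketched in a few lines. This is a minimal illustration of the shared-noise idea only, not the authors' full sampler.

```python
import torch

def batch_consistent_noise(n_frames, shape, generator=None):
    """Draw one noise sample and share it across all video frames stacked
    along the batch dimension, so the stochastic component of each reverse
    diffusion step is synchronised between frames."""
    eps = torch.randn(1, *shape, generator=generator)  # a single draw
    return eps.expand(n_frames, *shape)                # broadcast to frames

# 16 frames treated as a batch of 3x64x64 images: identical noise per frame.
noise = batch_consistent_noise(16, (3, 64, 64))
assert torch.equal(noise[0], noise[15])
```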

[AI-34] Advancing Cyber Incident Timeline Analysis Through Rule Based AI and Large Language Models

链接: https://arxiv.org/abs/2409.02572
作者: Fatma Yasmine Loumachi,Mohamed Chahine Ghanem
关键词-EN: Timeline Forensics, Timeline Analysis, correlate events resulting, analysing temporal digital, chronological timeline
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Timeline Analysis (TA) is a key part of Timeline Forensics (TF) in Digital Forensics (DF), focusing primarily on examining and analysing temporal digital artefacts such as timestamps, derived from event logs, file metadata, and other related data to correlate events resulting from cyber incidents and reconstruct their chronological timeline. Traditional tools often struggle to efficiently process the vast volume and variety of data acquired during DF investigations and Incident Response (IR) processes. This paper presents a novel framework, GenDFIR, that combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to advance and automate the TA process. Our approach consists of two main stages: (1) we use R-BAI to identify and select anomalous digital artefacts based on predefined rules; (2) the selected artefacts are then converted into embeddings for processing by an LLM with the help of a Retrieval-Augmented Generation (RAG) agent. The LLM consequently leverages its capabilities to perform automated TA on the artefacts and predict potential incident scenarios. To validate our framework, we evaluate GenDFIR’s performance, efficiency, and reliability using various metrics across synthetic cyber incident simulation scenarios. This paper presents a proof of concept, where the findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This novel approach highlights the power of Generative AI (GenAI), specifically LLMs, and opens new avenues for advanced threat detection and incident reconstruction, representing a significant step forward in the field.
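
Stage (1) of such a pipeline amounts to a rule-based filter over event artefacts. The sketch below uses two made-up rules (after-hours activity, unknown host) purely for illustration; GenDFIR's actual rule set is not specified in the abstract, and all names here are hypothetical.

```python
from datetime import datetime

# Hypothetical rules: flag events outside business hours or from unknown hosts.
KNOWN_HOSTS = {"ws-01", "ws-02", "dc-01"}

def is_anomalous(event: dict) -> bool:
    ts = datetime.fromisoformat(event["timestamp"])
    after_hours = ts.hour < 7 or ts.hour > 19
    unknown_host = event["host"] not in KNOWN_HOSTS
    return after_hours or unknown_host

def select_artefacts(events):
    """Stage 1 of a GenDFIR-style pipeline: keep only rule-flagged artefacts,
    which would then be embedded and handed to an LLM/RAG agent (stage 2)."""
    return [e for e in events if is_anomalous(e)]

events = [
    {"timestamp": "2024-09-01T03:12:00", "host": "ws-01", "msg": "login"},
    {"timestamp": "2024-09-01T10:05:00", "host": "ws-02", "msg": "login"},
    {"timestamp": "2024-09-01T11:30:00", "host": "evil-vm", "msg": "smb write"},
]
print(select_artefacts(events))  # the 03:12 login and the unknown-host event
```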

[AI-35] More is More: Addition Bias in Large Language Models

链接: https://arxiv.org/abs/2409.02569
作者: Luca Santagata,Cristiano De Nobili
关键词-EN: Large Language Models, Language Models, cognitive bias observed, Large Language, drawing a parallel
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 25 pages, 8 figures

点击查看摘要

Abstract:In this paper, we investigate the presence of additive bias in Large Language Models (LLMs), drawing a parallel to the cognitive bias observed in humans where individuals tend to favor additive over subtractive changes. Using a series of controlled experiments, we tested various LLMs, including GPT-3.5 Turbo, Claude 3.5 Sonnet, Mistral, MathΣtral, and Llama 3.1, on tasks designed to measure their propensity for additive versus subtractive modifications. Our findings demonstrate a significant preference for additive changes across all tested models. For example, in a palindrome creation task, Llama 3.1 favored adding letters 97.85% of the time over removing them. Similarly, in a Lego tower balancing task, GPT-3.5 Turbo chose to add a brick 76.38% of the time rather than remove one. In a text summarization task, Mistral 7B produced longer summaries in 59.40% to 75.10% of cases when asked to improve its own or others’ writing. These results indicate that, similar to humans, LLMs exhibit a marked additive bias, which might have implications when LLMs are used on a large scale. Additive bias might increase resource use and environmental impact, leading to higher economic costs due to overconsumption and waste. This bias should be considered in the development and application of LLMs to ensure balanced and efficient problem-solving approaches.
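
Measuring this bias reduces to tallying how often a model's edit grows versus shrinks the text. A crude harness using character length as a proxy for add-versus-remove might look like the sketch below; the metric choice is our assumption, not the paper's exact protocol.

```python
def additive_bias_rate(edits):
    """Given (before, after) text pairs produced by a model, estimate the
    fraction of edits that grew the text rather than shrank it."""
    additive = sum(1 for before, after in edits if len(after) > len(before))
    subtractive = sum(1 for before, after in edits if len(after) < len(before))
    total = additive + subtractive
    return additive / total if total else float("nan")

edits = [("abc", "abcba"), ("racecar", "racecars"), ("level", "eve")]
print(f"{additive_bias_rate(edits):.0%}")  # 67%: two of three edits added text
```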

[AI-36] Vision-Language Navigation with Continual Learning

链接: https://arxiv.org/abs/2409.02561
作者: Zhiyuan Li,Yanfeng Lv,Ziqin Tu,Di Shang,Hong Qiao
关键词-EN: natural language instructions, embedded intelligence, language instructions, critical domain, domain within embedded
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-language navigation (VLN) is a critical domain within embodied intelligence, requiring agents to navigate 3D environments based on natural language instructions. Traditional VLN research has focused on improving environmental understanding and decision accuracy. However, these approaches often exhibit a significant performance gap when agents are deployed in novel environments, mainly due to the limited diversity of training data. Expanding datasets to cover a broader range of environments is impractical and costly. We propose the Vision-Language Navigation with Continual Learning (VLNCL) paradigm to address this challenge. In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge. VLNCL enables agents to maintain an environmental memory and extract relevant knowledge, allowing rapid adaptation to new environments while preserving existing information. We introduce a novel dual-loop scenario replay method (Dual-SR) inspired by brain memory replay mechanisms integrated with VLN agents. This method facilitates consolidating past experiences and enhances generalization across new tasks. By utilizing a multi-scenario memory buffer, the agent efficiently organizes and replays task memories, thereby bolstering its ability to adapt quickly to new environments and mitigating catastrophic forgetting. Our work pioneers continual learning in VLN agents, introducing a novel experimental setup and evaluation metrics. We demonstrate the effectiveness of our approach through extensive evaluations and establish a benchmark for the VLNCL paradigm. Comparative experiments with existing continual learning and VLN methods show significant improvements, achieving state-of-the-art performance in continual learning ability and highlighting the potential of our approach in enabling rapid adaptation while preserving prior knowledge.
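
A multi-scenario memory buffer of the kind described can be sketched as a per-scenario store with mixed-scenario replay. The class below is illustrative only (not the authors' Dual-SR implementation); capacity handling is deliberately simplistic.

```python
import random
from collections import defaultdict

class ScenarioReplayBuffer:
    """Minimal multi-scenario memory: episodes are stored per environment,
    and replay batches mix past scenarios to counter catastrophic forgetting."""
    def __init__(self, per_scenario_cap=100):
        self.buffers = defaultdict(list)
        self.cap = per_scenario_cap

    def add(self, scenario_id, episode):
        buf = self.buffers[scenario_id]
        buf.append(episode)
        if len(buf) > self.cap:
            buf.pop(random.randrange(len(buf)))  # reservoir-style eviction

    def replay_batch(self, k=8):
        episodes = [e for buf in self.buffers.values() for e in buf]
        return random.sample(episodes, min(k, len(episodes)))
```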

[AI-37] Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation

链接: https://arxiv.org/abs/2409.02555
作者: Kangkai Zhang,Shiming Ge,Ruixin Shi,Dan Zeng
关键词-EN: challenging task due, Recognizing objects, challenging task, task due, lack of informative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
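
The "relational" part of such a distillation loss typically means matching the pairwise similarity structure of a batch between teacher and student. The sketch below shows that generic idea (MSE between the two B×B cosine-similarity matrices); the paper's actual contrastive formulation is richer, so treat this as an assumption-laden simplification.

```python
import torch
import torch.nn.functional as F

def relational_distillation_loss(student_feats, teacher_feats):
    """Match the pairwise cosine-similarity structure of a batch between
    student (low-res input) and teacher (high-res input) embeddings.
    Feature dimensions may differ since only B x B relations are compared."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    rel_s = s @ s.t()                       # (B, B) student relations
    rel_t = t @ t.t()                       # (B, B) teacher relations
    return F.mse_loss(rel_s, rel_t)

loss = relational_distillation_loss(torch.randn(16, 128), torch.randn(16, 256))
print(loss.item())
```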

[AI-38] A Sequential Decision-Making Model for Perimeter Identification

链接: https://arxiv.org/abs/2409.02549
作者: Ayal Taitler
关键词-EN: requiring traffic flow, traffic flow monitoring, identification involves ascertaining, area or zone, requiring traffic
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Perimeter identification involves ascertaining the boundaries of a designated area or zone, requiring traffic flow monitoring, control, or optimization. Various methodologies and technologies exist for accurately defining these perimeters; however, they often necessitate specialized equipment, precise mapping, or comprehensive data for effective problem delineation. In this study, we propose a sequential decision-making framework for perimeter search, designed to operate efficiently in real-time and require only publicly accessible information. We conceptualize the perimeter search as a game between a playing agent and an artificial environment, where the agent’s objective is to identify the optimal perimeter by sequentially improving the current perimeter. We detail the model for the game and discuss its adaptability in determining the definition of an optimal perimeter. Ultimately, we showcase the model’s efficacy through a real-world scenario, highlighting the identification of corresponding optimal perimeters.

[AI-39] Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models

链接: https://arxiv.org/abs/2409.02530
作者: Chih-Yuan Li,Jun-Ting Wu,Chan Hsu,Ming-Yen Lin,Yihuang Kang
关键词-EN: Glomerular Filtration Rate, estimated Glomerular Filtration, Filtration Rate, Glomerular Filtration, estimated Glomerular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This preprint version includes corrections of typographical errors related to numerical values in Table 2, which were present in the version published at the BDH workshop in MIPR 2024. These corrections do not affect the overall conclusions of the study

点击查看摘要

Abstract:The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.

[AI-40] Cog-GA: A Large Language Models -based Generative Agent for Vision-Language Navigation in Continuous Environments

链接: https://arxiv.org/abs/2409.02522
作者: Zhiyuan Li,Yanfeng Lu,Yao Mu,Hong Qiao
关键词-EN: Continuous Environments, spaces solely guided, natural language instructions, Vision Language Navigation, Vision Language
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI, demanding agents to navigate freely in unbounded 3D spaces solely guided by natural language instructions. This task introduces distinct challenges in multimodal comprehension, spatial reasoning, and decision-making. To address these challenges, we introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. Firstly, it constructs a cognitive map, integrating temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Secondly, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel scene description, categorizing environmental cues into ‘what’ and ‘where’ streams, as the brain does. This segregation enhances the agent’s attentional focus, enabling it to discern pertinent spatial information for navigation. A reflective mechanism complements these strategies by capturing feedback from prior navigation experiences, facilitating continual learning and adaptive replanning. Extensive evaluations conducted on VLN-CE benchmarks validate Cog-GA’s state-of-the-art performance and ability to simulate human-like navigation behaviors. This research significantly contributes to the development of strategic and interpretable VLN-CE agents.

[AI-41] Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal

链接: https://arxiv.org/abs/2409.02512
作者: Jifeng Hu,Li Shen,Sili Huang,Zhejian Yang,Hechang Chen,Lichao Sun,Yi Chang,Dacheng Tao
关键词-EN: Artificial neural networks, shown remarkable superiority, Artificial neural, neural networks, superiority in gaming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks’ datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent that can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks.

[AI-42] CoAst: Validation-Free Contribution Assessment for Federated Learning based on Cross-Round Valuation

链接: https://arxiv.org/abs/2409.02495
作者: Hao Wu,Likun Zhang,Shucheng Li,Fengyuan Xu,Sheng Zhong
关键词-EN: federated learning, data held, model performance, participant, validation data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant has a higher contribution to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research works in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants’ contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to only count the most important part of model parameters through a weights quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
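
Both ingredients of CoAst, counting only the most important parameters and scoring a client by cross-round similarity, can be illustrated compactly. The sketch below uses a top-k magnitude mask and cosine similarity against later global updates; the masking ratio and the exact similarity choice are our assumptions, not the paper's specification.

```python
import torch

def topk_mask(update: torch.Tensor, ratio=0.01):
    """Keep only the largest-magnitude entries of a parameter update,
    mimicking the 'count only the most important parameters' idea."""
    k = max(1, int(update.numel() * ratio))
    thresh = update.abs().flatten().topk(k).values.min()
    return torch.where(update.abs() >= thresh, update, torch.zeros_like(update))

def cross_round_score(local_update, future_global_updates, ratio=0.01):
    """Validation-free contribution proxy: similarity between a client's
    (masked) update and the global updates of several later rounds."""
    lu = topk_mask(local_update, ratio).flatten()
    sims = [torch.cosine_similarity(lu, topk_mask(g, ratio).flatten(), dim=0)
            for g in future_global_updates]
    return torch.stack(sims).mean().item()

local = torch.randn(1000)
futures = [local + 0.5 * torch.randn(1000) for _ in range(3)]
print(cross_round_score(local, futures))
```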

[AI-43] NeuroSpex: Neuro-Guided Speaker Extraction with Cross-Modal Attention

链接: https://arxiv.org/abs/2409.02489
作者: Dashanka De Silva,Siqi Cai,Saurav Pahuja,Tanja Schultz,Haizhou Li
关键词-EN: elicited neural responses, measurable through electroencephalography, study of auditory, exists a robust, robust correlation
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In the study of auditory attention, it has been revealed that there exists a robust correlation between attended speech and elicited neural responses, measurable through electroencephalography (EEG). Therefore, it is possible to use the attention information available within EEG signals to guide the extraction of the target speaker in a cocktail party computationally. In this paper, we present a neuro-guided speaker extraction model, i.e. NeuroSpex, using the EEG response of the listener as the sole auxiliary reference cue to extract attended speech from monaural speech mixtures. We propose a novel EEG signal encoder that captures the attention information. Additionally, we propose a cross-attention (CA) mechanism to enhance the speech feature representations, generating a speaker extraction mask. Experimental results on a publicly available dataset demonstrate that our proposed model outperforms two baseline models across various evaluation metrics.
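
The cross-attention fusion can be sketched as speech features querying EEG features, with the attended output mapped to a soft extraction mask. Dimensions and the sigmoid mask head below are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Speech features attend to EEG features: queries come from the speech
    mixture, keys/values from the EEG encoding, and the attended output is
    mapped to a per-frame speaker-extraction mask."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, speech_feats, eeg_feats):
        attended, _ = self.attn(query=speech_feats, key=eeg_feats, value=eeg_feats)
        return self.mask_head(attended)   # per-frame extraction mask in [0, 1]

mask = CrossAttentionFusion()(torch.randn(2, 200, 128), torch.randn(2, 50, 128))
print(mask.shape)  # torch.Size([2, 200, 128])
```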

[AI-44] Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization IROS2024

链接: https://arxiv.org/abs/2409.02486
作者: Cho-Ying Wu,Yiqi Zhong,Junying Wang,Ulrich Neumann
关键词-EN: Indoor robots rely, obstacle detection, robots rely, navigation or obstacle, indoor single-image depth
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IROS 2024. The version supersedes 2305.07269 . arXiv admin note: text overlap with arXiv:2305.07269

点击查看摘要

Abstract:Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most indoor single-image depth prediction focuses less on model generalizability to unseen datasets, concerned with in-the-wild robustness for system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the most-studied meta-learning of image classification associated with explicit class labels, no explicit task boundaries exist for continuous depth values tied to highly varying indoor environments regarding object arrangement and scene composition. We propose a fine-grained task definition that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that our method on limited data induces a much better prior (max 27.8% in RMSE). Then, finetuning on meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate higher generalizability induced by our meta-initialization, as a simple and useful plugin to many existing depth estimation methods. The work at the intersection of depth and meta-learning potentially drives both research to step closer to practical robotic and machine perception usage.
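
Treating each RGB-D mini-batch as its own task leads naturally to a MAML-style inner/outer loop. The sketch below (PyTorch 2.0+, using torch.func.functional_call) adapts a copy of the parameters on a batch and backpropagates the post-adaptation loss into the shared initialisation; evaluating the outer loss on the same batch is a simplification of the usual support/query split, and the learning rates are placeholders.

```python
import torch

def maml_step(model, loss_fn, task_batches, inner_lr=1e-3, meta_lr=1e-4):
    """One meta-update where each RGB-D mini-batch acts as its own 'task'."""
    meta_opt = torch.optim.SGD(model.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for x, y in task_batches:
        params = {n: p for n, p in model.named_parameters()}
        # Inner loop: one gradient step on this task's batch.
        loss = loss_fn(torch.func.functional_call(model, params, (x,)), y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer loop: loss of the adapted parameters flows back to the
        # shared initialisation through the inner-step graph.
        meta_loss = loss_fn(torch.func.functional_call(model, adapted, (x,)), y)
        meta_loss.backward()
    meta_opt.step()
```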

[AI-45] Adversarial Attacks on Machine Learning-Aided Visualizations

链接: https://arxiv.org/abs/2409.02485
作者: Takanori Fujiwara,Kostiantyn Kucher,Junpeng Wang,Rafael M. Martins,Andreas Kerren,Anders Ynnerman
关键词-EN: high societal impact, machine learning, techniques to generate, societal impact, field is rapidly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This is the author’s version of the article that has been accepted by the Journal of Visualization

点击查看摘要

Abstract:Research in ML4VIS investigates how to use machine learning (ML) techniques to generate visualizations, and the field is rapidly growing with high societal impact. However, as with any computational pipeline that employs ML processes, ML4VIS approaches are susceptible to a range of ML-specific adversarial attacks. These attacks can manipulate visualization generations, causing analysts to be tricked and their judgments to be impaired. Due to a lack of synthesis from both visualization and ML perspectives, this security aspect is largely overlooked by the current ML4VIS literature. To bridge this gap, we investigate the potential vulnerabilities of ML-aided visualizations from adversarial attacks using a holistic lens of both visualization and ML perspectives. We first identify the attack surface (i.e., attack entry points) that is unique in ML-aided visualizations. We then exemplify five different adversarial attacks. These examples highlight the range of possible attacks when considering the attack surface and multiple different adversary capabilities. Our results show that adversaries can induce various attacks, such as creating arbitrary and deceptive visualizations, by systematically identifying input attributes that are influential in ML inferences. Based on our observations of the attack surface characteristics and the attack examples, we underline the importance of comprehensive studies of security issues and defense mechanisms as a call of urgency for the ML4VIS community.

[AI-46] ASAR: Transferable Attack on Skeletal Action Recognition

链接: https://arxiv.org/abs/2409.02483
作者: Yunfeng Diao,Baiqi Wu,Ruixuan Zhang,Ajian Liu,Xingxing Wei,Meng Wang,He Wang
关键词-EN: Human Activity Recognition, Human Activity, human behaviors, Activity Recognition, Skeletal Action Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2407.08572

点击查看摘要

Abstract:Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothening the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HARs. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.

[AI-47] What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations EMNLP2024

链接: https://arxiv.org/abs/2409.02449
作者: Kavya Manohar,Leena G Pillai
关键词-EN: automatic speech recognition, evaluating multilingual automatic, multilingual automatic speech, ASR models, leading ASR models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Submitted to EMNLP 2024

点击查看摘要

Abstract:This paper explores the pitfalls in evaluating multilingual automatic speech recognition (ASR) models, with a particular focus on Indic language scripts. We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta’s MMS, Seamless, and Assembly AI’s Conformer, and their unintended consequences on performance metrics. Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. We conclude by proposing a shift towards developing normalization routines that leverage native linguistic expertise, ensuring more robust and accurate evaluations of multilingual ASR models.
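
The failure mode is easy to reproduce: a normaliser that strips Unicode combining marks destroys Indic vowel signs and viramas, which carry meaning. The sketch below is a deliberately naive normaliser dropping categories Mn/Mc; real ASR normalisers differ in their details, but can exhibit an analogous problem on Indic scripts.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Naive normalisation that removes Unicode combining marks (Mn/Mc),
    roughly what aggressive ASR text normalisers may do to 'standardise'
    output. Harmless for many Latin texts, destructive for Indic scripts."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd
                   if unicodedata.category(ch) not in ("Mn", "Mc"))

word = "हिन्दी"            # Devanagari: vowel signs and the virama are marks
print(strip_marks(word))   # 'हनद' - the original word is no longer recoverable
```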

[AI-48] Detecting Korean Food Using Image using Hierarchical Model

链接: https://arxiv.org/abs/2409.02448
作者: Hoang Khanh Lam,Kahandakanaththage Maduni Pramuditha Perera
关键词-EN: Korean Food lovers, Korean Food, Food lovers, food before consuming, identify the Korean
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A solution was made available for Korean food lovers with dietary restrictions to identify Korean dishes before consuming them. By simply uploading a clear photo of a dish, users can learn what they are eating. Image processing techniques combined with machine learning made this solution possible.

[AI-49] Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2409.02428
作者: Guanwen Xie,Jingzehua Xu,Yiyuan Yang,Shuai Zhang
关键词-EN: Leveraging large language, large language models, demonstrates significant potential, Leveraging large, functions demonstrates significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one feedback for each requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.

[AI-50] Accelerating Large Language Model Training with Hybrid GPU-based Compression

链接: https://arxiv.org/abs/2409.02423
作者: Lang Xu,Quentin Anthony,Qinghua Zhou,Nawras Alnaasan,Radha R. Gulhane,Aamir Shafi,Hari Subramoni,Dhabaleswar K. Panda
关键词-EN: efficient Large Language, Large Language Model, Large Language, strategies widely adopted, efficient Large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and re-distribute gradients, activations, and other important model information, which pose significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message size significantly, and leverage interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives under the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. Nonetheless, such a strategy ignores the sparsity discrepancy among messages communicated in each parallelism degree, thus introducing more errors and causing degradation in training loss. Therefore, we incorporated hybrid compression settings toward each parallel dimension and adjusted the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients when performing DP All-reduce. We adopt milder compression to preserve precision while communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.

[AI-51] Abstractive Text Summarization: State of the Art Challenges and Improvements

链接: https://arxiv.org/abs/2409.02413
作者: Hassan Shakil,Ahmad Farooq,Jugal Kalita
关键词-EN: Specifically focusing, prospective research directions, opposed to extractive, survey presents, summarization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 Tables, 7 Figures

点击查看摘要

Abstract:Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, and limitations, and charts out future improvements - providing researchers with an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.

[AI-52] Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation

链接: https://arxiv.org/abs/2409.02404
作者: Shiming Ge,Bochao Liu,Pengju Wang,Yong Li,Dan Zeng
关键词-EN: privacy leakage risk, synthetic data, data, practical deployment, proved successful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: This paper is accepted by IEEE Transactions on Image Processing (TIP)

点击查看摘要

Abstract:While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is taking models as bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few of them are fed into the teacher ensemble to query labels via differentially private aggregation, while most of them are embedded to the trained VAE for reconstructing synthetic data. Finally, a semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.

[AI-53] Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts

链接: https://arxiv.org/abs/2409.02390
作者: Jie Su,Fang Cai,Shu-Kuo Zhao,Xin-Yi Wang,Tian-Yi Qian,Da-Hui Wang,Bo Hong
关键词-EN: conducting computational simulations, Uncovering the fundamental, developing mathematical models, fundamental neural correlates, developing mathematical
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Uncovering the fundamental neural correlates of biological intelligence, developing mathematical models, and conducting computational simulations are critical for advancing new paradigms in artificial intelligence (AI). In this study, we implemented a comprehensive visual decision-making model that spans from visual input to behavioral output, using a neural dynamics modeling approach. Drawing inspiration from the key components of the dorsal visual pathway in primates, our model not only aligns closely with human behavior but also reflects neural activities in primates, and achieving accuracy comparable to convolutional neural networks (CNNs). Moreover, magnetic resonance imaging (MRI) identified key neuroimaging features such as structural connections and functional connectivity that are associated with performance in perceptual decision-making tasks. A neuroimaging-informed fine-tuning approach was introduced and applied to the model, leading to performance improvements that paralleled the behavioral variations observed among subjects. Compared to classical deep learning models, our model more accurately replicates the behavioral performance of biological intelligence, relying on the structural characteristics of biological neural networks rather than extensive training data, and demonstrating enhanced resilience to perturbation.

[AI-54] Multi-modal Situated Reasoning in 3D Scenes

链接: https://arxiv.org/abs/2409.02389
作者: Xiongkun Linghu,Jiangyong Huang,Xuesong Niu,Xiaojian Ma,Baoxiong Jia,Siyuan Huang
关键词-EN: embodied AI agents, Situated Question Answering, awareness is essential, Multi-modal Situated, situated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Project page: this https URL

点击查看摘要

Abstract:Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models’ situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.

[AI-55] Large Language Models and Cognitive Science: A Comprehensive Review of Similarities Differences and Challenges

链接: https://arxiv.org/abs/2409.02387
作者: Qian Niu,Junyu Liu,Ziqian Bi,Pohsun Feng,Benji Peng,Keyu Chen
关键词-EN: Large Language Models, Large Language, Language Models, intersection of Large, comprehensive review explores
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:This comprehensive review explores the intersection of Large Language Models (LLMs) and cognitive science, examining similarities and differences between LLMs and human cognitive processes. We analyze methods for evaluating LLMs cognitive abilities and discuss their potential as cognitive models. The review covers applications of LLMs in various cognitive fields, highlighting insights gained for cognitive science research. We assess cognitive biases and limitations of LLMs, along with proposed methods for improving their performance. The integration of LLMs with cognitive architectures is examined, revealing promising avenues for enhancing artificial intelligence (AI) capabilities. Key challenges and future research directions are identified, emphasizing the need for continued refinement of LLMs to better align with human cognition. This review provides a balanced perspective on the current state and future potential of LLMs in advancing our understanding of both artificial and human intelligence.

[AI-56] Coral Model Generation from Single Images for Virtual Reality Applications

链接: https://arxiv.org/abs/2409.02376
作者: Jie Fu(University of the Arts London, Creative Computing Institute, London, United Kingdom),Shun Fu(Bloks Technology Company, Shanghai, China),Mick Grierson(University of the Arts London, Creative Computing Institute, London, United Kingdom)
关键词-EN: rapid development, models, coral, Traditional methods, Traditional methods struggle
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485

点击查看摘要

Abstract:With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive “artworks,” best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information like coral species and habitat, enriching user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers, and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.

[AI-57] Do Large Language Models Possess Sensitive to Sentiment?

链接: https://arxiv.org/abs/2409.02370
作者: Yang Liu,Xichou Zhu,Zhou Shen,Yi Liu,Min Li,Yujun Chen,Benzi John,Zhenzhen Ma,Tao Hu,Zhiyang Xu,Wei Luo,Junhui Wang
关键词-EN: Large Language Models, Large Language, language understanding, Language Models, recently displayed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have recently displayed their extraordinary capabilities in language understanding. However, how to comprehensively assess the sentiment capabilities of LLMs continues to be a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications is on the rise, it becomes highly critical to comprehend their sensitivity to emotional tone, as it can influence the user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate the performance of several prominent LLMs in identifying and responding appropriately to sentiments like positive, negative, and neutral emotions. The models’ outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our discoveries indicate that although LLMs show a basic sensitivity to sentiment, there are substantial variations in their accuracy and consistency, emphasizing the requirement for further enhancements in their training processes to better capture subtle emotional cues. For example, in some cases the models might wrongly classify a strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need to be refined. Another aspect is that different LLMs might perform differently on the same set of data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that contribute to the performance differences and how they can be optimized.
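As a concrete picture of the benchmark comparison described above, the snippet below scores model predictions against human annotations on a three-way sentiment task; the labels and data are invented for illustration.

```python
from sklearn.metrics import classification_report

# Toy comparison of LLM sentiment labels vs. human gold labels.
gold = ["positive", "negative", "neutral", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "positive", "negative"]

print(classification_report(gold, pred,
                            labels=["positive", "negative", "neutral"]))
```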

[AI-58] NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval

链接: https://arxiv.org/abs/2409.02343
作者: Sepanta Zeighami,Zac Wellmer,Aditya Parameswaran
关键词-EN: Nearest Neighbor search, Nearest Neighbor, Retrieval-Augmented Generation, Neighbor search, dense vector embeddings
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract: k-Nearest Neighbor search on dense vector embeddings (k-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of k-NN retrieval. We present a thorough theoretical and experimental study of NUDGE’s non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
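The core idea, directly optimizing the data-record embeddings for retrieval accuracy under a magnitude constraint, can be sketched as a simple gradient loop. This is a toy version under assumed inputs (L2-normalized embeddings, one positive record per training query), not the paper's constrained solver.

```python
import torch
import torch.nn.functional as F

def nudge_embeddings(data_emb, query_emb, pos_idx, lr=0.1, steps=100, max_delta=0.1):
    # data_emb:  (n, d) record embeddings to be adjusted
    # query_emb: (q, d) training-query embeddings, assumed L2-normalized
    # pos_idx:   (q,) LongTensor, index of the ground-truth record per query
    delta = torch.zeros_like(data_emb, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = F.normalize(data_emb + delta, dim=-1)
        scores = query_emb @ emb.T                 # cosine similarity, (q, n)
        loss = F.cross_entropy(scores, pos_idx)    # rank the true record first
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-max_delta, max_delta)    # keep changes modest
    return F.normalize(data_emb + delta, dim=-1).detach()
```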

[AI-59] Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Experts Feedback

链接: https://arxiv.org/abs/2409.02337
作者: Deepak Raina,Mythra V. Balakuntala,Byung Wook Kim,Juan Wachs,Richard Voyles
关键词-EN: intervention and diagnosis, offering non-invasive, real-time imaging, widely employed, employed for clinical
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in IEEE Transactions on Medical Robotics and Bionics (TMRB) 2024

点击查看摘要

Abstract:Ultrasound is widely employed for clinical intervention and diagnosis, due to its advantages of offering non-invasive, radiation-free, and real-time imaging. However, the accessibility of this dexterous procedure is limited due to the substantial training and expertise required of operators. Robotic ultrasound (RUS) offers a viable solution to address this limitation; nonetheless, achieving human-level proficiency remains challenging. Learning from demonstrations (LfD) methods, which learn a policy prior from a dataset of offline demonstrations to encode the mental model of the expert sonographer, have been explored in RUS. However, active engagement of experts, i.e. coaching, during the training of RUS has not been explored thus far. Coaching is known for enhancing efficiency and performance in human training. This paper proposes a coaching framework for RUS to amplify its performance. The framework combines DRL (self-supervised practice) with sparse expert feedback through coaching. The DRL employs an off-policy Soft Actor-Critic (SAC) network, with a reward based on image quality rating. The coaching by experts is modeled as a Partially Observable Markov Decision Process (POMDP), which updates the policy parameters based on the corrections by the expert. The validation study on phantoms showed that coaching increases the learning rate by 25% and the number of high-quality image acquisitions by 74.5%.

[AI-60] Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

链接: https://arxiv.org/abs/2409.02326
作者: Yuxiang Wei,Hojae Han,Rajhans Samdani
关键词-EN: Recent studies, increasingly demonstrating, crucial for effective, data, pretraining
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of “high-quality” remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.
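Phase two's selection step can be pictured with a short sketch: a classifier scores candidate code files and only the top fraction survives into continued pretraining. The model name, truncation length, and keep fraction below are placeholders, since the paper trains its own BERT-style annotator on curated positives.

```python
from transformers import pipeline

# Placeholder scorer; the paper uses a purpose-trained quality annotator,
# not an off-the-shelf checkpoint like this one.
scorer = pipeline("text-classification", model="bert-base-uncased")

def select_high_quality(code_files, keep_fraction=0.1):
    """Keep the top-scoring fraction of code files for continued pretraining."""
    scored = [(f, scorer(f[:512])[0]["score"]) for f in code_files]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [f for f, _ in scored[: int(len(scored) * keep_fraction)]]
```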

[AI-61] TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model ICML2024

链接: https://arxiv.org/abs/2409.02322
作者: Defu Cao,Wen Ye,Yizhou Zhang,Yan Liu
关键词-EN: Large Language Models, building foundation models, time series, foundation models, recent advances
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 23 Pages, 6 Figures, 11 Tables. First present at ICML 2024 Workshop on Foundation Models in the Wild

点击查看摘要

Abstract:With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models has been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated "benchmark" datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution, via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a variety of tasks such as forecasting, imputation, and anomaly detection demonstrate the effectiveness of TimeDiT.

[AI-62] On the Benefits of Memory for Modeling Time-Dependent PDEs

链接: https://arxiv.org/abs/2409.02313
作者: Ricardo Buitrago Ruiz,Tanya Marwah,Albert Gu,Andrej Risteski
关键词-EN: partial differential equations, traditional numerical methods, solving partial differential, Data-driven techniques, differential equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs). These techniques frequently offer a better trade-off between computational cost and accuracy for many PDE families of interest. For time-dependent PDEs, existing methodologies typically treat PDEs as Markovian systems, i.e., the evolution of the system only depends on the "current state" and not the past states. However, distortion of the input signals – e.g., due to discretization or low-pass filtering – can render the evolution of the distorted signals non-Markovian. In this work, motivated by the Mori-Zwanzig theory of model reduction, we investigate the impact of architectures with memory for modeling PDEs: that is, when past states are explicitly used to predict the future. We introduce Memory Neural Operator (MemNO), a network based on the recent SSM architectures and Fourier Neural Operator (FNO). We empirically demonstrate on a variety of PDE families of interest that when the input is given on a low-resolution grid, MemNO significantly outperforms the baselines without memory, achieving more than 6 times less error on unseen PDEs. Via a combination of theory and experiments, we show that the effect of memory is particularly significant when the solution of the PDE has high frequency Fourier components (e.g., low-viscosity fluid dynamics), and it also increases robustness to observation noise.

[AI-63] Initial Development and Evaluation of the Creative Artificial Intelligence through Recurring Developments and Determinations (CAIRDD) System

链接: https://arxiv.org/abs/2409.02291
作者: Jeremy Straub,Zach Johnson
关键词-EN: Computer system creativity, artificial general intelligence, Computer system, AGI, Computer
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Computer system creativity is a key step on the pathway to artificial general intelligence (AGI). It is elusive, however, because human creativity is not fully understood and is thus difficult to develop in software. Large language models (LLMs) provide a facsimile of creativity and the appearance of sentience, while not actually being either creative or sentient. While LLMs have created bona fide new content, sometimes inadvertently (as with harmful hallucinations), their deliberate creativity is seen by some as not matching that of humans. In response to this challenge, this paper proposes a technique for enhancing LLM output creativity via an iterative process of concept injection and refinement. Initial work on the development of the Creative Artificial Intelligence through Recurring Developments and Determinations (CAIRDD) system is presented, and the efficacy of key system components is evaluated.

[AI-64] Biochemical Prostate Cancer Recurrence Prediction: Thinking Fast Slow

链接: https://arxiv.org/abs/2409.02284
作者: Suhang You,Sanyukta Adap,Siddhesh Thakur,Bhakti Baheti,Spyridon Bakas
关键词-EN: patients after prostatectomy, prostate cancer, cancer is essential, essential for prognostic, prognostic monitoring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures, methodology paper for LEOPRARD Challenge

点击查看摘要

Abstract:Time to biochemical recurrence in prostate cancer is essential for prognostic monitoring of the progression of patients after prostatectomy, which assesses the efficacy of the surgery. In this work, we proposed to leverage multiple instance learning through a two-stage "thinking fast & slow" strategy for time to recurrence (TTR) prediction. The first ("thinking fast") stage finds the most relevant WSI area for biochemical recurrence, and the second ("thinking slow") stage leverages higher resolution patches to predict TTR. Our approach reveals a mean C-index (Ci) of 0.733 (θ = 0.059) on our internal validation and Ci = 0.603 on the LEOPARD challenge validation set. Post hoc attention visualization shows that the most attentive area contributes to the TTR prediction.
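For reference, the reported metric can be computed with a simplified concordance index over uncensored samples; real survival evaluations additionally handle censoring and tied risk scores, which this sketch ignores.

```python
import numpy as np

def concordance_index(times, risk_scores):
    """Fraction of comparable pairs where the higher-risk patient recurs
    earlier; a simplified C-index over uncensored samples only."""
    times, risk = np.asarray(times), np.asarray(risk_scores)
    concordant, comparable = 0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j]:          # patient i recurred first
                comparable += 1
                concordant += risk[i] > risk[j]
    return concordant / comparable
```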

[AI-65] Reinforcement Learning-enabled Satellite Constellation Reconfiguration and Retasking for Mission-Critical Applications

链接: https://arxiv.org/abs/2409.02270
作者: Hassan El Alami,Danda B. Rawat
关键词-EN: reduced operational costs, increasing user demands, rapidly advancing due, satellite constellation applications, user demands
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted for publication in the IEEE Military Communications Conference (IEEE MILCOM 2024)

点击查看摘要

Abstract:The development of satellite constellation applications is rapidly advancing due to increasing user demands, reduced operational costs, and technological advancements. However, a significant gap in the existing literature concerns reconfiguration and retasking issues within satellite constellations, which is the primary focus of our research. In this work, we critically assess the impact of satellite failures on constellation performance and the associated task requirements. To facilitate this analysis, we introduce a system modeling approach for GPS satellite constellations, enabling an investigation into performance dynamics and task distribution strategies, particularly in scenarios where satellite failures occur during mission-critical operations. Additionally, we introduce reinforcement learning (RL) techniques, specifically Q-learning, Policy Gradient, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO), for managing satellite constellations, addressing the challenges posed by reconfiguration and retasking following satellite failures. Our results demonstrate that DQN and PPO achieve effective outcomes in terms of average rewards, task completion rates, and response times.
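Of the RL techniques listed, tabular Q-learning is the simplest to sketch. The toy loop below shows the update rule on a placeholder environment; the state/action spaces and reward function are invented, not the paper's GPS constellation model.

```python
import numpy as np

n_states, n_actions = 50, 6
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1   # learning rate, discount, exploration

def step(state, action):
    """Hypothetical environment standing in for the constellation simulator."""
    next_state = np.random.randint(n_states)
    reward = -1.0 if action == 0 else np.random.rand()
    return next_state, reward

state = 0
for _ in range(10_000):
    # epsilon-greedy action selection
    action = np.random.randint(n_actions) if np.random.rand() < eps else Q[state].argmax()
    next_state, reward = step(state, action)
    # standard Q-learning temporal-difference update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```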

[AI-66] Action-Based ADHD Diagnosis in Video

链接: https://arxiv.org/abs/2409.02261
作者: Yichun Li,Yuxing Yang,Syed Nohsen Naqvi
关键词-EN: Attention Deficit Hyperactivity, Deficit Hyperactivity Disorder, Attention Deficit, Hyperactivity Disorder, Deficit Hyperactivity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 31st European Symposium on Artificial Neural Networks

点击查看摘要

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) causes significant impairment in various domains. Early diagnosis and treatment of ADHD could significantly improve quality of life and functioning. Recently, machine learning methods have improved the accuracy and efficiency of the ADHD diagnosis process. However, the cost of the equipment and trained staff required by the existing methods is generally high. Therefore, we introduce a video-based frame-level action recognition network to ADHD diagnosis for the first time. We also record a real multi-modal ADHD dataset and extract three action classes from the video modality for ADHD diagnosis. The full process data have been reported to the CNTW-NHS Foundation Trust; they will be reviewed by medical consultants/professionals and made public in due course.

[AI-67] NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise

链接: https://arxiv.org/abs/2409.02251
作者: Abdullah Arafat Miah,Kaan Icer,Resit Sendag,Yu Bi
关键词-EN: deep learning development, Backdoor attacks pose, learning development, backdoor attack, pose a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing trigger patterns in both visible and invisible to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at this https URL.
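The trigger family is easy to illustrate: white Gaussian noise whose variance is set by a chosen power spectral density, stamped onto an input image. The PSD-to-variance mapping below assumes unit bandwidth and pixel values in [0, 1], and is illustrative rather than the paper's exact construction.

```python
import numpy as np

def wgn_trigger(image, psd, rng=None):
    """Stamp an image with white Gaussian noise of a chosen PSD.

    For white noise with a flat spectrum and unit bandwidth, the variance
    per sample equals the PSD, so std = sqrt(psd).
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, np.sqrt(psd), size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# In the paper's multi-targeted design, different PSD values could index
# different target classes for the same victim input.
```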

[AI-68] FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation WWW INTERSPEECH2024 FAST

链接: https://arxiv.org/abs/2409.02245
作者: Takuhiro Kaneko,Hirokazu Kameoka,Kou Tanaka,Yuto Kondo
关键词-EN: Diffusion-based voice conversion, voice conversion, speaker similarity, VoiceGrad have attracted, attracted interest
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
*备注: Accepted to Interspeech 2024. Project page: this https URL

点击查看摘要

Abstract:Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of the multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the ability of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at this https URL.

[AI-69] Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR

链接: https://arxiv.org/abs/2409.02239
作者: Xugang Lu,Peng Shen,Yu Tsao,Hisashi Kawai
关键词-EN: automatic speech recognition, knowledge transfer, speech recognition, linguistic knowledge transfer, knowledge
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
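A rough way to picture temporal-order-preserving coupling is entropic (Sinkhorn) optimal transport with a prior that penalizes mass far from the diagonal, so that neighboring acoustic frames map to neighboring linguistic positions. The sketch below is a loose emulation under that assumption, not the TOT-CAKT algorithm itself.

```python
import numpy as np

def sinkhorn_temporal(cost, lam=0.1, band=0.2, iters=200):
    """Entropic OT between an acoustic sequence (length m) and a linguistic
    sequence (length n), with a quadratic off-diagonal penalty that loosely
    emulates temporal-order preservation."""
    m, n = cost.shape
    i = np.arange(m)[:, None] / m
    j = np.arange(n)[None, :] / n
    prior = (i - j) ** 2 / (2.0 * band ** 2)   # penalty grows off-diagonal
    K = np.exp(-(cost + prior) / lam)
    a, b = np.ones(m) / m, np.ones(n) / n      # uniform marginals
    v = np.ones(n)
    for _ in range(iters):                     # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]         # soft alignment plan
```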

[AI-70] A+AI: Threats to Society, Remedies, and Governance

链接: https://arxiv.org/abs/2409.02219
作者: Don Byrd
关键词-EN: Artificial Intelligence, threats
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 25 pages

点击查看摘要

Abstract:This document focuses on the threats, especially near-term threats, that Artificial Intelligence (AI) brings to society. Most of the threats discussed here can result from any algorithmic process, not just AI; in addition, defining AI is notoriously difficult. For both reasons, it is important to think of “A+AI”: Algorithms and Artificial Intelligence. In addition to the threats, this paper discusses countermeasures to them, and it includes a table showing which countermeasures are likely to mitigate which threats. Thoughtful governance could manage the risks without seriously impeding progress; in fact, chances are it would accelerate progress by reducing the social chaos that would otherwise be likely. The paper lists specific actions government should take as soon as possible, namely: * Require all social media platforms accessible in the U.S. to offer users verification that their accounts are owned by citizens, and to display every account’s verification status * Establish regulations to require that all products created or significantly modified with A+AI be clearly labeled as such; to restrict use of generative AI to create likenesses of persons; and to require creators of generative AI software to disclose materials used to train their software and to compensate the creators of any copyrighted material used * Fund a crash project of research on mitigating the threats * Fund educational campaigns to raise awareness of the threats

[AI-71] Fair Railway Network Design

链接: https://arxiv.org/abs/2409.02152
作者: Zixu He,Sirin Botan,Jérôme Lang,Abdallah Saffidine,Florian Sikora,Silas Workman
关键词-EN: public transportation network, designing a public, public transportation, minimise the sum, transportation network
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 32 pages, 18 figures

点击查看摘要

Abstract:When designing a public transportation network in a country, one may want to minimise the sum of travel duration of all inhabitants. This corresponds to a purely utilitarian view and does not involve any fairness consideration, as the resulting network will typically benefit the capital city and/or large central cities while leaving some peripheral cities behind. On the other hand, a more egalitarian view will allow some people to travel between peripheral cities without having to go through a central city. We define a model, propose algorithms for computing solution networks, and report on experiments based on real data.
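The purely utilitarian objective the abstract starts from is simple to state in code: total travel duration weighted by origin-destination demand. A sketch with networkx, where the graph, the `duration` edge attribute, and the demand dictionary are assumed inputs (and every demanded pair is assumed reachable):

```python
import networkx as nx

def utilitarian_cost(G, demand):
    """Sum of shortest-path travel durations weighted by how many people
    travel each origin-destination pair; `demand` maps (u, v) -> population."""
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="duration"))
    return sum(pop * dist[u][v] for (u, v), pop in demand.items())
```

An egalitarian variant would instead minimize, e.g., the worst-off pair's duration, which is the trade-off the paper formalizes.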

[AI-72] Optimal Power Grid Operations with Foundation Models

链接: https://arxiv.org/abs/2409.02148
作者: Alban Puech,Jonas Weiss,Thomas Brunschwiler,Hendrik F. Hamann
关键词-EN: renewable energy sources, integrating numerous distributed, demands integrating numerous, energy transition, renewable energy
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The energy transition, crucial for tackling the climate crisis, demands integrating numerous distributed, renewable energy sources into existing grids. Along with climate change and consumer behavioral changes, this leads to changes and variability in generation and load patterns, introducing significant complexity and uncertainty into grid planning and operations. While the industry has already started to exploit AI to overcome computational challenges of established grid simulation tools, we propose the use of AI Foundation Models (FMs) and advances in Graph Neural Networks to efficiently exploit poorly available grid data for different downstream tasks, enhancing grid operations. For capturing the grid’s underlying physics, we believe that building a self-supervised model learning the power flow dynamics is a critical first step towards developing an FM for the power grid. We show how this approach may close the gap between the industry needs and current grid analysis capabilities, to bring the industry closer to optimal grid operation and planning.

[AI-73] A Multimodal Object-level Contrast Learning Method for Cancer Survival Risk Prediction

链接: https://arxiv.org/abs/2409.02145
作者: Zekang Yang,Hong Liu,Xiangdong Wang
关键词-EN: Computer-aided cancer survival, survival risk, survival risk prediction, cancer survival risk, Computer-aided cancer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Computer-aided cancer survival risk prediction plays an important role in the timely treatment of patients. This is a challenging weakly supervised ordinal regression task involving multiple clinical factors, such as pathological images and genomic data. In this paper, we propose a new training method, multimodal object-level contrast learning, for cancer survival risk prediction. First, we construct contrast learning pairs based on the survival risk relationship among the samples in the training sample set. Then we introduce the object-level contrast learning method to train the survival risk predictor. We further extend it to the multimodal scenario by applying cross-modal contrast. Considering the heterogeneity of pathological images and genomic data, we construct a multimodal survival risk predictor employing attention-based and self-normalizing neural networks, respectively. Finally, the survival risk predictor trained by our proposed method outperforms state-of-the-art methods on two public multimodal cancer datasets for survival risk prediction.
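Constructing contrast pairs from the ordinal risk relationship among training samples might look like the sketch below, where equal-risk samples form positive pairs and different-risk samples form negatives; this pairing rule is a guess for illustration, not the paper's exact construction.

```python
def risk_ordered_pairs(risk_labels):
    """Build object-level contrast pairs from ordinal survival-risk labels:
    same-risk samples become positive pairs, different-risk samples negative."""
    pos, neg = [], []
    n = len(risk_labels)
    for i in range(n):
        for j in range(i + 1, n):
            (pos if risk_labels[i] == risk_labels[j] else neg).append((i, j))
    return pos, neg

positives, negatives = risk_ordered_pairs([0, 1, 1, 2, 0])
```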

[AI-74] Efficient and Scalable Estimation of Tool Representations in Vector Space

链接: https://arxiv.org/abs/2409.02141
作者: Suhong Moon,Siddharth Jha,Lutfi Eren Erdogan,Sehoon Kim,Woosang Lim,Kurt Keutzer,Amir Gholami
关键词-EN: execute complex tasks, external information sources, Recent advancements, tool retrieval, complex tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address those challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at this https URL
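The MLC framing is straightforward to sketch: a small encoder followed by one logit per tool, trained against a multi-hot target. The encoder, dimensions, and training details below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToolRetrieverMLC(nn.Module):
    """Tool retrieval framed as multi-label classification over a fixed
    tool vocabulary (sketch of the MLC idea)."""
    def __init__(self, encoder, hidden_dim, num_tools):
        super().__init__()
        self.encoder = encoder                 # small encoder model (placeholder)
        self.head = nn.Linear(hidden_dim, num_tools)

    def forward(self, query_tokens):
        h = self.encoder(query_tokens)         # (batch, hidden_dim)
        return self.head(h)                    # one logit per tool

# Training would use nn.BCEWithLogitsLoss against multi-hot tool labels;
# at inference, the top-K logits give the retrieved tools.
```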

[AI-75] Self-Supervised Learning for Identifying Defects in Sewer Footage ICML2024

链接: https://arxiv.org/abs/2409.02140
作者: Daniel Otero,Rafael Mateus
关键词-EN: expensive modern investments, modern investments requiring, investments requiring time-intensive, requiring time-intensive manual, Sewerage infrastructure
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Poster at the LatinX in AI Workshop @ ICML 2024

点击查看摘要

Abstract:Sewerage infrastructure is among the most expensive modern investments requiring time-intensive manual inspections by qualified personnel. Our study addresses the need for automated solutions without relying on large amounts of labeled data. We propose a novel application of Self-Supervised Learning (SSL) for sewer inspection that offers a scalable and cost-effective solution for defect detection. We achieve competitive results with a model that is at least 5 times smaller than other approaches found in the literature and obtain competitive performance with 10% of the available data when training with a larger architecture. Our findings highlight the potential of SSL to revolutionize sewer maintenance in resource-limited settings.

[AI-76] he Role of Transformer Models in Advancing Blockchain Technology: A Systematic Review

链接: https://arxiv.org/abs/2409.02139
作者: Tianxu Liu,Yanbin Wang,Jianguo Sun,Ye Tian,Yanyu Huang,Tao Xue,Peiyue Li,Yiwei Liu
关键词-EN: architectures,have shown unprecedented, technology rapidly evolves, scalability grows.Transformer models, shown unprecedented potential, deep learning architectures,have
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As blockchain technology rapidly evolves, the demand for enhanced efficiency, security, and scalability grows. Transformer models, as powerful deep learning architectures, have shown unprecedented potential in addressing various blockchain challenges. However, a systematic review of Transformer applications in blockchain is lacking. This paper aims to fill this research gap by surveying over 200 relevant papers, comprehensively reviewing practical cases and research progress of Transformers in blockchain applications. Our survey covers key areas including anomaly detection, smart contract security analysis, cryptocurrency prediction and trend analysis, and code summary generation. To clearly articulate the advancements of Transformers across various blockchain domains, we adopt a domain-oriented classification system, organizing and introducing representative methods based on major challenges in current blockchain research. For each research domain, we first introduce its background and objectives, then review previous representative methods and analyze their limitations, and finally introduce the advancements brought by Transformer models. Furthermore, we explore the challenges of utilizing Transformers, such as data privacy, model complexity, and real-time processing requirements. Finally, this article proposes future research directions, emphasizing the importance of exploring the Transformer architecture in depth to adapt it to specific blockchain applications, and discusses its potential role in promoting the development of blockchain technology. This review aims to provide new perspectives and a research foundation for the integrated development of blockchain technology and machine learning, supporting further innovation and application expansion of blockchain technology.

[AI-77] A Financial Time Series Denoiser Based on Diffusion Model

链接: https://arxiv.org/abs/2409.02138
作者: Zhuohan Wang,Carmine Ventre
关键词-EN: ultimately decision making, posing significant challenges, Financial time series, accurate data interpretation, exhibit low
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:Financial time series often exhibit low signal-to-noise ratio, posing significant challenges for accurate data interpretation and prediction and ultimately decision making. Generative models have gained attention as powerful tools for simulating and predicting intricate data patterns, with the diffusion model emerging as a particularly effective method. This paper introduces a novel approach utilizing the diffusion model as a denoiser for financial time series in order to improve data predictability and trading performance. By leveraging the forward and reverse processes of the conditional diffusion model to add and remove noise progressively, we reconstruct original data from noisy inputs. Our extensive experiments demonstrate that diffusion model-based denoised time series significantly enhance the performance on downstream future return classification tasks. Moreover, trading signals derived from the denoised data yield more profitable trades with fewer transactions, thereby minimizing transaction costs and increasing overall trading efficiency. Finally, we show that by using classifiers trained on denoised time series, we can recognize the noising state of the market and obtain excess return.
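The forward (noising) half of such a pipeline follows standard DDPM algebra; the sketch below corrupts a batch of series given a precomputed cumulative-alpha schedule, as one plausible building block of the described denoiser rather than the paper's exact model.

```python
import torch

def forward_noise(x0, t, alphas_cumprod):
    """Standard DDPM forward corruption:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.

    x0: (batch, length) clean series; t: (batch,) LongTensor of timesteps;
    alphas_cumprod: 1-D tensor of cumulative products of the alpha schedule.
    """
    a_bar = alphas_cumprod[t].view(-1, 1)        # (batch, 1)
    eps = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps

# A learned reverse process would then predict eps from (xt, t) and iterate
# back to an estimate of x0, which serves as the denoised series.
```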

[AI-78] Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data

链接: https://arxiv.org/abs/2409.02136
作者: Mohammadreza Ghaffarzadeh-Esfahani,Mahdi Ghaffarzadeh-Esfahani,Arian Salahi-Niri,Hossein Toreyhi,Zahra Atf,Amirali Mohsenzadeh-Kermani,Mahshad Sarikhani,Zohreh Tajabadi,Fatemeh Shojaeian,Mohammad Hassan Bagheri,Aydin Feyzi,Mohammadamin Tarighatpayma,Narges Gazmeh,Fateme Heydari,Hossein Afshar,Amirreza Allahgholipour,Farid Alimardani,Ameneh Salehi,Naghmeh Asadimanesh,Mohammad Amin Khalafi,Hadis Shabanipour,Ali Moradi,Sajjad Hossein Zadeh,Omid Yazdani,Romina Esbati,Moozhan Maleki,Danial Samiei Nasr,Amirali Soheili,Hossein Majlesi,Saba Shahsavan,Alireza Soheilipour,Nooshin Goudarzi,Erfan Taherifard,Hamidreza Hatamabadi,Jamil S Samaan,Thomas Savage,Ankit Sakhuja,Ali Soroush,Girish Nadkarni,Ilad Alavi Darazam,Mohamad Amin Pourhoseingholi,Seyed Amir Ahmad Safavi-Naini
关键词-EN: large language models, CML models, study aimed, aimed to evaluate, evaluate and compare
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Code is available at: this https URL and this https URL . The datasets are available from the corresponding author on reasonable request (sdamirsa@ymail.com)

点击查看摘要

Abstract:Background: This study aimed to evaluate and compare the performance of classical machine learning models (CMLs) and large language models (LLMs) in predicting mortality associated with COVID-19 by utilizing a high-dimensional tabular dataset. Materials and Methods: We analyzed data from 9,134 COVID-19 patients collected across four hospitals. Seven CML models, including XGBoost and random forest (RF), were trained and evaluated. The structured data was converted into text for zero-shot classification by eight LLMs, including GPT-4 and Mistral-7b. Additionally, Mistral-7b was fine-tuned using the QLoRA approach to enhance its predictive capabilities. Results: Among the CML models, XGBoost and RF achieved the highest accuracy, with F1 scores of 0.87 for internal validation and 0.83 for external validation. In the LLM category, GPT-4 was the top performer with an F1 score of 0.43. Fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, resulting in an F1 score of 0.74, which was stable during external validation. Conclusion: While LLMs show moderate performance in zero-shot classification, fine-tuning can significantly enhance their effectiveness, potentially aligning them closer to CML models. However, CMLs still outperform LLMs in high-dimensional tabular data tasks.
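Zero-shot classification with an LLM requires serializing each tabular record into text. A minimal prompt builder, with invented field names standing in for the study's actual variables, might look like this:

```python
def row_to_prompt(row: dict) -> str:
    """Serialize one tabular patient record into a zero-shot prompt.
    Field names here are hypothetical, not the study's schema."""
    feats = "; ".join(f"{k}: {v}" for k, v in row.items())
    return (
        "Given the following COVID-19 patient record, answer with exactly "
        "'deceased' or 'survived'.\n" + feats + "\nAnswer:"
    )

print(row_to_prompt({"age": 71, "oxygen_saturation": 88, "diabetes": "yes"}))
```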

[AI-79] Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.02134
作者: Samer Francy,Raghubir Singh
关键词-EN: image classification tasks, work evaluates, image classification, classification tasks, compression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work evaluates compression techniques on ConvNeXt models in image classification tasks using the CIFAR-10 dataset. Structured pruning, unstructured pruning, and dynamic quantization methods are evaluated to reduce model size and computational complexity while maintaining accuracy. The experiments, conducted on cloud-based platforms and an edge device, assess the performance of these techniques. Results show significant reductions in model size, with up to a 75% reduction achieved using structured pruning techniques. Additionally, dynamic quantization achieves a reduction of up to 95% in the number of parameters. Fine-tuned models exhibit improved compression performance, indicating the benefits of pre-training in conjunction with compression techniques. Unstructured pruning methods reveal trends in accuracy and compression, with limited reductions in computational complexity. The combination of OTOV3 pruning and dynamic quantization further enhances compression performance, resulting in an 89.7% reduction in size, a 95% reduction in the number of parameters and MACs, and a 3.8% increase in accuracy. The deployment of the final compressed model on the edge device demonstrates high accuracy (92.5%) and low inference time (20 ms), validating the effectiveness of compression techniques for real-world edge computing applications.
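Two of the evaluated techniques map directly onto standard PyTorch APIs. The sketch below applies unstructured L1 magnitude pruning and dynamic int8 quantization to a ConvNeXt; the 50% pruning ratio is illustrative, not the paper's setting.

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import convnext_tiny

model = convnext_tiny(num_classes=10)  # CIFAR-10 has 10 classes

# Unstructured magnitude pruning on every conv/linear layer.
for module in model.modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the pruning mask into the weights

# Dynamic quantization: linear-layer weights stored and computed as int8.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```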

[AI-80] From Predictive Importance to Causality: Which Machine Learning Model Reflects Reality?

链接: https://arxiv.org/abs/2409.02130
作者: Muhammad Arbab Arshad,Pallavi Kandanur,Saurabh Sonawani
关键词-EN: Ames Housing Dataset, analyzes the Ames, Dataset using CatBoost, Ames Housing, Housing Dataset
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study analyzes the Ames Housing Dataset using CatBoost and LightGBM models to explore feature importance and causal relationships in housing price prediction. We examine the correlation between SHAP values and EconML predictions, achieving high accuracy in price forecasting. Our analysis reveals a moderate Spearman rank correlation of 0.48 between SHAP-based feature importance and causally significant features, highlighting the complexity of aligning predictive modeling with causal understanding in housing market analysis. Through extensive causal analysis, including heterogeneity exploration and policy tree interpretation, we provide insights into how specific features like porches impact housing prices across various scenarios. This work underscores the need for integrated approaches that combine predictive power with causal insights in real estate valuation, offering valuable guidance for stakeholders in the industry.
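The reported rank agreement can be reproduced in outline: rank features by mean absolute SHAP value and by estimated causal effect size, then compute Spearman's rho. The arrays below are random placeholders standing in for real SHAP and EconML outputs.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder inputs: per-sample SHAP values (100 samples x 20 features)
# and per-feature causal effect estimates (e.g., from EconML).
shap_values = np.random.randn(100, 20)
causal_effect = np.abs(np.random.randn(20))

shap_importance = np.abs(shap_values).mean(axis=0)   # per-feature importance

rho, pval = spearmanr(shap_importance, causal_effect)
print(f"Spearman rank correlation: {rho:.2f} (p={pval:.3f})")
```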

[AI-81] he Application of Artificial Neural Network Model to Predicting the Acid Mine Drainage from Long-Term Lab Scale Kinetic Test

链接: https://arxiv.org/abs/2409.02128
作者: Muhammad Sonny Abfertiawan,Muchammad Daniyal Kautsar,Faiz Hasan,Yoseph Palinggi,Kris Pranoto
关键词-EN: Acid mine drainage, coal mining industry, lab-scale kinetic tests, lab-scale kinetic, common environmental problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The 7th Environmental Technology and Management Conference (ETMC 2023)

点击查看摘要

Abstract:Acid mine drainage (AMD) is one of the common environmental problems in the coal mining industry, formed by the oxidation of sulfide minerals in the overburden or waste rock. Predicting acid generation from AMD is important for overburden management and planning post-mining land use. One of the methods used to predict AMD is a lab-scale kinetic test, which determines the rate of acid formation over time using representative samples from the field. However, this test requires a lengthy procedure and a large amount of chemical reagents, leading to high costs. On the other hand, machine learning has the potential to learn the patterns behind lab-scale kinetic test data. This study describes an approach using artificial neural network (ANN) modeling to predict the results of lab-scale kinetic tests. Various ANN models are used, based on 83 weeks of lab-scale kinetic tests with 100% potential acid-forming rock. The models cover the monitoring of pH, ORP, conductivity, TDS, sulfate, and heavy metals (Fe and Mn). The overall Nash-Sutcliffe Efficiency (NSE) obtained in this study was 0.99 on training and validation data, indicating a strong correlation and accurate predictions compared to the actual lab-scale kinetic test data. This shows the ANN's ability to learn patterns, trends, and seasonality from past data for accurate forecasting, highlighting its significant contribution to solving AMD problems. This research is also expected to establish the foundation for a new, time-efficient, accurate, and cost-effective approach to predicting AMD in future applications.
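The evaluation metric is worth pinning down: the Nash-Sutcliffe Efficiency compares prediction error against the variance of the observations, with 1.0 indicating a perfect fit.

```python
import numpy as np

def nash_sutcliffe_efficiency(observed, predicted):
    """NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2).
    1.0 is a perfect fit; values <= 0 mean the model is no better
    than predicting the observed mean."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return 1.0 - np.sum((observed - predicted) ** 2) / np.sum(
        (observed - observed.mean()) ** 2
    )
```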

[AI-82] Enabling Trustworthy Federated Learning in Industrial IoT: Bridging the Gap Between Interpretability and Robustness

链接: https://arxiv.org/abs/2409.02127
作者: Senthil Kumar Jagatheesaperumal,Mohamed Rahouti,Ali Alfatemi,Nasir Ghani,Vu Khanh Quy,Abdellah Chehri
关键词-EN: Federated Learning, keeping data localized, machine learning, allowing collaborative model, collaborative model training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Federated Learning (FL) represents a paradigm shift in machine learning, allowing collaborative model training while keeping data localized. This approach is particularly pertinent in the Industrial Internet of Things (IIoT) context, where data privacy, security, and efficient utilization of distributed resources are paramount. The essence of FL in IIoT lies in its ability to learn from diverse, distributed data sources without requiring central data storage, thus enhancing privacy and reducing communication overheads. However, despite its potential, several challenges impede the widespread adoption of FL in IIoT, notably in ensuring interpretability and robustness. This article focuses on enabling trustworthy FL in IIoT by bridging the gap between interpretability and robustness, which is crucial for enhancing trust, improving decision-making, and ensuring compliance with regulations. Moreover, the design strategies summarized in this article ensure that FL systems in IIoT are transparent and reliable, vital in industrial settings where decisions have significant safety and economic impacts. The case studies in the IIoT environment driven by trustworthy FL models are provided, wherein the practical insights of trustworthy communications between IIoT systems and their end users are highlighted.

[AI-83] rajWeaver: Trajectory Recovery with State Propagation Diffusion Model

链接: https://arxiv.org/abs/2409.02124
作者: Jinming Wang,Hai Wang,Hongkai Wen,Geyong Min,Man Luo
关键词-EN: large amount, vehicles and goods, proliferation of location-aware, goods flow, location-aware devices
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: First submission, extended to 10 pages include ref

点击查看摘要

Abstract:With the proliferation of location-aware devices, large amounts of trajectories have been generated as agents such as people, vehicles, and goods flow around the urban environment. These raw trajectories, typically collected from various sources such as GPS in cars, personal mobile devices, and public transport, are often sparse and fragmented due to limited sampling rates, infrastructure coverage, and data loss. In this context, trajectory recovery aims to reconstruct such sparse raw trajectories into their dense and continuous counterparts, so that fine-grained movement of agents across space and time can be captured faithfully. Existing trajectory recovery approaches typically rely on prior knowledge of travel mode or motion patterns, and often fail in densely populated urban areas where accurate maps are absent. In this paper, we present a new recovery framework called TrajWeaver based on probabilistic diffusion models, which is able to recover dense and refined trajectories from the sparse raw ones, conditioned on various auxiliary features such as Areas of Interest along the way, user identity and waybill information. The core of TrajWeaver is a novel State Propagation Diffusion Model (SPDM), which introduces a new state propagation mechanism on top of the standard diffusion models, so that knowledge computed in earlier diffusion steps can be reused later, improving the recovery performance while reducing the number of steps needed. Extensive experiments show that the proposed TrajWeaver can recover from raw trajectories of various lengths, sparsity levels and heterogeneous travel modes, and outperform the state-of-the-art baselines significantly in recovery accuracy. Our code is available at: https://anonymous.4open.science/r/TrajWeaver/

[AI-84] PuYun: Medium-Range Global Weather Forecasting Using Large Kernel Attention Convolutional Networks

链接: https://arxiv.org/abs/2409.02123
作者: Shengchen Zhu,Yiming Chen,Peiying Yu,Xiang Qu,Yuxiao Zhou,Yiming Ma,Zhizhan Zhao,Yukai Liu,Hao Mi,Bin Wang
关键词-EN: mitigating weather-related impacts, weather-related impacts, essential for understanding, understanding and mitigating, mitigating weather-related
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Accurate weather forecasting is essential for understanding and mitigating weather-related impacts. In this paper, we present PuYun, an autoregressive cascade model that leverages large kernel attention convolutional networks. The model's design inherently supports extended weather prediction horizons while broadening the effective receptive field. The integration of large kernel attention mechanisms within the convolutional layers enhances the model's capacity to capture fine-grained spatial details, thereby improving its predictive accuracy for meteorological phenomena. We introduce PuYun, comprising PuYun-Short for 0-5 day forecasts and PuYun-Medium for 5-10 day predictions. This approach enhances the accuracy of 10-day weather forecasting. Through evaluation, we demonstrate that PuYun-Short alone surpasses the performance of both GraphCast and FuXi-Short in generating accurate 10-day forecasts. Specifically, on the 10th day, PuYun-Short reduces the RMSE for Z500 to 720 m^2/s^2, compared to 732 m^2/s^2 for GraphCast and 740 m^2/s^2 for FuXi-Short. Additionally, the RMSE for T2M is reduced to 2.60 K, compared to 2.63 K for GraphCast and 2.65 K for FuXi-Short. Furthermore, when employing a cascaded approach by integrating PuYun-Short and PuYun-Medium, our method achieves superior results compared to the combined performance of FuXi-Short and FuXi-Medium. On the 10th day, the RMSE for Z500 is further reduced to 638 m^2/s^2, compared to 641 m^2/s^2 for FuXi. These findings underscore the effectiveness of our model ensemble in advancing medium-range weather prediction. Our training code and model will be open-sourced.

[AI-85] Deep Knowledge-Infusion For Explainable Depression Detection

链接: https://arxiv.org/abs/2409.02122
作者: Sumit Dalal,Sarika Jain,Mayank Dave
关键词-EN: Discovering individuals depression, Discovering individuals, increasingly important, social media, depression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:Discovering individuals' depression on social media has become increasingly important. Researchers employed ML/DL or lexicon-based methods for automated depression detection. Lexicon-based methods, explainable and easy to implement, match words from user posts in a depression dictionary without considering contexts. While the DL models can leverage contextual information, their black-box nature limits their adoption in the domain. Though surrogate models like LIME and SHAP can produce explanations for DL models, the explanations are suitable for the developer and of limited use to the end user. We propose a Knowledge-infused Neural Network (KiNN) incorporating domain-specific knowledge from DepressionFeature ontology (DFO) in a neural network to endow the model with user-level explainability regarding concepts and processes the clinician understands. Further, commonsense knowledge from the Commonsense Transformer (COMET) trained on ATOMIC is also infused to consider the generic emotional aspects of user posts in depression detection. The model is evaluated on three expertly curated datasets related to depression. We observed the model to have a statistically significant (p < 0.1) boost in performance over the best domain-specific model, MentalBERT, across CLEF e-Risk (25% MCC increase, 12% F1 increase). A similar trend is observed across the PRIMATE dataset, where the proposed model performed better than MentalBERT (2.5% MCC increase, 19% F1 increase). The observations confirm the generated explanations to be informative for MHPs compared to post hoc model explanations. Results demonstrated that the user-level explainability of KiNN also surpasses the performance of baseline models and can provide explanations where other baselines fall short. Infusing the domain and commonsense knowledge in KiNN enhances the ability of models like GPT-3.5 to generate application-relevant explanations.

[AI-86] CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

链接: https://arxiv.org/abs/2409.02119
作者: Xiaojun Xiao,Sen Shen,Qiming Bao,Hongfei Rong,Kairui Liu,Zhongsheng Wang,Jiamou Liu
关键词-EN: large language models, fine-tuning large language, conserving computational resources, fine-tuning large models, constraints is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In fine-tuning large language models (LLMs), conserving computational resources while maintaining effectiveness and improving outcomes within the same computational constraints is crucial. The Low-Rank Adaptation (LoRA) strategy balances efficiency and performance in fine-tuning large models by reducing the number of trainable parameters and computational costs. However, current advancements in LoRA might be focused on its fine-tuning methodologies, with not as much exploration as might be expected into further compression of LoRA. Since most of LoRA's parameters might still be superfluous, this may lead to unnecessary wastage of computational resources. In this paper, we propose CoRA: leveraging shared knowledge to optimize LoRA training by substituting its matrix B with a common subspace from large models. Our two-fold method includes (1) freezing the substitute matrix B to halve parameters while training matrix A for specific tasks and (2) using the substitute matrix B as an enhanced initial state for the original matrix B, achieving improved results with the same parameters. Our experiments show that the first approach achieves the same efficacy as the original LoRA fine-tuning while being more efficient than halving parameters. At the same time, the second approach has some improvements compared to LoRA's original fine-tuning performance. They generally attest to the effectiveness of our work.
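To make the matrix-substitution idea concrete, below is a minimal sketch of a CoRA-style linear layer. The `shared_B` tensor (assumed to be extracted from a common subspace of large-model weights, e.g. via SVD) and all names are illustrative; the abstract does not give implementation details.

```python
import torch
import torch.nn as nn

class CoRALinear(nn.Module):
    """Sketch of CoRA-style low-rank adaptation (illustrative, not the
    authors' code). The frozen base weight W is adapted as W + B @ A, where
    B comes from a shared subspace and is frozen, so only A is trainable."""

    def __init__(self, base: nn.Linear, shared_B: torch.Tensor, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # backbone stays frozen
        out_f, in_f = base.out_features, base.in_features
        assert shared_B.shape == (out_f, rank)
        # Substitute matrix B: shared across tasks and frozen (variant 1).
        self.B = nn.Parameter(shared_B.clone(), requires_grad=False)
        # Task-specific matrix A: the only trainable adaptation parameters.
        self.A = nn.Parameter(torch.zeros(rank, in_f))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```

Variant (2) from the abstract would instead keep B trainable and use `shared_B` only as its initialization.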

[AI-87] TSO: Self-Training with Scaled Preference Optimization

链接: https://arxiv.org/abs/2409.02118
作者: Kaihui Chen,Hao Yi,Qingyang Li,Tianyu Qi,Yulan Hu,Fuzheng Zhang,Yong Liu
关键词-EN: Enhancing the conformity, ongoing research challenge, Preference, large language models, Direct Preference Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Enhancing the conformity of large language models (LLMs) to human preferences remains an ongoing research challenge. Recently, offline approaches such as Direct Preference Optimization (DPO) have gained prominence as attractive options, offering effective improvements that are simple, efficient, and stable without interactions with reward models. However, these offline preference optimization methods highly rely on the quality of pairwise preference samples. Meanwhile, numerous iterative methods require additional training of reward models to select positive and negative samples from the model's own generated responses for preference learning. Furthermore, as LLMs' capabilities advance, it is quite challenging to continuously construct high-quality positive and negative preference instances from the model's outputs due to the lack of diversity. To tackle these challenges, we propose TSO, or Self-Training with Scaled Preference Optimization, a framework for preference optimization that conducts self-training preference learning without training an additional reward model. TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses. Furthermore, TSO introduces corrections for model preference errors through human and AI feedback. Finally, TSO adopts iterative and dual clip reward strategies to update the reference model and its responses, adaptively adjusting preference data and balancing the optimization process. Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks, providing practical insight into preference data construction and model training strategies in the alignment domain.

[AI-88] Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection

链接: https://arxiv.org/abs/2409.02114
作者: Michiel Kamphuis
关键词-EN: toxic content detection, compact transformer-based model, transformer-based model designed, compact transformer-based, designed for toxic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 6 pages

点击查看摘要

Abstract:This paper presents Tiny-toxic-detector, a compact transformer-based model designed for toxic content detection. Despite having only 2.1 million parameters, Tiny-toxic-detector achieves competitive performance on benchmark datasets, with 90.97% accuracy on ToxiGen and 86.98% accuracy on the Jigsaw dataset, rivaling models over 50 times its size. This efficiency enables deployment in resource-constrained environments, addressing the need for effective content moderation tools that balance performance with computational efficiency. The model architecture features 4 transformer encoder layers, each with 2 attention heads, an embedding dimension of 64, and a feedforward dimension of 128. Trained on both public and private datasets, Tiny-toxic-detector demonstrates the potential of efficient, task-specific models for addressing online toxicity. The paper covers the model architecture, training process, performance benchmarks, and limitations, underscoring its suitability for applications such as social media monitoring and content moderation. By achieving results comparable to much larger models while significantly reducing computational demands, Tiny-toxic-detector represents progress toward more sustainable and scalable AI-driven content moderation solutions.
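The abstract fully specifies the architecture's dimensions, so a hedged PyTorch sketch is straightforward; the vocabulary size, tokenizer, and mean-pooling below are assumptions, not details of the released model.

```python
import torch
import torch.nn as nn

class TinyToxicLikeClassifier(nn.Module):
    """Sketch matching the stated dimensions (4 encoder layers, 2 heads,
    d_model=64, FFN dim 128). Vocab size, max length, and pooling are
    assumptions; this is not the released Tiny-toxic-detector."""

    def __init__(self, vocab_size: int = 30522, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.pos = nn.Embedding(max_len, 64)
        layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=2, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(64, 1)            # binary toxic / non-toxic logit

    def forward(self, ids):                     # ids: (batch, seq)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.embed(ids) + self.pos(pos))
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool, then classify
```

With a ~30k-token vocabulary, the embedding table alone accounts for roughly 1.95M parameters, consistent with the stated ~2.1M total for such a compact encoder.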

[AI-89] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI

链接: https://arxiv.org/abs/2409.01392
作者: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
关键词-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.

[AI-90] Conversational Complexity for Assessing Risk in Large Language Models

链接: https://arxiv.org/abs/2409.01247
作者: John Burden,Manuel Cebrian,Jose Hernandez-Orallo
关键词-EN: Large Language Models, Language Models, enable beneficial applications, Large Language, present a dual-use
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) present a dual-use dilemma: they enable beneficial applications while harboring potential for harm, particularly through conversational interactions. Despite various safeguards, advanced LLMs remain vulnerable. A watershed case was Kevin Roose’s notable conversation with Bing, which elicited harmful outputs after extended interaction. This contrasts with simpler early jailbreaks that produced similar content more easily, raising the question: How much conversational effort is needed to elicit harmful information from LLMs? We propose two measures: Conversational Length (CL), which quantifies the conversation length used to obtain a specific response, and Conversational Complexity (CC), defined as the Kolmogorov complexity of the user’s instruction sequence leading to the response. To address the incomputability of Kolmogorov complexity, we approximate CC using a reference LLM to estimate the compressibility of user instructions. Applying this approach to a large red-teaming dataset, we perform a quantitative analysis examining the statistical distribution of harmful and harmless conversational lengths and complexities. Our empirical findings suggest that this distributional analysis and the minimisation of CC serve as valuable tools for understanding AI safety, offering insights into the accessibility of harmful information. This work establishes a foundation for a new perspective on LLM safety, centered around the algorithmic complexity of pathways to harm.
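Since Kolmogorov complexity is incomputable, the paper approximates CC with a reference LLM's estimate of the instruction sequence's compressibility. A minimal sketch of that proxy follows, using GPT-2 as a stand-in reference model; the paper's actual reference LLM and turn formatting are assumptions here.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def conversational_complexity_bits(user_turns):
    """Code length (in bits) of the concatenated user instructions under a
    reference LM: a computable proxy for their Kolmogorov complexity."""
    ids = tok("\n".join(user_turns), return_tensors="pt").input_ids
    nll = lm(ids, labels=ids).loss              # mean NLL in nats per token
    return nll.item() * (ids.size(1) - 1) / math.log(2)

# Longer or less compressible instruction sequences yield a higher CC.
print(conversational_complexity_bits(["Tell me a story.", "Now make it darker."]))
```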

[AI-91] Driver Digital Twin for Online Prediction of Personalized Lane Change Behavior

链接: https://arxiv.org/abs/2211.01294
作者: Xishun Liao,Xuanpeng Zhao,Ziran Wang,Zhouqiao Zhao,Kyungtae Han,Rohit Gupta,Matthew J. Barth,Guoyuan Wu
关键词-EN: foreseeable future, supposed to share, share the road, road with human-driven, Digital Twin
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Connected and automated vehicles (CAVs) are supposed to share the road with human-driven vehicles (HDVs) in the foreseeable future. Therefore, considering the mixed traffic environment is more pragmatic, as the well-planned operation of CAVs may be interrupted by HDVs. In circumstances where human behaviors have significant impacts, CAVs need to understand HDV behaviors to take safe actions. In this study, we develop a Driver Digital Twin (DDT) for the online prediction of personalized lane change behavior, allowing CAVs to predict surrounding vehicles' behaviors with the help of the digital twin technology. DDT is deployed on a vehicle-edge-cloud architecture, where the cloud server models the driver behavior for each HDV based on the historical naturalistic driving data, while the edge server processes the real-time data from each driver with his/her digital twin on the cloud to predict the lane change maneuver. The proposed system is first evaluated on a human-in-the-loop co-simulation platform, and then in a field implementation with three passenger vehicles connected through the 4G/LTE cellular network. The lane change intention can be recognized in 6 seconds on average before the vehicle crosses the lane separation line, and the Mean Euclidean Distance between the predicted trajectory and GPS ground truth is 1.03 meters within a 4-second prediction window. Compared to the general model, using a personalized model can improve prediction accuracy by 27.8%. The demonstration video of the proposed system can be watched at this https URL.

[AI-92] Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

链接: https://arxiv.org/abs/2409.02451
作者: Yisi Liu,Bohan Yu,Drake Lin,Peter Wu,Cheol Jun Cho,Gopala Krishna Anumanchipalli
关键词-EN: vocal tract filter, electromagnetic articulography, trajectories like electromagnetic, vocal tract, tract filter
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: accepted for Spoken Language Technology Workshop 2024

点击查看摘要

Abstract:Articulatory trajectories like electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Therefore, integrating low-dimensional EMA features with DDSP can significantly enhance the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that can synthesize speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics / noise imbalance problem, and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, with an improvement of 1.63% and 0.16 compared to the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference, and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
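For intuition, a stripped-down harmonics-plus-noise synthesizer of the kind DDSP vocoders build on is sketched below. Uniform harmonic amplitudes, white noise, and per-sample F0/loudness inputs are simplifications; the actual model predicts these controls from EMA, F0, and loudness features.

```python
import numpy as np

def harmonic_plus_noise(f0, loudness, sr=16000, n_harmonics=64):
    """Toy harmonics-plus-noise synthesis: f0 and loudness are per-sample
    control signals (assumed shapes); not the paper's learned vocoder."""
    phase = 2 * np.pi * np.cumsum(f0) / sr          # integrate F0 into phase
    k = np.arange(1, n_harmonics + 1)[:, None]      # harmonic numbers
    harm = np.sin(k * phase)                        # (n_harmonics, n_samples)
    harm = np.where(k * f0 < sr / 2, harm, 0.0)     # drop harmonics above Nyquist
    audio = harm.mean(axis=0) + 0.01 * np.random.randn(len(f0))  # noise part
    return loudness * audio                         # apply loudness envelope
```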

[AI-93] Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Translation

链接: https://arxiv.org/abs/2409.02391
作者: Ali Merali
关键词-EN: Large Language Model, Large Language, Language Model, paper derives, empirical relationships
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper derives ‘scaling laws’ – empirical relationships between the amount of training compute used for a Large Language Model (LLM) and its performance – for economic outcomes. In a preregistered experiment, 300 professional translators completed 1800 tasks with access to one of thirteen LLMs with differing model training compute sizes (or a control). Our results show that model scaling substantially raises productivity: for every 10x increase in model compute, translators completed tasks 12.3% quicker, received 0.18 s.d. higher grades, and earned 16.1% more per minute (including bonus payments). Further, the gains from model scaling are much higher for lower-skilled workers who gain a 4x larger improvement in task completion speed. These results imply further frontier model scaling – which is currently estimated at 4x increase per year – may have significant economic implications.
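Assuming the reported relationship is log-linear in compute, the headline numbers compose directly with the ~4x/year frontier-compute growth the abstract mentions; a two-line check:

```python
import math

# 12.3% task-completion speed-up per 10x training compute (from the abstract).
gain_per_decade = 0.123
decades_per_year = math.log10(4)     # 4x/year growth = ~0.602 decades of compute
print(f"{gain_per_decade * decades_per_year:.1%} faster per year")  # ~7.4%
```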

[AI-94] Speech Foundation Model Ensembles for the Controlled Singing Voice Deepfake Detection (CtrSVDD) Challenge 2024

链接: https://arxiv.org/abs/2409.02302
作者: Anmol Guragain,Tianchi Liu,Zihan Pan,Hardik B. Sailor,Qiongqiong Wang
关键词-EN: Voice Deepfake Detection, Controlled Singing Voice, deepfake singing voices, Singing Voice Deepfake, pooled equal error
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
*备注: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:This work details our approach to achieving a leading system with a 1.79% pooled equal error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection (CtrSVDD). The rapid advancement of generative AI models presents significant challenges for detecting AI-generated deepfake singing voices, attracting increased research attention. The Singing Voice Deepfake Detection (SVDD) Challenge 2024 aims to address this complex task. In this work, we explore the ensemble methods, utilizing speech foundation models to develop robust singing voice anti-spoofing systems. We also introduce a novel Squeeze-and-Excitation Aggregation (SEA) method, which efficiently and effectively integrates representation features from the speech foundation models, surpassing the performance of our other individual systems. Evaluation results confirm the efficacy of our approach in detecting deepfake singing voices. The codes can be accessed at this https URL.

计算机视觉

[CV-0] HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

链接: https://arxiv.org/abs/2409.02919
作者: Xinyu Liu,Yingqing He,Lanqing Guo,Xiang Li,Bu Jin,Peng Li,Yan Li,Chi-Min Chan,Qifeng Chen,Wei Xue,Wenhan Luo,Qingfeng Liu,Yike Guo
关键词-EN: pretrained diffusion models, diffusion models, resolution and higher, pretrained diffusion, struggle with issues
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt for generation at multiple scales provides insufficient efficacy. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
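A simplified version of the low/high-frequency split described above can be written with an FFT box filter; the cutoff fraction and filter shape below are assumptions, as the paper's exact decomposition is not given in the abstract.

```python
import torch

def split_frequencies(x: torch.Tensor, cutoff: float = 0.25):
    """Split a latent (B, C, H, W) into low- and high-frequency spatial parts
    with a centered FFT box filter (a stand-in for HiPrompt's decomposition)."""
    B, C, H, W = x.shape
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H, device=x.device),
                            torch.arange(W, device=x.device), indexing="ij")
    keep = ((yy - H // 2).abs() < cutoff * H / 2) & \
           ((xx - W // 2).abs() < cutoff * W / 2)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * keep, dim=(-2, -1))).real
    return low, x - low        # low-frequency part, high-frequency residual
```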

[CV-1] UC-NeRF: Uncertainty-aware Conditional Neural Radiance Fields from Endoscopic Sparse Views

链接: https://arxiv.org/abs/2409.02917
作者: Jiaxin Guo,Jiangliu Wang,Ruofeng Wei,Di Kang,Qi Dou,Yun-hui Liu
关键词-EN: Visualizing surgical scenes, minimally invasive procedures, revealing internal anatomical, internal anatomical structures, Visualizing surgical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Visualizing surgical scenes is crucial for revealing internal anatomical structures during minimally invasive procedures. Novel View Synthesis is a vital technique that offers geometry and appearance reconstruction, enhancing understanding, planning, and decision-making in surgical scenes. Despite the impressive achievements of Neural Radiance Field (NeRF), its direct application to surgical scenes produces unsatisfying results due to two challenges: endoscopic sparse views and significant photometric inconsistencies. In this paper, we propose uncertainty-aware conditional NeRF for novel view synthesis to tackle the severe shape-radiance ambiguity from sparse surgical views. The core of UC-NeRF is to incorporate the multi-view uncertainty estimation to condition the neural radiance field for modeling the severe photometric inconsistencies adaptively. Specifically, our UC-NeRF first builds a consistency learner in the form of multi-view stereo network, to establish the geometric correspondence from sparse views and generate uncertainty estimation and feature priors. In neural rendering, we design a base-adaptive NeRF network to exploit the uncertainty estimation for explicitly handling the photometric inconsistencies. Furthermore, an uncertainty-guided geometry distillation is employed to enhance geometry learning. Experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in rendering appearance and geometry, consistently outperforming the current state-of-the-art approaches. Our code will be released at this https URL.

[CV-2] Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving

链接: https://arxiv.org/abs/2409.02914
作者: Yuhang Lu,Yichen Yao,Jiadong Tu,Jiangnan Shao,Yuexin Ma,Xinge Zhu
关键词-EN: garnered significant attention, recently garnered significant, Large Vision-Language Models, significant attention, recently garnered
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional and safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making, without providing explicit guidance on traffic rules and driving skills, which are critical aspects directly related to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving from theory to practice. In particular, we conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provided extensive analysis. We also fine-tuned popular models, achieving notable performance improvements, which further validate the significance of our dataset. The project page can be found at: this https URL

[CV-3] SITAR: Semi-supervised Image Transformer for Action Recognition ICPR2024

链接: https://arxiv.org/abs/2409.02910
作者: Owais Iqbal,Omprakash Chakraborty,Aftab Hussain,Rameswar Panda,Abir Das
关键词-EN: annotating visual data, Recognizing actions, classified nature, labeled videos remains, limited set
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICPR 2024

点击查看摘要

Abstract:Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to its classified nature. Moreover, handling spatio-temporal data using deep 3D transformers for this can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute-efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image-transformer to generate representations and apply a contrastive loss function to minimize the similarity between representations from different videos while maximizing the representations of identical videos. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
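The super-image construction is a simple rearrangement of sampled frames into a grid; a sketch follows (the grid shape and frame-sampling strategy are choices left open by the abstract).

```python
import torch

def frames_to_super_image(video: torch.Tensor, rows: int, cols: int) -> torch.Tensor:
    """Rearrange T = rows*cols sampled frames (T, C, H, W) into one super
    image (C, rows*H, cols*W), so a 2D image transformer can process the clip."""
    T, C, H, W = video.shape
    assert T == rows * cols
    grid = video.view(rows, cols, C, H, W)     # lay frames out row by row
    grid = grid.permute(2, 0, 3, 1, 4)         # (C, rows, H, cols, W)
    return grid.reshape(C, rows * H, cols * W)
```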

[CV-4] LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

链接: https://arxiv.org/abs/2409.02889
作者: Xidong Wang,Dingjie Song,Shunian Chen,Chen Zhang,Benyou Wang
关键词-EN: Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large, high-resolution image understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 19 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. Especially, it could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.

[CV-5] Multi-stream deep learning framework to predict mild cognitive impairment with Rey Complex Figure Test

链接: https://arxiv.org/abs/2409.02883
作者: Junyoung Park,Eun Hyun Seo,Sunjun Kim,SangHak Yi,Kun Ho Lee,Sungho Won
关键词-EN: Rey Complex Figure, Complex Figure Test, Rey Complex, Complex Figure, Drawing tests
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 20 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Drawing tests like the Rey Complex Figure Test (RCFT) are widely used to assess cognitive functions such as visuospatial skills and memory, making them valuable tools for detecting mild cognitive impairment (MCI). Despite their utility, existing predictive models based on these tests often suffer from limitations like small sample sizes and lack of external validation, which undermine their reliability. We developed a multi-stream deep learning framework that integrates two distinct processing streams: a multi-head self-attention based spatial stream using raw RCFT images and a scoring stream employing a previously developed automated scoring system. Our model was trained on data from 1,740 subjects in the Korean cohort and validated on an external hospital dataset of 222 subjects from Korea. The proposed multi-stream model demonstrated superior performance over baseline models (AUC = 0.872, Accuracy = 0.781) in external validation. The integration of both spatial and scoring streams enables the model to capture intricate visual details from the raw images while also incorporating structured scoring data, which together enhance its ability to detect subtle cognitive impairments. This dual approach not only improves predictive accuracy but also increases the robustness of the model, making it more reliable in diverse clinical settings. Our model has practical implications for clinical settings, where it could serve as a cost-effective tool for early MCI screening.

[CV-6] Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV2024

链接: https://arxiv.org/abs/2409.02882
作者: Guangtao Zheng,Wenqian Ye,Aidong Zhang
关键词-EN: spurious bias, spurious, few-shot classifiers, Few-shot, bias
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at this https URL.
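The attribute-based selection relies on a pre-trained vision-language model scoring candidate spurious attributes for each test image. A hedged sketch with CLIP is below; the prompt template and how the scores drive task construction are assumptions, not FewSTAB's published pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def attribute_scores(images, attributes):
    """Score candidate spurious attributes for each image (list of PIL images)
    with a pre-trained VLM, in the spirit of attribute-based sample selection."""
    prompts = [f"a photo with {a}" for a in attributes]   # assumed template
    inputs = proc(text=prompts, images=images,
                  return_tensors="pt", padding=True)
    return model(**inputs).logits_per_image.softmax(dim=-1)  # (n_images, n_attrs)
```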

[CV-7] The Impact of Balancing Real and Synthetic Data on Accuracy and Fairness in Face Recognition ECCV2024

链接: https://arxiv.org/abs/2409.02867
作者: Andrea Atzori,Pietro Cosseddu,Gianni Fenu,Mirko Marras
关键词-EN: recent years, deep face recognition, advancements in deep, fueled an increasing, increasing demand
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Synthetic Data for Computer Vision Workshop - Side Event at ECCV 2024

点击查看摘要

Abstract:Over the recent years, the advancements in deep face recognition have fueled an increasing demand for large and diverse datasets. Nevertheless, the authentic data acquired to create those datasets is typically sourced from the web, which, in many cases, can lead to significant privacy issues due to the lack of explicit user consent. Furthermore, obtaining a demographically balanced, large dataset is even more difficult because of the natural imbalance in the distribution of images from different demographic groups. In this paper, we investigate the impact of demographically balanced authentic and synthetic data, both individually and in combination, on the accuracy and fairness of face recognition models. Initially, several generative methods were used to balance the demographic representations of the corresponding synthetic datasets. Then a state-of-the-art face encoder was trained and evaluated using (combinations of) synthetic and authentic images. Our findings emphasized two main points: (i) the increased effectiveness of training data generated by diffusion-based models in enhancing accuracy, whether used alone or combined with subsets of authentic data, and (ii) the minimal impact of incorporating balanced data from pre-trained generative methods on fairness (in nearly all tested scenarios using combined datasets, fairness scores remained either unchanged or worsened, even when compared to unbalanced authentic datasets). Source code and data are available at this https URL for reproducibility.

[CV-8] Hybrid-Segmentor: A Hybrid Approach to Automated Fine-Grained Crack Segmentation in Civil Infrastructure

链接: https://arxiv.org/abs/2409.02866
作者: June Moh Goo,Xenios Milidonis,Alessandro Artusi,Jan Boehm,Carlo Ciliberto
关键词-EN: Detecting and segmenting, roads and buildings, crucial for safety, safety and cost-effective, cost-effective maintenance
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: 25 pages, 6 figures

点击查看摘要

Abstract:Detecting and segmenting cracks in infrastructure, such as roads and buildings, is crucial for safety and cost-effective maintenance. In spite of the potential of deep learning, there are challenges in achieving precise results and handling diverse crack types. With the proposed dataset and model, we aim to enhance crack detection and infrastructure maintenance. We introduce Hybrid-Segmentor, an encoder-decoder based approach that is capable of extracting both fine-grained local and global crack features. This allows the model to improve its generalization capabilities in distinguishing various types of shapes, surfaces and sizes of cracks. To keep the computational cost low for practical purposes while maintaining the high generalization capabilities of the model, we incorporate a self-attention model at the encoder level, while reducing the complexity of the decoder component. The proposed model outperforms existing benchmark models across 5 quantitative metrics (accuracy 0.971, precision 0.804, recall 0.744, F1-score 0.770, and IoU score 0.630), achieving state-of-the-art status.

[CV-9] Visually Grounded Speech Models for Low-resource Languages and Cognitive Modelling

链接: https://arxiv.org/abs/2409.02865
作者: Leanne Nortje
关键词-EN: unlabelled speech paired, learn from unlabelled, VGS, visually grounded speech, VGS models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: PhD Dissertation

点击查看摘要

Abstract:This dissertation examines visually grounded speech (VGS) models that learn from unlabelled speech paired with images. It focuses on applications for low-resource languages and understanding human language acquisition. We introduce a task called visually prompted keyword localisation to detect and localise keywords in speech using images. We demonstrate the effectiveness of VGS models in few-shot learning scenarios for low-resource languages like Yoruba. Additionally, we examine the mutual exclusivity bias in VGS models. Our monolingual VGS model exhibits this bias, but we found that multilingualism does not affect the bias in this VGS model similarly to what is observed in children.

[CV-10] Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

链接: https://arxiv.org/abs/2409.02851
作者: Zhibin Liu,Haoye Dong,Aviral Chharia,Hefeng Wu
关键词-EN: plausible unseen parts, requires accurate modeling, single RGB image, Gaussian Splatting, Gaussian Splatting module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 Pages, 8 figures, Project page: this https URL

点击查看摘要

Abstract:Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often face inconsistent view issues, which hinder high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating a 3D human from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into a human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM achieves high-quality 3D humans from a single image, outperforming state-of-the-art methods in both generation quality and quantity. Project page: this https URL

[CV-11] MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling

链接: https://arxiv.org/abs/2409.02846
作者: Jihye Ahn,Hyesong Choi,Soomin Kim,Dongbo Min
关键词-EN: Masked Image Modeling, stereo matching, Image Modeling Distilled, Modeling Distilled Stereo, stereo
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose the Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual challenge of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this, we propose to use an auxiliary network (teacher), updated via Exponential Moving Average (EMA), along with the original stereo model (student), where teacher predictions serve as pseudo supervisory signals to effectively distill knowledge into the student model. State-of-the-art performance is achieved with the proposed method on several stereo matching benchmarks such as ETH3D and KITTI 2015. Additionally, to demonstrate that our model effectively leverages locality inductive bias, we provide the attention distance measurement.
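The teacher here is the standard EMA construction: each teacher parameter drifts slowly toward the student's. As a reference, the update amounts to the following (the momentum value is a typical choice, not taken from the paper):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)
```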

[CV-12] iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation

链接: https://arxiv.org/abs/2409.02838
作者: Hayeon Jo,Hyesong Choi,Minhee Cho,Dongbo Min
关键词-EN: Transfer learning based, models grow exponentially, deep models grow, Transfer learning, grow exponentially
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. To be specific, iCoN generates channel-wise convolutional kernels for each feature and transforms them using an adaptive convolution process to effectively capture task-specific and fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. Also, the proposed method consistently outperforms recent PEFT methods for all the tasks mentioned above.
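A minimal sketch of the iCoN idea follows: a small hypernetwork predicts a per-instance depthwise kernel, applied via a grouped convolution. Layer sizes and the kernel-generator design are illustrative assumptions; the paper's exact module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputConditionedConv(nn.Module):
    """Per-instance channel-wise (depthwise) 3x3 kernels, generated from the
    input itself; a sketch of input-conditioned adaptation, not iCoN's code."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.channels, self.k = channels, k
        self.kernel_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels * k * k))

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        kernels = self.kernel_gen(x).view(B * C, 1, self.k, self.k)
        # Fold the batch into groups so each sample gets its own kernels.
        out = F.conv2d(x.reshape(1, B * C, H, W), kernels,
                       padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)
```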

[CV-13] ExpLLM: Towards Chain of Thought for Facial Expression Recognition

链接: https://arxiv.org/abs/2409.02828
作者: Xing Lan,Jian Xue,Ji Qi,Dongmei Jiang,Ke Lu,Tat-Seng Chua
关键词-EN: Facial expression recognition, critical task, task in multimedia, multimedia with significant, significant implications
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: project page: this https URL

点击查看摘要

Abstract:Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. However, analyzing the causes of facial expressions is essential for accurately recognizing them. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe the AU’s name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. Furthermore, we also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training our ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions where GPT-4o frequently fails.

[CV-14] Deep Learning Meets Satellite Images – An Evaluation on Handcrafted and Learning-based Features for Multi-date Satellite Stereo Images ECCV2024

链接: https://arxiv.org/abs/2409.02825
作者: Shuang Song,Luca Morelli,Xinyi Wu,Rongjun Qin,Hessah Albanwan,Fabio Remondino
关键词-EN: digital surface models, surface models, critical step, digital surface, matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV2024 Workshop - TradiCV

点击查看摘要

Abstract:A critical step in digital surface model (DSM) generation is feature matching. Off-track (or multi-date) satellite stereo images, in particular, can challenge the performance of feature matching due to spectral distortions between images, long baseline, and wide intersection angles. Feature matching methods have evolved over the years from handcrafted methods (e.g., SIFT) to learning-based methods (e.g., SuperPoint and SuperGlue). In this paper, we compare the performance of different features, also known as feature extraction and matching methods, applied to satellite imagery. A wide range of stereo pairs (~500) covering two separate study sites is used. SIFT, as a widely used classic feature extraction and matching algorithm, is compared with seven deep-learning matching methods: SuperGlue, LightGlue, LoFTR, ASpanFormer, DKM, GIM-LightGlue, and GIM-DKM. Results demonstrate that traditional matching methods are still competitive in this age of deep learning, although for particular scenarios learning-based methods are very promising.

[CV-15] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

链接: https://arxiv.org/abs/2409.02813
作者: Xiang Yue,Tianyu Zheng,Yuansheng Ni,Yubo Wang,Kai Zhang,Shengbang Tong,Yuxuan Sun,Ming Yin,Botao Yu,Ge Zhang,Huan Sun,Yu Su,Wenhu Chen,Graham Neubig
关键词-EN: Massive Multi-discipline Multimodal, Multi-discipline Multimodal Understanding, Massive Multi-discipline, paper introduces MMMU-Pro, Multi-discipline Multimodal
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.

[CV-16] UnLearning from Experience to Avoid Spurious Correlations

链接: https://arxiv.org/abs/2409.02792
作者: Jeff Mitchell,Jesús Martínez del Rincón,Niall McLaughlin
关键词-EN: deep neural networks, spurious correlations, networks can achieve, student model, deep neural
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student’s output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.
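The abstract states the mechanism (the teacher consumes the gradient of the student's output with respect to the input) but not the loss. One plausible instantiation, clearly an assumption rather than the authors' objective, penalizes teacher saliency that aligns with the student's:

```python
import torch
import torch.nn.functional as F

def ule_teacher_loss(teacher, student, x, y):
    """Hypothetical ULE-style teacher loss: the student's input-gradient marks
    features behind its (possibly spurious) predictions; the teacher is trained
    for the task while being pushed away from those features."""
    x = x.clone().requires_grad_(True)
    s_grad = torch.autograd.grad(student(x).sum(), x)[0]        # student saliency
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(t_logits.sum(), x, create_graph=True)[0]
    task = F.cross_entropy(t_logits, y)
    # Penalize teacher saliency that aligns with the student's (0.1 arbitrary).
    overlap = F.cosine_similarity(t_grad.flatten(1),
                                  s_grad.detach().flatten(1)).mean()
    return task + 0.1 * overlap.clamp(min=0)
```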

[CV-17] MedUnA: Language guided Unsupervised Adaptation of Vision-Language Models for Medical Image Classification

链接: https://arxiv.org/abs/2409.02729
作者: Umaima Rahman,Raza Imam,Dwarikanath Mahapatra,Boulbaba Ben Amor
关键词-EN: medical image classification, texttt, labeled medical images, unsupervised learning, labeled medical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In medical image classification, supervised learning is challenging due to the lack of labeled medical images. Contrary to the traditional modus operandi of pre-training followed by fine-tuning, this work leverages the visual-textual alignment within Vision-Language models (VLMs) to facilitate unsupervised learning. Specifically, we propose Medical Unsupervised Adaptation (MedUnA), constituting two-stage training: Adapter Pre-training and Unsupervised Learning. In the first stage, we use descriptions generated by a Large Language Model (LLM) corresponding to class labels, which are passed through the text encoder BioBERT. The resulting text embeddings are then aligned with the class labels by training a lightweight adapter. We choose LLMs because of their capability to generate detailed, contextually relevant descriptions to obtain enhanced text embeddings. In the second stage, the trained adapter is integrated with the visual encoder of MedCLIP. This stage employs a contrastive entropy-based loss and prompt tuning to align visual embeddings. We incorporate self-entropy minimization into the overall training objective to ensure more confident embeddings, which are crucial for effective unsupervised learning and alignment. We evaluate the performance of MedUnA on three different kinds of data modalities - chest X-rays, eye fundus and skin lesion images. The results demonstrate significant accuracy gain on average compared to the baselines across different datasets, highlighting the efficacy of our approach.

[CV-18] GET-UP: GEomeTric-aware Depth Estimation with Radar Points UPsampling WACV2025

链接: https://arxiv.org/abs/2409.02720
作者: Huawei Sun,Zixu Wang,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille
关键词-EN: Depth estimation plays, Depth estimation, radar-camera depth estimation, autonomous driving, facilitating a comprehensive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: Accepted by WACV 2025

点击查看摘要

Abstract:Depth estimation plays a pivotal role in autonomous driving, facilitating a comprehensive understanding of the vehicle’s 3D surroundings. Radar, with its robustness to adverse weather conditions and capability to measure distances, has drawn significant interest for radar-camera depth estimation. However, existing algorithms process the inherently noisy and sparse radar data by projecting 3D points onto the image plane for pixel-level feature extraction, overlooking the valuable geometric information contained within the radar point cloud. To address this gap, we propose GET-UP, leveraging attention-enhanced Graph Neural Networks (GNN) to exchange and aggregate both 2D and 3D information from radar data. This approach effectively enriches the feature representation by incorporating spatial relationships compared to traditional methods that rely only on 2D feature extraction. Furthermore, we incorporate a point cloud upsampling task to densify the radar point cloud, rectify point positions, and derive additional 3D features under the guidance of lidar data. Finally, we fuse radar and camera features during the decoding phase for depth estimation. We benchmark our proposed GET-UP on the nuScenes dataset, achieving state-of-the-art performance with a 15.3% and 14.7% improvement in MAE and RMSE over the previously best-performing model.

[CV-19] LIPIDS: Learning-based Illumination Planning In Discretized (Light) Space for Photometric Stereo WACV2025

链接: https://arxiv.org/abs/2409.02716
作者: Ashish Tiwari,Mihir Sutariya,Shanmuganathan Raman
关键词-EN: differently illuminated images, Photometric stereo, obtaining per-pixel surface, stereo, Photometric
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in WACV 2025

点击查看摘要

Abstract:Photometric stereo is a powerful method for obtaining per-pixel surface normals from differently illuminated images of an object. While several methods address photometric stereo with different image (or light) counts ranging from one to two to a hundred, very few focus on learning optimal lighting configuration. Finding an optimal configuration is challenging due to the vast number of possible lighting directions. Moreover, exhaustively sampling all possibilities is impractical due to time and resource constraints. Photometric stereo methods have demonstrated promising performance on existing datasets, which feature limited light directions sparsely sampled from the light space. Therefore, can we optimally utilize these datasets for illumination planning? In this work, we introduce LIPIDS - Learning-based Illumination Planning In Discretized light Space to achieve minimal and optimal lighting configurations for photometric stereo under arbitrary light distribution. We propose a Light Sampling Network (LSNet) that optimizes lighting direction for a fixed number of lights by minimizing the normal loss through a normal regression network. The learned light configurations can directly estimate surface normals during inference, even using an off-the-shelf photometric stereo method. Extensive qualitative and quantitative analyses on synthetic and real-world datasets show that photometric stereo under learned lighting configurations through LIPIDS either surpasses or is nearly comparable to existing illumination planning methods across different photometric stereo backbones.

[CV-20] Recoverable Anonymization for Pose Estimation: A Privacy-Enhancing Approach

链接: https://arxiv.org/abs/2409.02715
作者: Wenjun Huang,Yang Ni,Arghavan Rezvani,SungHeon Jeong,Hanning Chen,Yezi Liu,Fei Wen,Mohsen Imani
关键词-EN: Human pose estimation, Human pose, Human, HPE, SPI
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human pose estimation (HPE) is crucial for various applications. However, deploying HPE algorithms in surveillance contexts raises significant privacy concerns due to the potential leakage of sensitive personal information (SPI) such as facial features, and ethnicity. Existing privacy-enhancing methods often compromise either privacy or performance, or they require costly additional modalities. We propose a novel privacy-enhancing system that generates privacy-enhanced portraits while maintaining high HPE performance. Our key innovations include the reversible recovery of SPI for authorized personnel and the preservation of contextual information. By jointly optimizing a privacy-enhancing module, a privacy recovery module, and a pose estimator, our system ensures robust privacy protection, efficient SPI recovery, and high-performance HPE. Experimental results demonstrate the system’s robust performance in privacy enhancement, SPI recovery, and HPE.

[CV-21] MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning WACV2025

链接: https://arxiv.org/abs/2409.02714
作者: Jiarui Sun,M. Ugur Akcal,Wei Zhang,Girish Chowdhary
关键词-EN: poses significant challenges, visual Reinforcement Learning, extracting informative state, observations poses significant, visual Reinforcement
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: WACV 2025

点击查看摘要

Abstract:In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges on sample efficiency, primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods such as contrastive-based approaches have made strides in improving sample efficiency but fall short in modeling the nuanced evolution of states. To address this, we introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically, we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking, coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations, which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency, demonstrating the effectiveness of our method. Our code is released at this https URL.

[CV-22] CLDA: Collaborative Learning for Enhanced Unsupervised Domain Adaptation

链接: https://arxiv.org/abs/2409.02699
作者: Minhee Cho,Hyesong Choi,Hayeon Jo,Dongbo Min
关键词-EN: Unsupervised Domain Adaptation, labeled source domain, unlabeled target domain, Unsupervised Domain, Domain Adaptation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) endeavors to bridge the gap between a model trained on a labeled source domain and its deployment in an unlabeled target domain. However, current high-performance models demand significant resources, resulting in prohibitive deployment costs and highlighting the need for small yet effective models. For UDA of lightweight models, Knowledge Distillation (KD) in a Teacher-Student framework can be a common approach, but we find that domain shift in UDA leads to a significant increase in non-salient parameters in the teacher model, degrading the model’s generalization ability and transferring misleading information to the student model. Interestingly, we observed that this phenomenon occurs considerably less in the student model. Driven by this insight, we introduce Collaborative Learning, a method that updates the teacher’s non-salient parameters using the student model and at the same time enhances the student’s performance using the updated teacher model. Experiments across various tasks and datasets show consistent performance improvements for both student and teacher models. For example, in semantic segmentation, CLDA achieves an improvement of +0.7% mIoU for the teacher and +1.4% mIoU for the student compared to the baseline model on the GTA-to-Cityscapes benchmark. On Synthia-to-Cityscapes, it achieves an improvement of +0.8% mIoU for the teacher and +2.0% mIoU for the student.
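
The core update is simple enough to illustrate. Below is a hypothetical sketch of copying a teacher's non-salient parameters from the student; saliency is approximated here by weight magnitude, which is an assumption standing in for whatever criterion the paper actually uses.

```python
# Illustrative sketch only (not the paper's code): overwrite the teacher's
# "non-salient" parameters with the student's. Saliency is approximated by
# weight magnitude, which is an assumption.
import torch

@torch.no_grad()
def collaborative_update(teacher, student, quantile=0.2):
    for (name, p_t), (_, p_s) in zip(teacher.named_parameters(),
                                     student.named_parameters()):
        if p_t.numel() < 10:
            continue                                      # skip tiny tensors
        thresh = p_t.abs().flatten().quantile(quantile)   # magnitude threshold
        mask = p_t.abs() < thresh                         # non-salient entries
        p_t[mask] = p_s[mask]                             # copy from student
```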

[CV-23] Rethinking HTG Evaluation: Bridging Generation and Recognition

链接: https://arxiv.org/abs/2409.02683
作者: Konstantina Nikolaidou,George Retsinas,Giorgos Sfikas,Marcus Liwicki
关键词-EN: natural image tasks, extensively studied, HTG, text, Writer Identification models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The evaluation of generative models for natural image tasks has been extensively studied. Similar protocols and metrics are used in cases with unique particularities, such as Handwriting Generation, even if they might not be completely appropriate. In this work, we introduce three measures tailored for HTG evaluation, $\mathrm{HTG}_{\mathrm{HTR}}$, $\mathrm{HTG}_{\mathrm{style}}$, and $\mathrm{HTG}_{\mathrm{OOV}}$, and argue that they are better suited to evaluating the quality of generated handwritten images. The metrics rely on the recognition error/accuracy of Handwriting Text Recognition and Writer Identification models and emphasize writing style, textual content, and diversity as the main aspects that characterize handwritten images. We conduct comprehensive experiments on the IAM handwriting database, showcasing that widely used metrics such as FID fail to properly quantify the diversity and the practical utility of generated handwriting samples. Our findings show that our metrics are richer in information and underscore the necessity of standardized evaluation protocols in HTG. The proposed metrics provide a more robust and informative protocol for assessing HTG quality, contributing to improved performance in HTR. Code for the evaluation protocol is available at: this https URL.

[CV-24] Improved Single Camera BEV Perception Using Multi-Camera Training ITSC2024

链接: https://arxiv.org/abs/2409.02676
作者: Daniel Busch,Ido Freeman,Richard Meyes,Tobias Meisen
关键词-EN: Bird Eye View, Bird Eye, downstream autonomous driving, autonomous driving tasks, Eye View
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This Paper has been accepted to the 27th IEEE International Conference on Intelligent Transportation Systems (ITSC 2024)

点击查看摘要

Abstract:Bird’s Eye View (BEV) map prediction is essential for downstream autonomous driving tasks like trajectory prediction. In the past, this was accomplished through the use of a sophisticated sensor configuration that captured a surround view from multiple cameras. However, in large-scale production, cost efficiency is an optimization goal, so using fewer cameras becomes more relevant. But fewer input images typically lead to a performance drop. This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup. Although this cost restriction is primarily relevant to inference time on production cars, it is less problematic on a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround-view model reduced for single-camera inference. The approach includes three features: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view for single-camera inference, resulting in reduced hallucination and better quality of the BEV map.
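
Two of the listed ingredients are straightforward to wire up. The sketch below shows a cyclic LR schedule together with a feature reconstruction loss tying single-camera features to frozen six-camera targets; the stand-in model, loss weight, and schedule values are assumptions, not the authors' configuration.

```python
# Hypothetical wiring of two described ingredients: a cyclic learning-rate
# schedule and a feature reconstruction loss supervising the six-camera ->
# one-camera transition during training.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(256, 256)   # stand-in for the single-camera BEV encoder head
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=1e-3, step_size_up=500, mode="triangular")

def train_step(single_cam_feat, six_cam_feat_target, bev_loss):
    # bev_loss: task loss computed elsewhere (with grad); targets are detached
    rec = F.mse_loss(model(single_cam_feat), six_cam_feat_target.detach())
    loss = bev_loss + 0.1 * rec     # 0.1 is an assumed weighting
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()
    return loss
```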

[CV-25] Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

链接: https://arxiv.org/abs/2409.02664
作者: Kaiqing Lin,Yuzhen Lin,Weixiang Li,Taiping Yao,Bin Li
关键词-EN: faces poses huge, poses huge potential, huge potential negative, potential negative impacts, deepfake faces poses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection in recent years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via data perturbations, our method can reprogram a pretrained VLM model (e.g., CLIP) solely by manipulating its input, without tuning the inner parameters. Furthermore, we insert a pseudo-word guided by facial identity into the text prompt. Extensive experiments on several popular benchmarks demonstrate that (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake) using a pre-trained CLIP model with our proposed reprogramming method; (2) our superior performance comes at a lower cost in trainable parameters, making it a promising approach for real-world applications.
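
The input-manipulation idea can be sketched generically: a trainable perturbation is added to the input while the pretrained model stays frozen. Everything below (class name, perturbation form, stand-in backbone) is hypothetical, not the paper's implementation.

```python
# Minimal sketch of model reprogramming: only an additive input perturbation
# is trained; the pretrained model's inner parameters stay frozen.
import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    def __init__(self, frozen_model, img_size=224):
        super().__init__()
        self.model = frozen_model.eval()
        for p in self.model.parameters():
            p.requires_grad_(False)                       # keep inner parameters fixed
        self.delta = nn.Parameter(torch.zeros(1, 3, img_size, img_size))

    def forward(self, x):
        x = (x + torch.tanh(self.delta)).clamp(0, 1)      # bounded input perturbation
        return self.model(x)

# usage with any frozen image encoder (e.g., a CLIP visual tower)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))  # stand-in
rep = Reprogrammer(backbone)
logits = rep(torch.rand(4, 3, 224, 224))
```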

[CV-26] PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

链接: https://arxiv.org/abs/2409.02657
作者: Jun Ling,Yiwen Wang,Han Xue,Rong Xie,Li Song
关键词-EN: previous audio-driven talking, head, text prompts, talking head generation, audio
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 7+5 pages, 15 figures

点击查看摘要

Abstract:While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can freely generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realism compared to text-only or audio-only conditioning, and our video generator model outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: this https URL.

[CV-27] Skip-and-Play: Depth-Driven Pose-Preserved Image Generation for Any Objects

链接: https://arxiv.org/abs/2409.02653
作者: Kyungmin Jo,Jaegul Choo
关键词-EN: prompting subsequent efforts, high-quality images solely, pose control, depth-based pose control, pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The emergence of diffusion models has enabled the generation of diverse high-quality images solely from text, prompting subsequent efforts to enhance the controllability of these models. Despite the improvement in controllability, pose control remains limited to specific objects (e.g., humans) or poses (e.g., frontal view) due to the fact that pose is generally controlled via camera parameters (e.g., rotation angle) or keypoints (e.g., eyes, nose). Specifically, camera parameters-conditional pose control models generate unrealistic images depending on the object, owing to the small size of 3D datasets for training. Also, keypoint-based approaches encounter challenges in acquiring reliable keypoints for various objects (e.g., church) or poses (e.g., back view). To address these limitations, we propose depth-based pose control, as depth maps are easily obtainable from a single depth estimation model regardless of objects and poses, unlike camera parameters and keypoints. However, depth-based pose control confronts issues of shape dependency, as depth maps influence not only the pose but also the shape of the generated images. To tackle this issue, we propose Skip-and-Play (SnP), designed via analysis of the impact of three components of depth-conditional ControlNet on the pose and the shape of the generated images. To be specific, based on the analysis, we selectively skip parts of the components to mitigate shape dependency on the depth map while preserving the pose. Through various experiments, we demonstrate the superiority of SnP over baselines and showcase the ability of SnP to generate images of diverse objects and poses. Remarkably, SnP exhibits the ability to generate images even when the objects in the condition (e.g., a horse) and the prompt (e.g., a hedgehog) differ from each other.

[CV-28] Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering

链接: https://arxiv.org/abs/2409.02647
作者: Cornelius Bürkle,Fabian Oboril,Kay-Ulrich Scholl
关键词-EN: expanding digital display, digital display options, automotive industry, expanding digital, Cyclic Redundancy Checks
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: 9 pages

点击查看摘要

Abstract:The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon no longer be viable, as any sort of alpha blending, warping, or scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify the correctness of displayed content, using telltales (e.g. warning signs) as an example. It uses a learning-based approach to separate “good” telltales, i.e. those that a human driver will understand correctly, and “corrupted” telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay or scaling effects. This is underlined by our experimental study where all “corrupted” test patterns were correctly classified, while no false alarms were triggered.
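
Conceptually, such a monitor reduces to a binary classifier applied to the rendered telltale region at runtime. The sketch below is a hypothetical wiring; crop size, threshold, and model interface are assumptions, not the paper's system.

```python
# Hypothetical wiring (not the paper's system): a learned monitor classifies
# a cropped telltale region as correctly perceivable or corrupted, and a
# countermeasure is triggered when the "corrupted" probability is too high.
import torch
import torch.nn.functional as F

def check_telltale(monitor, frame, box, threshold=0.5):
    x0, y0, x1, y1 = box                           # telltale region in the frame (C, H, W)
    crop = frame[:, y0:y1, x0:x1].unsqueeze(0)     # (1, 3, h, w)
    crop = F.interpolate(crop, size=(64, 64))      # assumed monitor input size
    prob_corrupt = torch.sigmoid(monitor(crop))[0, 0]
    return "corrupted" if prob_corrupt > threshold else "ok"
```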

[CV-29] MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

链接: https://arxiv.org/abs/2409.02638
作者: Junyi Ma,Xieyuanli Chen,Wentao Bao,Jingyi Xu,Hesheng Wang
关键词-EN: embodied artificial intelligence, Understanding human intentions, Understanding human, artificial intelligence, path to embodied
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence. As a branch of egocentric vision techniques, hand trajectory prediction plays a vital role in comprehending human motion patterns, benefiting downstream tasks in extended reality and robot manipulation. However, capturing high-level human intentions consistent with reasonable temporal causality is challenging when only egocentric videos are available. This difficulty is exacerbated under camera egomotion interference and the absence of affordance labels to explicitly guide the optimization of hand waypoint distribution. In this work, we propose a novel hand trajectory prediction method dubbed MADiff, which forecasts future hand waypoints with diffusion models. The devised denoising operation in the latent space is achieved by our proposed motion-aware Mamba, where the camera wearer’s egomotion is integrated to achieve motion-driven selective scan (MDSS). To discern the relationship between hands and scenarios without explicit affordance supervision, we leverage a foundation model that fuses visual and language features to capture high-level semantics from video clips. Comprehensive experiments conducted on five public datasets with the existing and our proposed new evaluation metrics demonstrate that MADiff predicts reasonable hand trajectories on par with the state-of-the-art baselines and achieves real-time performance. We will release our code and pretrained models of MADiff at the project page: this https URL.

[CV-30] Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

链接: https://arxiv.org/abs/2409.02634
作者: Jianwen Jiang,Chao Liang,Jiaqi Yang,Gaojie Lin,Tianyun Zhong,Yanbo Zheng
关键词-EN: recently achieved significant, achieved significant breakthroughs, video generation techniques, diffusion-based video generation, human video generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.

[CV-31] AdvSecureNet: A Python Toolkit for Adversarial Machine Learning

链接: https://arxiv.org/abs/2409.02629
作者: Melih Catal,Manuel Günther
关键词-EN: Machine learning models, adversarial machine learning, models are vulnerable, Machine learning, learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch-based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at this https URL and installable via PyPI.
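
AdvSecureNet's own API is not reproduced here; as a generic illustration of the kind of attack such toolkits implement, the following is a plain-PyTorch sketch of FGSM, the classic one-step signed-gradient attack.

```python
# Generic FGSM sketch in plain PyTorch (illustrative; not AdvSecureNet's API).
import torch

def fgsm(model, x, y, eps=8 / 255):
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()     # one signed-gradient step
    return x_adv.clamp(0, 1).detach()   # keep the image in valid range
```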

[CV-32] GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering

链接: https://arxiv.org/abs/2409.02611
作者: Lingling Zhang,Muye Huang,QianYing Wang,Yaxian Wang,Wenjun Wu,Jun Liu
关键词-EN: data report generation, business data analysis, Chart Question Answering, visual chart content, report generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Chart Question Answering (CQA) aims at answering questions based on visual chart content, which plays an important role in chart summarization, business data analysis, and data report generation. CQA is a challenging multi-modal task because of the strong context dependence and complex reasoning requirements. The former refers to answering the question strictly based on the analysis of the visual content or internal data of the given chart, while the latter emphasizes the various logical and numerical reasoning involved in the answer prediction process. In this paper, we focus on the complex reasoning in the CQA task, and propose a novel Graph-of-Thought (GoT) guided compositional reasoning model called GoT-CQA to overcome this problem. First, we transform the chart-oriented question into a directed acyclic GoT composed of multiple operator nodes, including localization, numerical, and logical operators. It intuitively reflects the human brain’s solution process for this question. After that, we design an efficient auto-compositional reasoning framework guided by the GoT, to execute the multi-step reasoning operations in various types of questions. Comprehensive experiments on the ChartQA and PlotQA-D datasets show that GoT-CQA achieves outstanding performance, especially on complex human-written and reasoning questions, compared with the latest popular baselines.

[CV-33] A Medical Multimodal Large Language Model for Pediatric Pneumonia

链接: https://arxiv.org/abs/2409.02608
作者: Weiwei Tian,Xinyu Huang,Tianhao Cheng,Wen He,Jinwu Fang,Rui Feng,Daoying Geng,Xiaobo Zhang
关键词-EN: Pediatric pneumonia, treating pediatric pneumonia, pediatric pneumonia shares, years worldwide, imposing a substantial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:Pediatric pneumonia is the leading cause of death among children under five years worldwide, imposing a substantial burden on affected families. Currently, there are three significant hurdles in diagnosing and treating pediatric pneumonia. Firstly, pediatric pneumonia shares similar symptoms with other respiratory diseases, making rapid and accurate differential diagnosis challenging. Secondly, primary hospitals often lack sufficient medical resources and experienced doctors. Lastly, providing personalized diagnostic reports and treatment recommendations is labor-intensive and time-consuming. To tackle these challenges, we proposed a Medical Multimodal Large Language Model for Pediatric Pneumonia (P2Med-MLLM). It was capable of handling diverse clinical tasks, such as generating free-text radiology reports and medical records within a unified framework. Specifically, P2Med-MLLM can process both pure text and image-text data, trained on an extensive and large-scale dataset (P2Med-MD), including real clinical information from 163,999 outpatient and 8,684 inpatient cases. This dataset comprised 2D chest X-ray images, 3D chest CT images, corresponding radiology reports, and outpatient and inpatient records. We designed a three-stage training strategy to enable P2Med-MLLM to comprehend medical knowledge and follow instructions for various clinical tasks. To rigorously evaluate P2Med-MLLM’s performance, we developed P2Med-MBench, a benchmark consisting of 642 meticulously verified samples by pediatric pulmonology specialists, covering six clinical decision-support tasks and a balanced variety of diseases. The automated scoring results demonstrated the superiority of P2Med-MLLM. This work plays a crucial role in assisting primary care doctors with prompt disease diagnosis and treatment planning, reducing severe symptom mortality rates, and optimizing the allocation of medical resources.

[CV-34] A Fashion Item Recommendation Model in Hyperbolic Space CVPR2024

链接: https://arxiv.org/abs/2409.02599
作者: Ryotaro Shimizu,Yu Wang,Masanari Kimura,Yuki Hirakawa,Takashi Wada,Yuki Saito,Julian McAuley
关键词-EN: fashion item recommendation, incorporates hyperbolic geometry, propose a fashion, geometry into user, item recommendation model
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This work was presented at the CVFAD Workshop at CVPR 2024

点击查看摘要

Abstract:In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users’ purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
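
The key ingredient is the Poincaré-ball distance used in place of the Euclidean one. The formula below is the standard hyperbolic distance; how it is wired into the recommender's loss is not shown, and the toy embeddings are assumptions.

```python
# Standard Poincaré-ball distance (textbook formula, not the paper's code):
# d(u, v) = arccosh(1 + 2*||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))).
import torch

def poincare_distance(u, v, eps=1e-5):
    # u, v: (..., d) embeddings with norm < 1 (inside the unit ball)
    sq = ((u - v) ** 2).sum(-1)
    nu = (1 - (u ** 2).sum(-1)).clamp_min(eps)
    nv = (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / (nu * nv))

user = torch.rand(8, 32) * 0.1       # toy user embeddings inside the ball
item = torch.rand(8, 32) * 0.1       # toy item embeddings
d = poincare_distance(user, item)    # smaller distance = stronger preference
```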

[CV-35] SurgTrack: CAD-Free 3D Tracking of Real-world Surgical Instruments

链接: https://arxiv.org/abs/2409.02598
作者: Wenwu Guo,Jinlin Wu,Zhen Chen,Qingxiang Zhao,Miao Xu,Zhen Lei,Hongbin Liu
关键词-EN: received increasing attention, Vision-based surgical navigation, increasing attention due, vision-based navigation system, tracking surgical instruments
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Vision-based surgical navigation has received increasing attention due to its non-invasive, cost-effective, and flexible advantages. In particular, a critical element of the vision-based navigation system is tracking surgical instruments. Compared with 2D instrument tracking methods, 3D instrument tracking has broader value in clinical practice, but is also more challenging due to weak texture, occlusion, and lack of Computer-Aided Design (CAD) models for 3D registration. To solve these challenges, we propose SurgTrack, a two-stage 3D instrument tracking method for CAD-free and robust real-world applications. In the first registration stage, we incorporate an Instrument Signed Distance Field (SDF) modeling the 3D representation of instruments, achieving CAD-free 3D registration. Due to this, we can obtain the location and orientation of instruments in the 3D space by matching the video stream with the registered SDF model. In the second tracking stage, we devise a posture graph optimization module, leveraging the historical tracking results of the posture memory pool to optimize the tracking results and improve the occlusion robustness. Furthermore, we collect the Instrument3D dataset to comprehensively evaluate the 3D tracking of surgical instruments. The extensive experiments validate the superiority and scalability of our SurgTrack, outperforming the state of the art by a remarkable margin. The code and dataset are available at this https URL.

[CV-36] BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network

链接: https://arxiv.org/abs/2409.02584
作者: N. T. Diba,N. Akter,S. A. H. Chowdhury,J. E. Giti
关键词-EN: Body Mass Index, person Body Mass, Mass Index, Body Mass, BMI
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A person’s Body Mass Index, or BMI, is the most widely used parameter for assessing their health. BMI is a crucial predictor of potential diseases that may arise at higher body fat levels because it is correlated with body fat. Conversely, a community’s or an individual’s nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset containing samples from 48 people in lowercase English scripts is successfully captured for the BMI prediction task. The proposed CNN-based approach reports a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second and third-best performance, with the accuracy of 99.69% and 99.53%, respectively.

[CV-37] Object Gaussian for Monocular 6D Pose Estimation from Sparse Views

链接: https://arxiv.org/abs/2409.02581
作者: Luqing Luo,Shichu Sun,Jiangang Yang,Linfang Zheng,Jinwei Du,Jian Liu
关键词-EN: Monocular object pose, demand costly CAD, Monocular object, vision and robotics, heavily depends
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Monocular object pose estimation, as a pivotal task in computer vision and robotics, heavily depends on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) afford a compelling potential. Yet its performance still suffers and tends to overfit with fewer input views. Embracing this challenge, we introduce SGPose, a novel framework for sparse view object pose estimation using Gaussian-based methods. Given as few as ten views, SGPose generates a geometric-aware representation by starting with a random cuboid initialization, eschewing reliance on Structure-from-Motion (SfM) pipeline-derived geometry as required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the reconstructed model from sparse input and random initialization, while the geometric-consistent depth supervision and online synthetic view warping are key to the success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse view constraints, underscoring its potential in real-world applications.

[CV-38] Solving Video Inverse Problems Using Image Diffusion Models

链接: https://arxiv.org/abs/2409.02574
作者: Taesung Kwon,Jong Chul Ye
关键词-EN: including image super-resolution, video inverse problems, image diffusion models, diffusion model-based inverse, inverse problems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 22 pages, 16 figures

点击查看摘要

Abstract:Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page: this https URL
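
The batch-consistency trick is compact enough to sketch: at each reverse step, the stochastic noise is drawn once and shared across the batch (time) dimension so that frames stay temporally coherent. This is an illustrative sketch, not the authors' sampler.

```python
# Illustrative sketch of batch-consistent noise: one noise sample is shared
# by all frames, with the video's time dimension treated as the batch dim.
import torch

def batch_consistent_noise(num_frames, shape, device="cpu"):
    eps = torch.randn(1, *shape, device=device)      # draw the noise once...
    return eps.expand(num_frames, *shape).clone()    # ...and share it across frames

noise = batch_consistent_noise(16, (3, 64, 64))
assert torch.equal(noise[0], noise[15])              # identical across the time/batch dim
```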

[CV-39] Evaluation Study on SAM 2 for Class-agnostic Instance-level Segmentation

链接: https://arxiv.org/abs/2409.02567
作者: Tiantian Zhang,Zhangjun Zhou,Jialun Pei
关键词-EN: demonstrated powerful zero-shot, powerful zero-shot segmentation, Segment Anything Model, Shadow Instance Detection, Salient Instance Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segment Anything Model (SAM) has demonstrated powerful zero-shot segmentation performance in natural scenes. The recently released Segment Anything Model 2 (SAM2) has further heightened researchers’ expectations towards image segmentation capabilities. To evaluate the performance of SAM2 on class-agnostic instance-level segmentation tasks, we adopt different prompt strategies for SAM2 to cope with instance-level tasks for three relevant scenarios: Salient Instance Segmentation (SIS), Camouflaged Instance Segmentation (CIS), and Shadow Instance Detection (SID). In addition, to further explore the effectiveness of SAM2 in segmenting granular object structures, we also conduct detailed tests on the high-resolution Dichotomous Image Segmentation (DIS) benchmark to assess the fine-grained segmentation capability. Qualitative and quantitative experimental results indicate that the performance of SAM2 varies significantly across different scenarios. Besides, SAM2 is not particularly sensitive to segmenting high-resolution fine details. We hope this technical report can drive the emergence of SAM2-based adapters, aiming to enhance the performance ceiling of large vision models on class-agnostic instance segmentation tasks.

[CV-40] How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations

链接: https://arxiv.org/abs/2409.02566
作者: Florian Blume,Runfeng Qu,Pia Bideau,Martin Maier,Rasha Abdel Rahman,Olaf Hellwich
关键词-EN: Facial expression perception, humans inherently relies, contextual cues, contributing to efficient, flexible processing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: GCPR 2024

点击查看摘要

Abstract:Facial expression perception in humans inherently relies on prior knowledge and contextual cues, contributing to efficient and flexible processing. For instance, multi-modal emotional context (such as voice color, affective text, body pose, etc.) can prompt people to perceive emotional expressions in objectively neutral faces. Drawing inspiration from this, we introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. With this, our model offers visual insights into its internal decision-making process. We achieve this by learning two independent representations of content and context using a VAE-GAN architecture. Subsequently, we propose a novel attention mechanism for context-dependent feature adaptation. The adapted representation is used for classification and to generate a context-augmented expression. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations. We achieve State-of-the-Art classification accuracies of 81.01% on the RAVDESS dataset and 79.34% on the MEAD dataset. We make our code publicly available.

[CV-41] Interacting Multiple Model-based Joint Homography Matrix and Multiple Object State Estimation

链接: https://arxiv.org/abs/2409.02562
作者: Paul Johannes Claasen,Johan Pieter de Villiers
关键词-EN: Joint Homography State, IMM Joint Homography, Joint Homography, Homography State Estimation, MOT algorithm
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint submitted to Information Fusion

点击查看摘要

Abstract:A novel MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. By jointly modelling the camera projection matrix as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined through the use of an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets.
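
For readers unfamiliar with IMM filtering, the model-probability update at its core is textbook material and easy to sketch; the transition matrix below (static vs. dynamic camera model) is an illustrative assumption, not the paper's tuning.

```python
# Textbook IMM model-probability update (not the paper's full filter):
# mixing probabilities come from a Markov transition matrix, then each
# model's measurement likelihood reweights the predicted probabilities.
import numpy as np

def imm_update(mu, likelihoods, P):
    # mu: (M,) prior model probabilities; likelihoods: (M,) per-model
    # measurement likelihoods; P: (M, M) model transition matrix
    c = P.T @ mu                       # predicted model probabilities
    mu_new = likelihoods * c           # Bayes reweighting
    return mu_new / mu_new.sum()       # normalize

P = np.array([[0.95, 0.05], [0.05, 0.95]])   # e.g., static vs. dynamic camera model
mu = imm_update(np.array([0.5, 0.5]), np.array([0.8, 0.2]), P)
```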

[CV-42] Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation

链接: https://arxiv.org/abs/2409.02555
作者: Kangkai Zhang,Shiming Ge,Ruixin Shi,Dan Zeng
关键词-EN: challenging task due, Recognizing objects, challenging task, task due, lack of informative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.
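
One common way to realize relational distillation is to align the pairwise similarity structure of the student's (low-resolution) features with that of a frozen teacher. The sketch below is illustrative; the temperature and the KL formulation are assumptions, not the paper's exact loss.

```python
# Illustrative relational distillation sketch: match the softened pairwise
# similarity matrices of teacher and student features over the same batch.
import torch
import torch.nn.functional as F

def relational_distillation_loss(f_student, f_teacher, tau=0.1):
    # f_*: (B, D) embeddings of the same batch (teacher sees high-res inputs)
    s = F.normalize(f_student, dim=-1)
    t = F.normalize(f_teacher, dim=-1)
    sim_s = s @ s.t() / tau            # (B, B) student relations
    sim_t = t @ t.t() / tau            # (B, B) teacher relations
    return F.kl_div(F.log_softmax(sim_s, dim=-1),
                    F.softmax(sim_t, dim=-1), reduction="batchmean")
```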

[CV-43] Real-Time Dynamic Scale-Aware Fusion Detection Network: Take Road Damage Detection as an example

链接: https://arxiv.org/abs/2409.02546
作者: Weichao Pan,Xu Wang,Wenqing Huan
关键词-EN: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, reducing labor costs, Road Damage Detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV)-based Road Damage Detection (RDD) is important for daily maintenance and safety in cities, especially in terms of significantly reducing labor costs. However, current UAV-based RDD research still faces many challenges. For example, damage with irregular size and direction, the masking of damage by the background, and the difficulty of distinguishing damage from the background significantly affect the ability of UAVs to detect road damage in daily inspection. To solve these problems and improve the performance of UAVs in real-time road damage detection, we design and propose three corresponding modules: a feature extraction module that flexibly adapts to shape and background; a module that fuses multiscale perception and adapts to shape and background; and an efficient downsampling module. Based on these modules, we designed a multi-scale, adaptive road damage detection model with the ability to automatically remove background interference, called the Dynamic Scale-Aware Fusion Detection Model (RT-DSAFDet). Experimental results on the UAV-PDD2023 public dataset show that our model RT-DSAFDet achieves a mAP50 of 54.2%, which is 11.1% higher than that of YOLOv10-m, an efficient variant of the latest real-time object detection model YOLOv10, while the number of parameters is reduced to 1.8M and FLOPs to 4.6G, decreases of 88% and 93%, respectively. Furthermore, results on the large general object detection dataset MS COCO2017 also show the superiority of our model: its mAP50-95 matches that of YOLOv9-t, with 0.5% higher mAP50, 10% fewer parameters, and 40% fewer FLOPs.

[CV-44] UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching

链接: https://arxiv.org/abs/2409.02545
作者: Soomin Kim,Hyesong Choi,Jihye Ahn,Dongbo Min
关键词-EN: stereo depth estimation, Unlike other vision, Transformer-based stereo approaches, Transformer-based stereo, Transformer-based stereo architectures
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying self-supervised learning used for pre-training with stereo matching framework based on supervised learning. To be specific, we explore the effectiveness of reconstructing features of masked portions in an input image and at the same time predicting corresponding points in another image from the perspective of locality inductive bias, which is crucial in training models with limited training data. Moreover, to address these challenging tasks of reconstruction-and-prediction, we present a new strategy to vary a masking ratio when training the stereo model with stereo-tailored losses. State-of-the-art performance of UniTT-Stereo is validated on various benchmarks such as ETH3D, KITTI 2012, and KITTI 2015 datasets. Lastly, to investigate the advantages of the proposed approach, we provide a frequency analysis of feature maps and the analysis of locality inductive bias based on attention maps.

[CV-45] StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models ECCV2024

链接: https://arxiv.org/abs/2409.02543
作者: Wen Li,Muyuan Fang,Cheng Zou,Biao Gong,Ruobing Zheng,Meng Wang,Jingdong Chen,Ming Yang
关键词-EN: style, controlling image styles, challenging task, image, burst of innovative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024

点击查看摘要

Abstract:Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However, these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges. Firstly, how to inject the style representation without compromising the effectiveness of text representation in control. Secondly, how to obtain the accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt. The code and dataset are available at this https URL.

[CV-46] Sample what you can't compress

链接: https://arxiv.org/abs/2409.02529
作者: Vighnesh Birodkar,Gabriel Barcik,James Lyon,Sergey Ioffe,David Minnen,Joshua V. Dillon
关键词-EN: produce blurry results, learned image representations, learned image, produce blurry, blurry results
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach “Sample what you can’t compress”, or SWYCC for short.
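
A minimal training step for an autoencoder under a diffusion-based loss might look as follows; the encoder/denoiser interfaces and the noise-schedule handling are assumptions, and only the standard epsilon-prediction objective is shown.

```python
# Minimal sketch (assumed interfaces, not the paper's code): the encoder
# produces a continuous latent z, and the decoder is a denoiser conditioned
# on z that predicts the noise added to the clean image.
import torch
import torch.nn.functional as F

def diffusion_ae_step(encoder, denoiser, x, alphas_cumprod):
    z = encoder(x)                                          # continuous latent
    t = torch.randint(0, len(alphas_cumprod), (x.size(0),), device=x.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps               # forward diffusion
    eps_pred = denoiser(x_t, t, z)                          # z-conditioned denoiser
    return F.mse_loss(eps_pred, eps)                        # standard epsilon loss
```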

[CV-47] SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction

链接: https://arxiv.org/abs/2409.02513
作者: Sumin Son,Hyesong Choi,Dongbo Min
关键词-EN: Masked Image Modeling, achieve exceptional performance, Guided Masked Image, enabling pre-trained models, Masked Image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Masked Image Modeling (MIM) techniques have redefined the landscape of computer vision, enabling pre-trained models to achieve exceptional performance across a broad spectrum of tasks. Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped. Existing MIM approaches primarily rely on single-image inputs, which makes it challenging to capture the crucial structured information, leading to suboptimal performance in tasks requiring fine-grained feature representation. To address these limitations, we propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images. SG-MIM employs a lightweight relational guidance framework, allowing it to guide structured knowledge individually at the feature level rather than naively combining at the pixel level within the same architecture, as is common in traditional multi-modal pre-training methods. This approach enables the model to efficiently capture essential information while minimizing discrepancies between pre-training and downstream tasks. Furthermore, SG-MIM employs a selective masking strategy to incorporate structured knowledge, maximizing the synergy between general representation learning and structured knowledge-specific learning. Our method requires no additional annotations, making it a versatile and efficient solution for a wide range of applications. Our evaluations on the KITTI, NYU-v2, and ADE20k datasets demonstrate SG-MIM’s superiority in monocular depth estimation and semantic segmentation.

[CV-48] TLD: A Vehicle Tail Light Signal Dataset and Benchmark

链接: https://arxiv.org/abs/2409.02508
作者: Jinhao Chai,Shiyi Mu,Shugong Xu
关键词-EN: Understanding other drivers’, crucial for safe, Understanding, drivers’ intentions, intentions is crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Understanding other drivers’ intentions is crucial for safe driving. The role of taillights in conveying these intentions is underemphasized in current autonomous driving systems. Accurately identifying taillight signals is essential for predicting vehicle behavior and preventing collisions. Open-source taillight datasets are scarce, often small and inconsistently annotated. To address this gap, we introduce a new large-scale taillight dataset called TLD. Sourced globally, our dataset covers diverse traffic scenarios. To our knowledge, TLD is the first dataset to separately annotate brake lights and turn signals in real driving scenarios. We collected 17.78 hours of driving videos from the internet. This dataset consists of 152k labeled image frames sampled at a rate of 2 Hz, along with 1.5 million unlabeled frames interspersed throughout. Additionally, we have developed a two-stage vehicle light detection model consisting of two primary modules: a vehicle detector and a taillight classifier. Initially, YOLOv10 and DeepSORT captured consecutive vehicle images over time. Subsequently, the two classifiers work simultaneously to determine the states of the brake lights and turn signals. A post-processing procedure is then used to eliminate noise caused by misidentifications and provide the taillight states of the vehicle within a given time frame. Our method shows exceptional performance on our dataset, establishing a benchmark for vehicle taillight detection. The dataset is available at this https URL
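
The described post-processing can be as simple as a sliding majority vote over per-frame classifier outputs; the window size below is an assumption.

```python
# Illustrative post-processing sketch: smooth per-frame taillight states with
# a sliding majority vote to suppress single-frame misidentifications.
from collections import Counter

def smooth_states(states, window=5):
    # states: per-frame labels, e.g. ["off", "brake", "left_turn", ...]
    half = window // 2
    out = []
    for i in range(len(states)):
        seg = states[max(0, i - half): i + half + 1]
        out.append(Counter(seg).most_common(1)[0][0])   # majority label in window
    return out

print(smooth_states(["off", "off", "brake", "off", "off"]))  # lone "brake" removed
```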

[CV-49] Plane2Depth: Hierarchical Adaptive Plane Guidance for Monocular Depth Estimation

链接: https://arxiv.org/abs/2409.02494
作者: Li Liu,Ruijie Zhu,Jiacheng Deng,Ziyang Song,Wenfei Yang,Tianzhu Zhang
关键词-EN: Monocular depth estimation, dense depth map, depth estimation aims, Monocular depth, single image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 12 figures, 8 tables

点击查看摘要

Abstract:Monocular depth estimation aims to infer a dense depth map from a single image, which is a fundamental and prevalent task in computer vision. Many previous works have shown impressive depth estimation results through carefully designed network structures, but they usually ignore the planar information and therefore perform poorly in low-texture areas of indoor scenes. In this paper, we propose Plane2Depth, which adaptively utilizes plane information to improve depth prediction within a hierarchical framework. Specifically, in the proposed plane guided depth generator (PGDG), we design a set of plane queries as prototypes to softly model planes in the scene and predict per-pixel plane coefficients. Then the predicted plane coefficients can be converted into metric depth values with the pinhole camera model. In the proposed adaptive plane query aggregation (APGA) module, we introduce a novel feature interaction approach to improve the aggregation of multi-scale plane features in a top-down manner. Extensive experiments show that our method can achieve outstanding performance, especially in low-texture or repetitive areas. Furthermore, under the same backbone network, our method outperforms the state-of-the-art methods on the NYU-Depth-v2 dataset, achieves competitive results with state-of-the-art methods on the KITTI dataset, and can be generalized to unseen scenes effectively.
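
The plane-to-depth conversion via the pinhole model admits a closed form: a pixel on plane n·X = d back-projects to depth z = d / (n · K⁻¹[u, v, 1]ᵀ). The sketch below implements exactly this relation; the intrinsics and the example plane are toy values.

```python
# Closed-form plane-to-depth conversion under the pinhole model: for a pixel
# on plane n·X = d, depth is z = d / (n · K^{-1} [u, v, 1]).
import numpy as np

def plane_to_depth(n, d, K, H, W):
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ rays                 # back-projected pixel rays
    denom = n @ rays                               # n · K^{-1} p for each pixel
    z = d / np.where(np.abs(denom) < 1e-8, 1e-8, denom)
    return z.reshape(H, W)

K = np.array([[500.0, 0, 128], [0, 500.0, 96], [0, 0, 1]])  # toy intrinsics
depth = plane_to_depth(np.array([0.0, 0.0, 1.0]), 2.0, K, 192, 256)  # frontal wall at z=2
```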

[CV-50] Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine

链接: https://arxiv.org/abs/2409.02492
作者: Jialong Li,Zhicheng Zhang,Yunwei Chen,Qiqi Lu,Ye Wu,Xiaoming Liu,QianJin Feng,Yanqiu Feng,Xinyuan Zhang
关键词-EN: holds significant importance, Diffusion tensor imaging, holds significant, neuroscience research, diffusion tensor field
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into the diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization objective is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
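
The WLLS fitting step that DoDTI builds on is standard DTI math: the log-signal is linear in the six tensor elements plus ln S0, with weights commonly approximated by the squared signal. The sketch below shows a single-voxel fit (not the paper's unrolled network).

```python
# Standard single-voxel WLLS tensor fit (textbook DTI, not the paper's code):
# ln S = ln S0 - b * g^T D g, linear in [ln S0, Dxx, Dyy, Dzz, Dxy, Dxz, Dyz].
import numpy as np

def wlls_tensor_fit(signals, bvals, bvecs):
    # signals: (m,) DW measurements; bvals: (m,); bvecs: (m, 3) unit vectors
    g = bvecs
    B = np.column_stack([
        np.ones_like(bvals),
        -bvals * g[:, 0] ** 2, -bvals * g[:, 1] ** 2, -bvals * g[:, 2] ** 2,
        -2 * bvals * g[:, 0] * g[:, 1],
        -2 * bvals * g[:, 0] * g[:, 2],
        -2 * bvals * g[:, 1] * g[:, 2],
    ])                                          # (m, 7) design matrix
    y = np.log(np.clip(signals, 1e-6, None))
    w = signals ** 2                            # common WLLS weights ≈ signal^2
    Bw = B * w[:, None]                         # rows scaled by weights
    coef = np.linalg.lstsq(Bw.T @ B, Bw.T @ y, rcond=None)[0]
    ln_s0, dxx, dyy, dzz, dxy, dxz, dyz = coef
    D = np.array([[dxx, dxy, dxz], [dxy, dyy, dyz], [dxz, dyz, dzz]])
    return np.exp(ln_s0), D                     # S0 and the 3x3 diffusion tensor
```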

[CV-51] TP-GMOT: Tracking Generic Multiple Object by Textual Prompt with Motion-Appearance Cost (MAC) SORT

链接: https://arxiv.org/abs/2409.02490
作者: Duy Le Dinh Anh,Kim Hoang Tran,Ngan Hoang Le
关键词-EN: made substantial advancements, tracking multiple generic, tracking multiple, Multiple Object Tracking, tracking multiple objects
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While Multi-Object Tracking (MOT) has made substantial advancements, it relies heavily on prior knowledge and is limited to predefined categories. In contrast, Generic Multiple Object Tracking (GMOT), tracking multiple objects with similar appearance, requires less prior information about the targets but faces challenges with variants like viewpoint, lighting, occlusion, and resolution. Our contributions commence with the introduction of the Refer-GMOT dataset, a collection of videos, each accompanied by fine-grained textual descriptions of their attributes. Subsequently, we introduce a novel text prompt-based open-vocabulary GMOT framework, called TP-GMOT, which can track never-seen object categories with zero training examples. Within the TP-GMOT framework, we introduce two novel components: (i) TP-OD, object detection by a textual prompt, for accurately detecting unseen objects with specific characteristics; (ii) Motion-Appearance Cost SORT (MAC-SORT), a novel object association approach that adeptly integrates motion- and appearance-based matching strategies to tackle the complex task of tracking multiple generic objects with high similarity. Our contributions are benchmarked on the Refer-GMOT dataset for the GMOT task. Additionally, to assess the generalizability of the proposed TP-GMOT framework and the effectiveness of the MAC-SORT tracker, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models will be publicly available at: this https URL
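
The motion-appearance cost mixing at the heart of such trackers can be sketched generically: a convex blend of a motion distance and an appearance distance, solved with the Hungarian algorithm. The weights and cost definitions below are assumptions, not the released MAC-SORT code.

```python
# Generic association sketch: blend motion and appearance costs, then solve
# the optimal one-to-one matching with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_cost, appearance_cost, alpha=0.5):
    # motion_cost, appearance_cost: (num_tracks, num_detections), lower = better
    cost = alpha * motion_cost + (1 - alpha) * appearance_cost
    rows, cols = linear_sum_assignment(cost)    # optimal track-detection matching
    return list(zip(rows.tolist(), cols.tolist()))

m = np.random.rand(3, 4)    # e.g., normalized Mahalanobis distances
a = np.random.rand(3, 4)    # e.g., 1 - cosine similarity of embeddings
print(associate(m, a))
```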

[CV-52] Boosting Generalizability towards Zero-Shot Cross-Dataset Single-Image Indoor Depth by Meta-Initialization IROS2024

链接: https://arxiv.org/abs/2409.02486
作者: Cho-Ying Wu,Yiqi Zhong,Junying Wang,Ulrich Neumann
关键词-EN: Indoor robots rely, obstacle detection, robots rely, navigation or obstacle, indoor single-image depth
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IROS 2024. The version supersedes 2305.07269 . arXiv admin note: text overlap with arXiv:2305.07269

点击查看摘要

Abstract:Indoor robots rely on depth to perform tasks like navigation or obstacle detection, and single-image depth estimation is widely used to assist perception. Most indoor single-image depth prediction focuses less on model generalizability to unseen datasets, a concern for in-the-wild robustness in system deployment. This work leverages gradient-based meta-learning to gain higher generalizability on zero-shot cross-dataset inference. Unlike the most-studied meta-learning of image classification associated with explicit class labels, no explicit task boundaries exist for continuous depth values tied to highly varying indoor environments regarding object arrangement and scene composition. We propose a fine-grained task that treats each RGB-D mini-batch as a task in our meta-learning formulation. We first show that, on limited data, our method induces a much better prior (max 27.8% in RMSE). Then, fine-tuning on the meta-learned initialization consistently outperforms baselines without the meta approach. Aiming at generalization, we propose zero-shot cross-dataset protocols and validate the higher generalizability induced by our meta-initialization, a simple and useful plugin to many existing depth estimation methods. This work at the intersection of depth and meta-learning potentially drives both fields a step closer to practical robotic and machine perception usage.

[CV-53] TASAR: Transferable Attack on Skeletal Action Recognition

链接: https://arxiv.org/abs/2409.02483
作者: Yunfeng Diao,Baiqi Wu,Ruixuan Zhang,Ajian Liu,Xingxing Wei,Meng Wang,He Wang
关键词-EN: Human Activity Recognition, Human Activity, human behaviors, Activity Recognition, Skeletal Action Recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2407.08572

点击查看摘要

Abstract:Skeletal sequences, as well-structured representations of human behaviors, are crucial in Human Activity Recognition (HAR). The transferability of adversarial skeletal sequences enables attacks in real-world HAR scenarios, such as autonomous driving, intelligent surveillance, and human-computer interactions. However, existing Skeleton-based HAR (S-HAR) attacks exhibit weak adversarial transferability and, therefore, cannot be considered true transfer-based S-HAR attacks. More importantly, the reason for this failure remains unclear. In this paper, we study this phenomenon through the lens of loss surface, and find that its sharpness contributes to the poor transferability in S-HAR. Inspired by this observation, we assume and empirically validate that smoothening the rugged loss landscape could potentially improve adversarial transferability in S-HAR. To this end, we propose the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothed model posterior without re-training the pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike previous transfer-based attacks that treat each frame independently and overlook temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack gradient, effectively disrupting the spatial-temporal coherence of S-HARs. To exhaustively evaluate the effectiveness of existing methods and our method, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available in the supplementary material.

[CV-54] Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes

链接: https://arxiv.org/abs/2409.02482
作者: Stefano Esposito,Anpei Chen,Christian Reiser,Samuel Rota Bulò,Lorenzo Porzi,Katja Schwarz,Christian Richardt,Michael Zollhöfer,Peter Kontschieder,Andreas Geiger
关键词-EN: High-quality real-time view, High-quality real-time, High-quality, real-time view synthesis, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.

[CV-55] Detecting Korean Food Using Image using Hierarchical Model

链接: https://arxiv.org/abs/2409.02448
作者: Hoang Khanh Lam,Kahandakanaththage Maduni Pramuditha Perera
关键词-EN: Korean Food lovers, Korean Food, Food lovers, food before consuming, identify the Korean
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A solution was made available for Korean food lovers with dietary restrictions to identify Korean dishes before consuming them. Just by uploading a clear photo of the dish, people can learn what they are eating. Image processing techniques together with machine learning made this solution possible.

[CV-56] Non-target Divergence Hypothesis: Toward Understanding Domain Gaps in Cross-Modal Knowledge Distillation

链接: https://arxiv.org/abs/2409.02438
作者: Yilong Chen,Zongyi Xu,Xiaoshui Huang,Shanshan Zhao,Xinqi Jiang,Xinyu Gao,Xinbo Gao
关键词-EN: cross-modal knowledge distillation, Compared to single-modal, severe challenges due, knowledge distillation, cross-modal knowledge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Compared to single-modal knowledge distillation, cross-modal knowledge distillation faces more severe challenges due to domain gaps between modalities. Although various methods have proposed various solutions to overcome these challenges, there is still limited research on how domain gaps affect cross-modal knowledge distillation. This paper provides an in-depth analysis and evaluation of this issue. We first introduce the Non-Target Divergence Hypothesis (NTDH) to reveal the impact of domain gaps on cross-modal knowledge distillation. Our key finding is that domain gaps between modalities lead to distribution differences in non-target classes, and the smaller these differences, the better the performance of cross-modal knowledge distillation. Subsequently, based on Vapnik-Chervonenkis (VC) theory, we derive the upper and lower bounds of the approximation error for cross-modal knowledge distillation, thereby theoretically validating the NTDH. Finally, experiments on five cross-modal datasets further confirm the validity, generalisability, and applicability of the NTDH.

[CV-57] Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis

链接: https://arxiv.org/abs/2409.02429
作者: Aishwarya Agarwal,Srikrishna Karanam,Balaji Vasan Srinivasan
关键词-EN: user-supplied reference image, disentangled fashion, controlling the outputs, reference image, color and style
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 17 figures

点击查看摘要

Abstract:We consider the problem of independently, in a disentangled fashion, controlling the outputs of text-to-image diffusion models with the color and style attributes of a user-supplied reference image. We present the first training-free, test-time-only method to disentangle and condition text-to-image models on the color and style attributes of a reference image. To realize this, we propose two key innovations. Our first contribution is to transform the latent codes at inference time using feature transformations that make the covariance matrix of the current generation follow that of the reference image, helping meaningfully transfer color. Next, we observe that there exists a natural disentanglement between color and style in the LAB image space, which we exploit to transform the self-attention feature maps of the image being generated with respect to those of the reference computed from its L channel. Both these operations happen purely at test time and can be done independently or merged. This results in a flexible method where color and style information can come from the same reference image or two different sources, and a new generation can seamlessly fuse them in either scenario.

[CV-58] Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

链接: https://arxiv.org/abs/2409.02426
作者: Peng Wang,Huijie Zhang,Zekai Zhang,Siyi Chen,Yi Ma,Qing Qu
关键词-EN: Recent empirical studies, Recent empirical, diffusion models, image data, image
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 39 pages, 9 figures

点击查看摘要

Abstract:Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.

[CV-59] MOSMOS: Multi-organ segmentation facilitated by medical report supervision

链接: https://arxiv.org/abs/2409.02418
作者: Weiwei Tian,Xinyu Huang,Junlin Hou,Caiyue Ren,Longquan Jiang,Rui-Wei Zhao,Gang Jin,Yuejie Zhang,Daoying Geng
关键词-EN: visual question answering, demonstrated incredible achievements, coarse-grained downstream tasks, modern medical systems, Medical Vision-Language Pre-training
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 7 figures

点击查看摘要

Abstract:Owing to a large amount of multi-modal data in modern medical systems, such as medical images and reports, Medical Vision-Language Pre-training (Med-VLP) has demonstrated incredible achievements in coarse-grained downstream tasks (i.e., medical classification, retrieval, and visual question answering). However, the problem of transferring knowledge learned from Med-VLP to fine-grained multi-organ segmentation tasks has barely been investigated. Multi-organ segmentation is challenging mainly due to the lack of large-scale fully annotated datasets and the wide variation in the shape and size of the same organ between individuals with different diseases. In this paper, we propose a novel pre-training fine-tuning framework for Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS). Specifically, we first introduce global contrastive learning to maximally align the medical image-report pairs in the pre-training stage. To remedy the granularity discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags. More importantly, our pre-trained models can be transferred to any segmentation model by introducing the pixel-tag attention maps. Different network settings, i.e., 2D U-Net and 3D UNETR, are utilized to validate the generalization. We have extensively evaluated our approach using different diseases and modalities on BTCV, AMOS, MMWHS, and BRATS datasets. Experimental results in various settings demonstrate the effectiveness of our framework. This framework can serve as the foundation to facilitate future research on automatic annotation tasks under the supervision of medical reports.

[CV-60] Local map Construction Methods with SD map: A Novel Survey

链接: https://arxiv.org/abs/2409.02415
作者: Jiaqi Li,Pingfan Jia,Jiaxing Chen,Jiaxi Liu,Lei He
关键词-EN: Local map perception, autonomous driving technology, Local maps emerging, Local map, map perception methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 11 figures

点击查看摘要

Abstract:In recent years, significant academic advancements have been made in the field of autonomous vehicles, with Local maps emerging as a crucial component of autonomous driving technology. Local maps not only provide intricate details of road networks but also serve as fundamental inputs for critical tasks such as vehicle localization, navigation, and decision-making. Given the characteristics of SD map (Standard Definition Map), which include low cost, ease of acquisition, and high versatility, perception methods that integrate SD map as prior information have demonstrated significant potential in the field of Local map perception. The purpose of this paper is to provide researchers with a comprehensive overview and summary of the latest advancements in the integration of SD map as prior information for Local map perception methods. This review begins by introducing the task definition and general pipeline of local map perception methods that incorporate SD maps as prior information, along with relevant public datasets. It then focuses on the representation and encoding methods of multi-source information, as well as the methods for fusing multi-source information. In response to this burgeoning trend, this article presents a comprehensive and meticulous overview of the diverse research efforts in this particular field. Finally, the article addresses pertinent issues and future challenges with the aim of guiding researchers in understanding the current trends and methodologies prevalent in the field.

[CV-61] Hadamard Row-Wise Generation Algorithm

链接: https://arxiv.org/abs/2409.02406
作者: Brayan Monroy,Jorge Bacca
关键词-EN: generating specific Hadamard, specific Hadamard rows, specific Hadamard, addressing the memory, Hadamard rows
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce an efficient algorithm for generating specific Hadamard rows, addressing the memory demands of pre-computing the entire matrix. Leveraging Sylvester's recursive construction, our method generates the required i-th row on demand, significantly reducing computational resources. The algorithm uses the Kronecker product to construct the desired row from the binary representation of the index, without creating the full matrix. This approach is particularly useful for single-pixel imaging systems that need only one row at a time.
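
The row-on-demand idea follows directly from Sylvester's construction: row i of a Kronecker product is the Kronecker product of the rows picked out by the bits of i. A short sketch of that recurrence (our reading of the construction; the paper's exact formulation may differ):

```python
import numpy as np
from scipy.linalg import hadamard  # only used for the sanity check

H2 = np.array([[1, 1], [1, -1]])

def hadamard_row(i, k):
    """Row i of the 2**k Sylvester Hadamard matrix, built without ever
    materializing the full matrix. Equivalent to H[i, j] = (-1)**popcount(i & j)."""
    row = np.array([1])
    for b in range(k):                     # bits of i, least significant first
        row = np.kron(H2[(i >> b) & 1], row)
    return row

assert np.array_equal(hadamard_row(5, 3), hadamard(8)[5])
```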

[CV-62] Neural Dynamics Model of Visual Decision-Making: Learning from Human Experts

链接: https://arxiv.org/abs/2409.02390
作者: Jie Su,Fang Cai,Shu-Kuo Zhao,Xin-Yi Wang,Tian-Yi Qian,Da-Hui Wang,Bo Hong
关键词-EN: conducting computational simulations, Uncovering the fundamental, developing mathematical models, fundamental neural correlates, developing mathematical
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Uncovering the fundamental neural correlates of biological intelligence, developing mathematical models, and conducting computational simulations are critical for advancing new paradigms in artificial intelligence (AI). In this study, we implemented a comprehensive visual decision-making model that spans from visual input to behavioral output, using a neural dynamics modeling approach. Drawing inspiration from the key components of the dorsal visual pathway in primates, our model not only aligns closely with human behavior but also reflects neural activities in primates, achieving accuracy comparable to convolutional neural networks (CNNs). Moreover, magnetic resonance imaging (MRI) identified key neuroimaging features such as structural connections and functional connectivity that are associated with performance in perceptual decision-making tasks. A neuroimaging-informed fine-tuning approach was introduced and applied to the model, leading to performance improvements that paralleled the behavioral variations observed among subjects. Compared to classical deep learning models, our model more accurately replicates the behavioral performance of biological intelligence, relying on the structural characteristics of biological neural networks rather than extensive training data, and demonstrating enhanced resilience to perturbation.

[CV-63] Multi-modal Situated Reasoning in 3D Scenes

链接: https://arxiv.org/abs/2409.02389
作者: Xiongkun Linghu,Jiangyong Huang,Xuesong Niu,Xiaojian Ma,Baoxiong Jia,Siyuan Huang
关键词-EN: embodied AI agents, Situated Question Answering, awareness is essential, Multi-modal Situated, situated
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Project page: this https URL

点击查看摘要

Abstract:Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide text, image, and point cloud for situation and question description, resolving ambiguity in previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models’ situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.

[CV-64] A Unified Framework with Consistency across Modalities for Human Activity Recognition BMVC2024

链接: https://arxiv.org/abs/2409.02385
作者: Tuyen Tran,Thao Minh Le,Hung Tran,Truyen Tran
关键词-EN: Recognizing human activities, Recognizing human, activities in videos, videos is challenging, spatio-temporal complexity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to BMVC 2024

点击查看摘要

Abstract:Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER (COMPositional hUman-cenTric quERy machine), a generic neural architecture that models the interactions between a human of interest and its surroundings in both space and time. Thanks to its versatile design, COMPUTER can be leveraged to distill distinctive representations for various input modalities. Additionally, we introduce a consistency loss that enforces agreement in prediction between modalities, exploiting the complementary information from multimodal inputs for robust human movement recognition. Through extensive experiments on action localization and group activity recognition tasks, our approach demonstrates superior performance when compared with state-of-the-art methods. Our code is available at: this https URL.

[CV-65] GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving

链接: https://arxiv.org/abs/2409.02382
作者: Huasong Han,Kaixuan Zhou,Xiaoxiao Long,Yusen Wang,Chunxia Xiao
关键词-EN: Generalizable Gaussian Splatting, Gaussian Splatting, Gaussian Splatting method, achieve realistic rendering, Autonomous Driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving which can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D Gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, and cannot handle large differences in viewpoint. Especially in autonomous driving scenarios, images are typically collected from a single lane. The limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduce a novel virtual lane generation module into the GGS method to enable high-quality lane switching even without a multi-lane dataset. Besides, we design a diffusion loss to supervise the generation of virtual lane images to further address the problem of lack of data in the virtual lanes. Finally, we also propose a depth refinement module to optimize depth estimation in the GGS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.

[CV-66] Coral Model Generation from Single Images for Virtual Reality Applications

链接: https://arxiv.org/abs/2409.02376
作者: Jie Fu(University of the Arts London, Creative Computing Institute, London, United Kingdom),Shun Fu(Bloks Technology Company, Shanghai, China),Mick Grierson(University of the Arts London, Creative Computing Institute, London, United Kingdom)
关键词-EN: rapid development, models, coral, Traditional methods, Traditional methods struggle
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*备注: In Proceedings of Explainable AI for the Arts Workshop 2024 (XAIxArts 2024) arXiv:2406.14485

点击查看摘要

Abstract:With the rapid development of VR technology, the demand for high-quality 3D models is increasing. Traditional methods struggle with efficiency and quality in large-scale customization. This paper introduces a deep-learning framework that generates high-precision 3D coral models from a single image. Using the Coral dataset, the framework extracts geometric and texture features, performs 3D reconstruction, and optimizes design and material blending. Advanced optimization and polygon count control ensure shape accuracy, detail retention, and flexible output for various complexities, catering to high-quality rendering and real-time interaction needs. The project incorporates Explainable AI (XAI) to transform AI-generated models into interactive “artworks,” best viewed in VR and XR. This enhances model interpretability and human-machine collaboration. Real-time feedback in VR interactions displays information like coral species and habitat, enriching user experience. The generated models surpass traditional methods in detail, visual quality, and efficiency. This research offers an intelligent approach to 3D content creation for VR, lowering production barriers, and promoting widespread VR applications. Additionally, integrating XAI provides new insights into AI-generated visual content and advances research in 3D vision interpretability.

[CV-67] Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing

链接: https://arxiv.org/abs/2409.02374
作者: Siyi Chen,Huijie Zhang,Minzhe Guo,Yifu Lu,Peng Wang,Qing Qu
关键词-EN: LOCO Edit, powerful class, class of generative, Recently, Edit
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The codes will be released at this https URL.
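
A heavily simplified sketch of the core computation as we read it: linearize the posterior mean predictor (PMP) at a noisy latent and take the top singular vectors of its Jacobian as local editing directions. The toy PMP and all names here are placeholders, not the released LOCO Edit code.

```python
import torch

def local_edit_directions(pmp, x_t, t, num_dirs=3):
    """Top right-singular vectors of the PMP Jacobian at (x_t, t);
    per the paper's observations these span a low-dimensional local
    semantic subspace usable for editing."""
    fn = lambda v: pmp(v.view_as(x_t), t).flatten()
    J = torch.autograd.functional.jacobian(fn, x_t.flatten())  # (out, in)
    _, _, Vh = torch.linalg.svd(J, full_matrices=False)
    return Vh[:num_dirs]            # editing directions in the latent space

toy_pmp = lambda x, t: torch.tanh(x) * (1.0 - t)  # stand-in for a denoiser
dirs = local_edit_directions(toy_pmp, torch.randn(8, 8), t=0.5)
```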

[CV-68] Unfolding Videos Dynamics via Taylor Expansion

链接: https://arxiv.org/abs/2409.02371
作者: Siyi Chen,Minkyu Choi,Zesen Zhao,Kuan Han,Qing Qu,Zhongming Liu
关键词-EN: Taking inspiration, Instance Discrimination, inspiration from physical, existing instance discrimination, instance discrimination frameworks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
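
The "views" ViDiDi pairs with the original clip are plain finite-difference temporal derivatives, which is easy to sketch. The tensor shape and the forward-difference choice below are our illustrative assumptions:

```python
import torch

def temporal_views(frames, order=2):
    """Return [clip, 1st derivative, 2nd derivative, ...] of a video.

    frames: (T, C, H, W). Each forward difference along time drops one
    frame; higher orders emphasize higher-order motion features.
    """
    views, cur = [frames], frames
    for _ in range(order):
        cur = cur[1:] - cur[:-1]       # forward difference along time
        views.append(cur)
    return views

clip = torch.randn(8, 3, 32, 32)
original, velocity, acceleration = temporal_views(clip)
```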

[CV-69] Pluralistic Salient Object Detection

链接: https://arxiv.org/abs/2409.02368
作者: Xuelu Feng,Yunsheng Li,Dongdong Chen,Chunming Qiao,Junsong Yuan,Lu Yuan,Gang Hua
关键词-EN: salient object detection, plausible salient segmentation, salient segmentation results, generating multiple plausible, multiple plausible salient
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects, this new setting recognizes the inherent complexity of real-world images, comprising multiple objects, and the ambiguity in defining salient objects due to different user intentions. To study this task, we present two new SOD datasets, “DUTS-MM” and “DUTS-MQ”, along with newly designed evaluation metrics. DUTS-MM builds upon the DUTS dataset but enriches the ground-truth mask annotations in three aspects: 1) it improves the mask quality, especially for boundary and fine-grained structures; 2) it alleviates the annotation inconsistency issue; and 3) it provides multiple ground-truth masks for images with saliency ambiguity. DUTS-MQ consists of approximately 100K image-mask pairs with human-annotated preference scores, enabling the learning of real human preferences in measuring mask quality. Building upon these two datasets, we propose a simple yet effective pluralistic SOD baseline based on a Mixture-of-Experts (MOE) design. Equipped with two prediction heads, it simultaneously predicts multiple masks using different query prompts and predicts human preference scores for each mask candidate. Extensive experiments and analyses underscore the significance of our proposed datasets and affirm the effectiveness of our PSOD framework.

[CV-70] Coaching a Robotic Sonographer: Learning Robotic Ultrasound with Sparse Experts Feedback

链接: https://arxiv.org/abs/2409.02337
作者: Deepak Raina,Mythra V. Balakuntala,Byung Wook Kim,Juan Wachs,Richard Voyles
关键词-EN: intervention and diagnosis, offering non-invasive, real-time imaging, widely employed, employed for clinical
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in IEEE Transactions on Medical Robotics and Bionics (TMRB) 2024

点击查看摘要

Abstract:Ultrasound is widely employed for clinical intervention and diagnosis, due to its advantages of offering non-invasive, radiation-free, and real-time imaging. However, the accessibility of this dexterous procedure is limited due to the substantial training and expertise required of operators. Robotic ultrasound (RUS) offers a viable solution to address this limitation; nonetheless, achieving human-level proficiency remains challenging. Learning from demonstrations (LfD) methods have been explored in RUS, which learn the policy prior from a dataset of offline demonstrations to encode the mental model of the expert sonographer. However, active engagement of experts, i.e. coaching, during the training of RUS has not been explored thus far. Coaching is known for enhancing efficiency and performance in human training. This paper proposes a coaching framework for RUS to amplify its performance. The framework combines DRL (self-supervised practice) with sparse expert feedback through coaching. The DRL employs an off-policy Soft Actor-Critic (SAC) network, with a reward based on image quality rating. The coaching by experts is modeled as a Partially Observable Markov Decision Process (POMDP), which updates the policy parameters based on corrections by the expert. The validation study on phantoms showed that coaching increases the learning rate by 25% and the number of high-quality image acquisitions by 74.5%.

[CV-71] What Do You See in Common? Learning Hierarchical Prototypes over Tree-of-Life to Discover Evolutionary Traits

链接: https://arxiv.org/abs/2409.02335
作者: Harish Babu Manogaran,M. Maruf,Arka Daw,Kazi Sajeed Mehrab,Caleb Patrick Charpentier,Josef C. Uyeda,Wasila Dahdul,Matthew J Thompson,Elizabeth G Campolongo,Kaiya L Provost,Paula M. Mabee,Hilmar Lapp,Anuj Karpatne
关键词-EN: discover evolutionary traits, discover evolutionary, evolutionary traits directly, evolutionary traits, phylogenetic tree
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 34 pages, 27 figures

点击查看摘要

Abstract:A grand challenge in biology is to discover evolutionary traits - features of organisms common to a group of species with a shared ancestor in the tree of life (also referred to as phylogenetic tree). With the growing availability of image repositories in biology, there is a tremendous opportunity to discover evolutionary traits directly from images in the form of a hierarchy of prototypes. However, current prototype-based methods are mostly designed to operate over a flat structure of classes and face several challenges in discovering hierarchical prototypes, including the issue of learning over-specific features at internal nodes. To overcome these challenges, we introduce the framework of Hierarchy aligned Commonality through Prototypical Networks (HComP-Net). We empirically show that HComP-Net learns prototypes that are accurate, semantically consistent, and generalizable to unseen species in comparison to baselines on birds, butterflies, and fishes datasets. The code and datasets are available at this https URL.

[CV-72] YoloTag: Vision-based Robust UAV Navigation with Fiducial Markers

链接: https://arxiv.org/abs/2409.02334
作者: Sourav Raxit,Simant Bahadur Singh,Abdullah Al Redwan Newaz
关键词-EN: Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, rapidly build precise, build precise maps
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:By harnessing fiducial markers as visual landmarks in the environment, Unmanned Aerial Vehicles (UAVs) can rapidly build precise maps and navigate spaces safely and efficiently, unlocking their potential for fluent collaboration and coexistence with humans. Existing fiducial marker methods rely on handcrafted feature extraction, which sacrifices accuracy. On the other hand, deep learning pipelines for marker detection fail to meet real-time runtime constraints crucial for navigation applications. In this work, we propose YoloTag, a real-time fiducial marker-based localization system. YoloTag uses a lightweight YOLO v8 object detector to accurately detect fiducial markers in images while meeting the runtime constraints needed for navigation. The detected markers are then used by an efficient perspective-n-point algorithm to estimate UAV states. However, this localization system introduces noise, causing instability in trajectory tracking. To suppress noise, we design a higher-order Butterworth filter that effectively eliminates noise through frequency domain analysis. We evaluate our algorithm through real-robot experiments in an indoor environment, comparing the trajectory tracking performance of our method against other approaches in terms of several distance metrics.
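
For the noise-suppression step, a higher-order Butterworth low-pass takes a few lines with SciPy. The order and cutoff below are placeholders (the paper designs its own filter via frequency-domain analysis), and `filtfilt` is the offline zero-phase variant; a causal `lfilter` would be used on a live state stream.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, cutoff, order = 30.0, 2.0, 4                  # assumed, not from the paper
b, a = butter(order, cutoff / (fs / 2), btype="low")

t = np.arange(0, 5, 1 / fs)
noisy_x = np.sin(t) + 0.1 * np.random.randn(t.size)  # jittery pose estimates
smooth_x = filtfilt(b, a, noisy_x)                    # zero-phase smoothing
```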

[CV-73] Visual Servoing for Robotic On-Orbit Servicing: A Survey

链接: https://arxiv.org/abs/2409.02324
作者: Lina María Amaya-Mejía,Mohamed Ghita,Jan Dentler,Miguel Olivares-Mendez,Carol Martinez
关键词-EN: On-orbit servicing, autonomous OOS operations, activities will power, OOS, big step
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
*备注: Accepted for publication at the 2024 International Conference on Space Robotics (iSpaRo)

点击查看摘要

Abstract:On-orbit servicing (OOS) activities will power the next big step for sustainable exploration and commercialization of space. Developing robotic capabilities for autonomous OOS operations is a priority for the space industry. Visual Servoing (VS) enables robots to achieve the precise manoeuvres needed for critical OOS missions by utilizing visual information for motion control. This article presents an overview of existing VS approaches for autonomous OOS operations with space manipulator systems (SMS). We divide the approaches according to their contribution to the typical phases of a robotic OOS mission: a) Recognition, b) Approach, and c) Contact. We also present a discussion on the reviewed VS approaches, identifying current trends. Finally, we highlight the challenges and areas for future research on VS techniques for robotic OOS.

[CV-74] Geometry-aware Feature Matching for Large-Scale Structure from Motion

链接: https://arxiv.org/abs/2409.02310
作者: Gonglin Chen,Jinsen Wu,Haiwei Chen,Wenbin Teng,Zhiyuan Gao,Andrew Feng,Rongjun Qin,Yajie Zhao
关键词-EN: Structure from Motion, crucial for Structure, Establishing consistent, multiple images, images is crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground with very sparse view overlap, pose an even greater challenge to the correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry cues in addition to color cues. This helps fill gaps when there is less overlap in large-scale scenarios. Our method formulates geometric verification as an optimization problem, guiding feature matching within detector-free methods and using sparse correspondences from detector-based methods as anchor points. By enforcing geometric constraints via the Sampson Distance, our approach ensures that the denser correspondences from detector-free methods are geometrically consistent and more accurate. This hybrid strategy significantly improves correspondence density and accuracy, mitigates multi-view inconsistencies, and leads to notable advancements in camera pose accuracy and point cloud density. It outperforms state-of-the-art feature matching methods on benchmark datasets and enables feature matching in challenging extreme large-scale settings.
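
The Sampson distance used for the geometric constraint is the standard first-order approximation of geometric error under a fundamental matrix F; a compact sketch (the variable names are ours):

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """d_i = (x2_i^T F x1_i)^2 / ((F x1)_1^2 + (F x1)_2^2 + (F^T x2)_1^2 + (F^T x2)_2^2)

    F: (3, 3) fundamental matrix; x1, x2: (N, 3) homogeneous image points.
    """
    Fx1 = x1 @ F.T                     # row i is F @ x1_i
    Ftx2 = x2 @ F                      # row i is F^T @ x2_i
    num = np.einsum("ij,ij->i", x2, Fx1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den

# Keep only correspondences geometrically consistent with F, e.g.:
# inliers = sampson_distance(F, x1, x2) < threshold
```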

[CV-75] Unsupervised Welding Defect Detection Using Audio And Video

链接: https://arxiv.org/abs/2409.02290
作者: Georg Stemmer,Jose A. Lopez,Juan A. Del Hoyo Ontiveros,Arvind Raju,Tara Thimmanaik,Sovan Biswas
关键词-EN: explore the application, robotic welding, welding, welding process, defects
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 21 pages

点击查看摘要

Abstract:In this work we explore the application of AI to robotic welding. Robotic welding is a widely used technology in many industries, but robots currently do not have the capability to detect welding defects, which get introduced due to various reasons in the welding process. We describe how deep-learning methods can be applied to detect weld defects in real-time by recording the welding process with microphones and a camera. Our findings are based on a large database of more than 4000 welding samples we collected, covering different weld types, materials and various defect categories. All deep learning models are trained in an unsupervised fashion because the space of possible defects is large and the defects in our data may contain biases. We demonstrate that reliable real-time detection of most categories of weld defects is feasible both from audio and video, with improvements achieved by combining both modalities. Specifically, the multi-modal approach achieves an average Area-under-ROC-Curve (AUC) of 0.92 over all eleven defect types in our data. We conclude the paper with an analysis of the results by defect type and a discussion of future work.

[CV-76] Biochemical Prostate Cancer Recurrence Prediction: Thinking Fast & Slow

链接: https://arxiv.org/abs/2409.02284
作者: Suhang You,Sanyukta Adap,Siddhesh Thakur,Bhakti Baheti,Spyridon Bakas
关键词-EN: patients after prostatectomy, prostate cancer, cancer is essential, essential for prognostic, prognostic monitoring
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures, methodology paper for LEOPRARD Challenge

点击查看摘要

Abstract:Time to biochemical recurrence in prostate cancer is essential for prognostic monitoring of the progression of patients after prostatectomy, which assesses the efficacy of the surgery. In this work, we proposed to leverage multiple instance learning through a two-stage "thinking fast & slow" strategy for time to recurrence (TTR) prediction. The first ("thinking fast") stage finds the most relevant WSI area for biochemical recurrence and the second ("thinking slow") stage leverages higher-resolution patches to predict TTR. Our approach achieves a mean C-index (Ci) of 0.733 (θ = 0.059) on our internal validation and Ci = 0.603 on the LEOPARD challenge validation set. Post-hoc attention visualization shows that the most attentive area contributes to the TTR prediction.
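
For readers unfamiliar with the reported metric: the concordance index counts how often the model ranks pairs of patients correctly by time-to-event. A simplified O(n^2) sketch with censoring (our illustration, not the challenge's evaluation code):

```python
def concordance_index(times, risks, events):
    """Harrell's C-index: fraction of comparable patient pairs whose
    predicted risks are ordered consistently with their observed times.

    times: observed times; risks: predicted risk scores (higher = earlier
    event expected); events: 1 if the event was observed, 0 if censored.
    """
    num = den = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue                     # censored cases cannot anchor pairs
        for j in range(len(times)):
            if times[j] > times[i]:      # j's event/censoring came later
                den += 1
                num += (risks[i] > risks[j]) + 0.5 * (risks[i] == risks[j])
    return num / den

# Example: ci = concordance_index([2, 5, 9], [0.9, 0.4, 0.1], [1, 1, 0])
```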

[CV-77] K-Origins: Better Colour Quantification for Neural Networks

链接: https://arxiv.org/abs/2409.02281
作者: Lewis Mason,Mark Martinez
关键词-EN: neural network layer, network layer designed, textbf, layer designed, improve image-based network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 16 pages, 13 figures, 1 table

点击查看摘要

Abstract:K-Origins is a neural network layer designed to improve image-based network performance when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features X by the equation Y_k = X - J·w_k for each trainable parameter w_k, where J is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic network performance.
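
A minimal PyTorch sketch of the layer equation above; how the k outputs are assembled (channel concatenation here) is our assumption, not spelled out in the abstract.

```python
import torch
import torch.nn as nn

class KOrigins(nn.Module):
    """Y_k = X - J * w_k: since J is all ones, each output is the input
    shifted by a learned scalar 'origin' w_k, exposing signed distances
    to reference intensities or colours."""

    def __init__(self, num_origins):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_origins))

    def forward(self, x):                  # x: (B, C, H, W)
        # One shifted copy per origin, concatenated along channels
        # (the concatenation is an assumption of this sketch).
        return torch.cat([x - w_k for w_k in self.w], dim=1)

out = KOrigins(num_origins=4)(torch.rand(2, 3, 16, 16))  # -> (2, 12, 16, 16)
```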

[CV-78] Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems

链接: https://arxiv.org/abs/2409.02278
作者: Sanjita Prajapati,Tanu Singh,Chinmay Hegde,Pranamesh Chakraborty
关键词-EN: shown great potential, diverse applications related, Recent developments, VLM models, VLM
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent developments in vision language models (VLM) have shown great potential for diverse applications related to image understanding. In this study, we have explored state-of-the-art VLM models for vision-based transportation engineering tasks such as image classification and object detection. The image classification task involves congestion detection and crack identification, whereas, for object detection, helmet violations were identified. We have applied open-source models such as CLIP, BLIP, OWL-ViT, Llava-Next, and the closed-source GPT-4o to evaluate the performance of these state-of-the-art VLM models to harness the capabilities of language understanding for vision-based transportation tasks. These tasks were performed by applying zero-shot prompting to the VLM models, as zero-shot prompting involves performing tasks without any training on those tasks. It eliminates the need for annotated datasets or fine-tuning for specific tasks. Though these models gave comparable results to benchmark Convolutional Neural Network (CNN) models in the image classification tasks, they still need improvement for object localization tasks. Therefore, this study provides a comprehensive evaluation of the state-of-the-art VLM models, highlighting the advantages and limitations of the models, which can be taken as the baseline for future improvement and wide-scale implementation.
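
Zero-shot prompting with an open-source VLM like CLIP amounts to scoring an image against natural-language class prompts. A minimal sketch with Hugging Face Transformers; the prompt wording and image path are hypothetical, not the study's exact setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a congested road", "a photo of a free-flowing road"]
image = Image.open("traffic_scene.jpg")     # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))  # no task-specific training needed
```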

[CV-79] ADHD diagnosis based on action characteristics recorded in videos using machine learning

链接: https://arxiv.org/abs/2409.02274
作者: Yichun Li,Syed Mohsen Naqvi,Rajesh Nair
关键词-EN: ADHD diagnosis, timely manner, treatment is increasing, increasing significantly, existing services
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Neuroscience Applied

点击查看摘要

Abstract:Demand for ADHD diagnosis and treatment is increasing significantly and the existing services are unable to meet the demand in a timely manner. In this work, we introduce a novel action recognition method for ADHD diagnosis by identifying and analysing raw video recordings. Our main contributions include 1) designing and implementing a test focusing on the attention and hyperactivity/impulsivity of participants, recorded through three cameras; 2) implementing a novel machine learning ADHD diagnosis system based on action recognition neural networks for the first time; 3) proposing classification criteria to provide diagnosis results and analysis of ADHD action characteristics.

[CV-80] Action-Based ADHD Diagnosis in Video

链接: https://arxiv.org/abs/2409.02261
作者: Yichun Li,Yuxing Yang,Syed Mohsen Naqvi
关键词-EN: Attention Deficit Hyperactivity, Deficit Hyperactivity Disorder, Attention Deficit, Hyperactivity Disorder, Deficit Hyperactivity
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 31st European Symposium on Artificial Neural Networks

点击查看摘要

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) causes significant impairment in various domains. Early diagnosis and treatment of ADHD could significantly improve quality of life and functioning. Recently, machine learning methods have improved the accuracy and efficiency of the ADHD diagnosis process. However, the cost of the equipment and trained staff required by the existing methods is generally high. Therefore, we introduce a video-based frame-level action recognition network to ADHD diagnosis for the first time. We also record a real multi-modal ADHD dataset and extract three action classes from the video modality for ADHD diagnosis. The whole-process data have been reported to CNTW-NHS Foundation Trust, to be reviewed by medical consultants/professionals and made public in due course.

[CV-81] Optimal L-Systems for Stochastic L-system Inference Problems

链接: https://arxiv.org/abs/2409.02259
作者: Ali Lotfi,Ian McQuillan
关键词-EN: optimal stochastic L-system, stochastic L-system capable, stochastic L-system, stochastic Lindenmayer-system, L-system
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-system with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, to ensure the production of an optimal stochastic L-system suitable for generating the given sequence. This allows stochastic L-systems to serve as models for machine learning, using only positive data for training.
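
For intuition on the single-derivation case: when the derivation is known, the likelihood-maximizing production probabilities reduce to relative usage frequencies. A toy sketch of that estimate (our illustration; the multiple-derivation case requires the interior-point optimization the paper describes):

```python
from collections import Counter, defaultdict

def mle_production_probs(derivation_steps):
    """derivation_steps: list of (symbol, replacement) productions applied in
    one known derivation. Returns P(symbol -> replacement) as frequencies."""
    counts = Counter(derivation_steps)
    totals = defaultdict(int)
    for (sym, _), c in counts.items():
        totals[sym] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

# 'A' rewritten to 'AB' twice and to 'B' once -> probabilities 2/3 and 1/3.
probs = mle_production_probs([("A", "AB"), ("A", "AB"), ("A", "B"), ("B", "A")])
```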

[CV-82] How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model?

链接: https://arxiv.org/abs/2409.02253
作者: Saeid Asgari Taghanaki,Joseph Lambourne,Alana Mongkhounsavath
关键词-EN: Large foundation models, Large foundation, optimizing multi-modal models, challenges remain, remain in optimizing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large foundation models have revolutionized the field, yet challenges remain in optimizing multi-modal models for specialized visual tasks. We propose a novel, generalizable methodology to identify preferred image distributions for black-box Vision-Language Models (VLMs) by measuring output consistency across varied input prompts. Applying this to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation codes at this https URL.

[CV-83] NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise

链接: https://arxiv.org/abs/2409.02251
作者: Abdullah Arafat Miah,Kaan Icer,Resit Sendag,Yu Bi
关键词-EN: deep learning development, Backdoor attacks pose, learning development, backdoor attack, pose a significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing trigger patterns, both visible and invisible, to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary's predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at this https URL.
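
To illustrate the trigger family: white Gaussian noise is fully characterized by its flat PSD level, so triggers at different PSDs differ only in variance. A toy poisoning sketch under a simple two-sided PSD convention (sigma^2 = PSD × sampling rate); the names and values are ours, not the paper's configuration:

```python
import numpy as np

def wgn_trigger(shape, psd, fs=1.0, seed=0):
    """White Gaussian noise whose flat power spectral density is `psd`."""
    sigma = np.sqrt(psd * fs)   # PSD level of sampled WGN is sigma^2 / fs
    return np.random.default_rng(seed).normal(0.0, sigma, size=shape)

# Two triggers at different PSDs blended into an image in [0, 1]; in the
# attack, each PSD would steer the backdoored model to a different target class.
image = np.random.rand(32, 32, 3)
poisoned_a = np.clip(image + wgn_trigger(image.shape, psd=0.01), 0.0, 1.0)
poisoned_b = np.clip(image + wgn_trigger(image.shape, psd=0.05), 0.0, 1.0)
```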

[CV-84] A Novel Audio-Visual Information Fusion System for Mental Disorders Detection

链接: https://arxiv.org/abs/2409.02243
作者: Yichun Li,Shuanglin Li,Syed Mohsen Naqvi
关键词-EN: global healthcare challenge, Mental disorders, Mental, healthcare challenge, foremost contributors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 27th International Conference on Information (FUSION)

点击查看摘要

Abstract:Mental disorders are among the foremost contributors to the global healthcare challenge. Research indicates that timely diagnosis and intervention are vital in treating various mental disorders. However, the early somatization symptoms of certain mental disorders may not be immediately evident, often resulting in their oversight and misdiagnosis. Additionally, traditional diagnosis methods incur high time and monetary costs. Deep learning methods based on fMRI and EEG have improved the efficiency of the mental disorder detection process. However, the cost of the equipment and trained staff is generally high. Moreover, most systems are only trained for a specific mental disorder and are not general-purpose. Recently, physiological studies have shown that there are some speech- and facial-related symptoms in a few mental disorders (e.g., depression and ADHD). In this paper, we focus on the emotional expression features of mental disorders and introduce a multimodal mental disorder diagnosis system based on audio-visual information input. Our proposed system is based on spatial-temporal attention networks and innovatively uses a less computationally intensive pre-trained audio recognition network to fine-tune the video recognition module for better results. We also apply the unified system to multiple mental disorders (ADHD and depression) for the first time. The proposed system achieves over 80% accuracy on the real multimodal ADHD dataset and achieves state-of-the-art results on the depression dataset AVEC 2014.

[CV-85] EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision

链接: https://arxiv.org/abs/2409.02224
作者: Yiming Zhao,Taein Kwon,Paul Streli,Marc Pollefeys,Christian Holz
关键词-EN: Virtual Reality, Augmented Reality, precise physical insights, Reality, applications in Augmented
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Estimating touch contact and pressure in egocentric vision is a central task for downstream applications in Augmented Reality, Virtual Reality, as well as many robotic applications, because it provides precise physical insights into hand-object interaction and object manipulation. However, existing contact pressure datasets lack egocentric views and hand poses, which are essential for accurate estimation during in-situ operation, both for AR/VR interaction and robotic manipulation. In this paper, we introduce EgoPressure, a novel dataset of touch contact and pressure interaction from an egocentric perspective, complemented with hand pose meshes and fine-grained pressure intensities for each contact. The hand poses in our dataset are optimized using our proposed multi-view sequence-based method that processes footage from our capture rig of 8 accurately calibrated RGBD cameras. EgoPressure comprises 5.0 hours of touch contact and pressure interaction from 21 participants captured by a moving egocentric camera and 7 stationary Kinect cameras, which provided RGB images and depth maps at 30 Hz. In addition, we provide baselines for estimating pressure with different modalities, which will enable future developments and benchmarking on the dataset. Overall, we demonstrate that pressure and hand poses are complementary, which supports our intention to better facilitate the physical understanding of hand-object interactions in AR/VR and robotics research.

[CV-86] Brain-Inspired Online Adaptation for Remote Sensing with Spiking Neural Network

Link: https://arxiv.org/abs/2409.02146
Authors: Dexin Duan,Peilin liu,Fei Wen
Keywords-EN: unmanned aerial vehicles, On-device computing, remote sensing, deep network-based perception, aerial vehicles
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*Note:

Click to view abstract

Abstract:On-device computing, or edge computing, is becoming increasingly important for remote sensing, particularly in applications like deep network-based perception on on-orbit satellites and unmanned aerial vehicles (UAVs). In these scenarios, two brain-like capabilities are crucial for remote sensing models: (1) high energy efficiency, allowing the model to operate on edge devices with limited computing resources, and (2) online adaptation, enabling the model to quickly adapt to environmental variations, weather changes, and sensor drift. This work addresses these needs by proposing an online adaptation framework based on spiking neural networks (SNNs) for remote sensing. Starting with a pretrained SNN model, we design an efficient, unsupervised online adaptation algorithm, which adopts an approximation of the BPTT algorithm and only involves forward-in-time computation that significantly reduces the computational complexity of SNN adaptation learning. Besides, we propose an adaptive activation scaling scheme to boost online SNN adaptation performance, particularly in low time-steps. Furthermore, for the more challenging remote sensing detection task, we propose a confidence-based instance weighting scheme, which substantially improves adaptation performance in the detection task. To our knowledge, this work is the first to address the online adaptation of SNNs. Extensive experiments on seven benchmark datasets across classification, segmentation, and detection tasks demonstrate that our proposed method significantly outperforms existing domain adaptation and domain generalization approaches under varying weather conditions. The proposed method enables energy-efficient and fast online adaptation on edge devices, and has much potential in applications such as remote perception on on-orbit satellites and UAV.

[CV-87] Self-Supervised Learning for Identifying Defects in Sewer Footage ICML2024

Link: https://arxiv.org/abs/2409.02140
Authors: Daniel Otero,Rafael Mateus
Keywords-EN: expensive modern investments, modern investments requiring, investments requiring time-intensive, requiring time-intensive manual, Sewerage infrastructure
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Note: Poster at the LatinX in AI Workshop @ ICML 2024

Click to view abstract

Abstract:Sewerage infrastructure is among the most expensive modern investments, requiring time-intensive manual inspections by qualified personnel. Our study addresses the need for automated solutions without relying on large amounts of labeled data. We propose a novel application of Self-Supervised Learning (SSL) for sewer inspection that offers a scalable and cost-effective solution for defect detection. We achieve competitive results with a model that is at least 5 times smaller than other approaches found in the literature, and we obtain comparable performance with only 10% of the available data when training a larger architecture. Our findings highlight the potential of SSL to revolutionize sewer maintenance in resource-limited settings.

[CV-88] Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

Link: https://arxiv.org/abs/2409.02134
Authors: Samer Francy,Raghubir Singh
Keywords-EN: image classification tasks, work evaluates, image classification, classification tasks, compression
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Note:

Click to view abstract

Abstract:This work evaluates compression techniques on ConvNeXt models in image classification tasks using the CIFAR-10 dataset. Structured pruning, unstructured pruning, and dynamic quantization methods are evaluated to reduce model size and computational complexity while maintaining accuracy. The experiments, conducted on cloud-based platforms and an edge device, assess the performance of these techniques. Results show significant reductions in model size, with up to 75% reduction achieved using structured pruning techniques. Additionally, dynamic quantization achieves a reduction of up to 95% in the number of parameters. Fine-tuned models exhibit improved compression performance, indicating the benefits of pre-training in conjunction with compression techniques. Unstructured pruning methods reveal trends in accuracy and compression, with limited reductions in computational complexity. The combination of OTOV3 pruning and dynamic quantization further enhances compression performance, resulting in an 89.7% reduction in size, a 95% reduction in the number of parameters and MACs, and a 3.8% increase in accuracy. The deployment of the final compressed model on an edge device demonstrates high accuracy (92.5%) and low inference time (20 ms), validating the effectiveness of compression techniques for real-world edge computing applications.
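
To make two of the evaluated techniques concrete, here is a minimal PyTorch sketch (our illustration, not the paper's code) that applies structured filter pruning and dynamic int8 quantization to a ConvNeXt model; the 50% pruning amount is an arbitrary example value.

```python
# A minimal sketch (not the paper's code) of structured pruning plus
# dynamic quantization, applied to a torchvision ConvNeXt model.
import torch
import torch.nn.utils.prune as prune
from torchvision.models import convnext_tiny

model = convnext_tiny(weights=None)  # pretrained weights omitted for brevity

# Structured pruning: zero out 50% of the filters of each conv layer,
# ranked by L2 norm (dim=0 prunes whole output channels).
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.5, n=2, dim=0)
        prune.remove(module, "weight")  # make the pruning permanent

# Dynamic quantization: replace Linear layers with int8 dynamic variants.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 224, 224)
print(quantized(x).shape)  # torch.Size([1, 1000])
```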

[CV-89] GenAgent: Build Collaborative AI Systems with Automated Workflow Generation – Case Studies on ComfyUI

Link: https://arxiv.org/abs/2409.01392
Authors: Xiangyuan Xue,Zeyu Lu,Di Huang,Wanli Ouyang,Lei Bai
Keywords-EN: developing monolithic models, previous AI research, research has focused, focused on developing, maximize their intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Note:

Click to view abstract

Abstract:Much previous AI research has focused on developing monolithic models to maximize their intelligence and capability, with the primary goal of enhancing performance on specific tasks. In contrast, this paper explores an alternative approach: collaborative AI systems that use workflows to integrate models, data sources, and pipelines to solve complex and diverse tasks. We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models. The core innovation of GenAgent lies in representing workflows with code, alongside constructing workflows with collaborative agents in a step-by-step manner. We implement GenAgent on the ComfyUI platform and propose a new benchmark, OpenComfy. The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations, showing its capability to generate complex workflows with superior effectiveness and stability.

[CV-90] CanvOI an Oncology Intelligence Foundation Model: Scaling FLOPS Differently

Link: https://arxiv.org/abs/2409.02885
Authors: Jonathan Zalach,Inbal Gazy,Assaf Avinoam,Ron Sinai,Eran Shmuel,Inbar Gilboa,Christine Swisher,Naim Matasci,Reva Basho,David B. Agus
Keywords-EN: involving rare conditions, rapidly evolving field, oncopathology faces significant, digital oncopathology faces, faces significant challenges
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Note: 12 pages, 5 figures

Click to view abstract

Abstract:The rapidly evolving field of digital oncopathology faces significant challenges, including the need to address diverse and complex clinical questions, often involving rare conditions, with limited availability of labeled data. These limitations hinder the development of robust AI-driven tools in the biomedical space, where accuracy in probabilistic determinations is of utmost importance. To address this, digital pathology foundation models have begun to emerge, typically developed with the size and diversity of the pre-training dataset and model parameters in mind. Here, we present CanvOI, a ViT-g/10-based foundation model designed to enhance the capabilities of digital pathology by addressing these challenges through a different approach. Considering the unique nature of oncologic histopathological images and the requirements from the embeddings to provide meaningful representations for Multiple Instance Learning (MIL) downstream models, we chose to modify the input image characteristics. By introducing larger tile sizes (380 x 380 pixels) and smaller patch sizes (10 x 10 pixels), we were able to optimize the model’s performance, pushing computational resources in a new direction and achieving state-of-the-art performance on cancer-related benchmarks. CanvOI demonstrated a 1.5-7.4% improvement in averaged AUC compared to other leading foundation models built for digital pathology. Moreover, our results demonstrate that CanvOI significantly outperformed the other models, with the performance gap widening substantially when trained on just 10% of the initial cohort. This work highlights an alternative approach that, if integrated with traditional development approaches, has the potential to advance Oncology Intelligence (OI), overcome some of the current barriers and ultimately improve the clinical outcome of cancer patients.

[CV-91] Automatic facial axes standardization of 3D fetal ultrasound images

Link: https://arxiv.org/abs/2409.02826
Authors: Antonia Alomar,Ricardo Rubio,Laura Salort,Gerard Albaiges,Antoni Payà,Gemma Piella,Federico Sukno
Keywords-EN: early developmental disturbances, genetic syndromes, developmental disturbances, fetal facial, facial
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Note:

Click to view abstract

Abstract:Craniofacial anomalies indicate early developmental disturbances and are usually linked to many genetic syndromes. Early diagnosis is critical, yet ultrasound (US) examinations often fail to identify these features. This study presents an AI-driven tool to assist clinicians in standardizing fetal facial axes/planes in 3D US, reducing sonographer workload and facilitating the facial evaluation. Our network, structured into three blocks (feature extractor, rotation and translation regression, and spatial transformer), processes three orthogonal 2D slices to estimate the transformations needed to standardize the facial planes in the 3D US. These transformations are applied to the original 3D US using a differentiable module (the spatial transformer block), yielding a standardized 3D US and the corresponding 2D facial standard planes. The dataset used consists of 1180 fetal facial 3D US images acquired between weeks 20 and 35 of gestation. Results show that our network considerably reduces inter-observer rotation variability in the test set, with a mean geodesic angle difference of 14.12° ± 18.27° and a Euclidean angle error of 7.45° ± 14.88°. These findings demonstrate the network's ability to effectively standardize facial axes, crucial for consistent fetal facial assessments. In conclusion, the proposed network demonstrates potential for improving the consistency and accuracy of fetal facial assessments in clinical settings, facilitating early evaluation of craniofacial anomalies.
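
For readers unfamiliar with the reported metric, the sketch below (our illustration, not the authors' code) computes the geodesic angle between two rotation matrices, the quantity used above to measure inter-observer rotation variability.

```python
import numpy as np

def geodesic_angle_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Angle of the relative rotation R1 @ R2.T, in degrees."""
    R = R1 @ R2.T
    # Clip for numerical safety: the trace can drift slightly out of range.
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0],
                     [np.sin(t),  np.cos(t), 0],
                     [0, 0, 1]])

# Two rotations about the z-axis differing by 15 degrees.
print(geodesic_angle_deg(rot_z(30), rot_z(15)))  # ~15.0
```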

[CV-92] Validation of musculoskeletal segmentation model with uncertainty estimation for bone and muscle assessment in hip-to-knee clinical CT images

Link: https://arxiv.org/abs/2409.02770
Authors: Mazen Soufi,Yoshito Otake,Makoto Iwasa,Keisuke Uemura,Tomoki Hakotani,Masahiro Hashimoto,Yoshitake Yamada,Minoru Yamada,Yoichi Yokoyama,Masahiro Jinzaki,Suzushi Kusano,Masaki Takao,Seiji Okada,Nobuhiko Sugano,Yoshinobu Sato
Keywords-EN: Deep learning-based image, fully automated, analysis of musculoskeletal, rapid analysis, learning-based image segmentation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Note: 29 pages, 7+10supp figures, 8 tables

Click to view abstract

Abstract:Deep learning-based image segmentation has allowed for the fully automated, accurate, and rapid analysis of musculoskeletal (MSK) structures from medical images. However, current approaches were either applied only to 2D cross-sectional images, addressed few structures, or were validated on small datasets, which limits the application in large-scale databases. This study aimed to validate an improved deep learning model for volumetric MSK segmentation of the hip and thigh with uncertainty estimation from clinical computed tomography (CT) images. Databases of CT images from multiple manufacturers/scanners, disease statuses, and patient positionings were used. The segmentation accuracy, and the accuracy in estimating the structures' volume and density (i.e., mean HU), were evaluated. An approach for segmentation failure detection based on predictive uncertainty was also investigated. The model showed an overall improvement with respect to all segmentation accuracy and structure volume/density evaluation metrics. The predictive uncertainty yielded large areas under the receiver operating characteristic (ROC) curves (AUROCs = 0.95) in detecting inaccurate and failed segmentations. The high segmentation and muscle volume/density estimation accuracy, along with the high accuracy in failure detection based on the predictive uncertainty, exhibited the model's reliability for analyzing individual MSK structures in large-scale CT databases.
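
The failure-detection evaluation reduces to scoring each case by its predictive uncertainty and computing an AUROC against pass/fail labels. A toy sketch with made-up numbers:

```python
# Hedged sketch of the evaluation idea: the uncertainty values below are
# invented for illustration, not taken from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = segmentation judged inaccurate/failed, 0 = acceptable
failed = np.array([0, 0, 0, 1, 0, 1, 0, 1])
# Mean per-structure predictive uncertainty (e.g., from MC dropout)
uncertainty = np.array([0.02, 0.05, 0.03, 0.40, 0.04, 0.35, 0.06, 0.50])

print("AUROC:", roc_auc_score(failed, uncertainty))  # 1.0 on this toy data
```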

[CV-93] Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening

Link: https://arxiv.org/abs/2409.02675
Authors: Ivan Pereira-Sánchez,Eloi Sans,Julia Navarro,Joan Duran
Keywords-EN: high-resolution panchromatic, low-resolution multispectral, objective of pansharpening, pansharpening and hypersharpening, PAN image
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Note:

Click to view abstract

Abstract:The objective of pansharpening and hypersharpening is to accurately combine a high-resolution panchromatic (PAN) image with a low-resolution multispectral (MS) or hyperspectral (HS) image, respectively. Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches. These techniques involve unrolling the steps of the optimization scheme derived from the minimization of an energy into a deep learning framework, resulting in efficient and highly interpretable architectures. In this paper, we propose a model-based deep unfolded method for satellite image fusion. Our approach is based on a variational formulation that incorporates the classic observation model for MS/HS data, a high-frequency injection constraint based on the PAN image, and an arbitrary convex prior. For the unfolding stage, we introduce upsampling and downsampling layers that use geometric information encoded in the PAN image through residual networks. The backbone of our method is a multi-head attention residual network (MARNet), which replaces the proximity operator in the optimization scheme and combines multiple head attentions with residual learning to exploit image self-similarities via nonlocal operators defined in terms of patches. Additionally, we incorporate a post-processing module based on the MARNet architecture to further enhance the quality of the fused images. Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method and its ability to generalize across different sensor configurations and varying spatial and spectral resolutions. The source code will be available at this https URL.

[CV-94] Creating a Microstructure Latent Space with Rich Material Information for Multiphase Alloy Design

Link: https://arxiv.org/abs/2409.02648
Authors: Xudong Ma,Yuqi Zhang,Chenchong Wang,Ming Wang,Mingxin Huang,Wei Xu
Keywords-EN: intricate microstructure serves, alloy design, latent space, multiphase alloy design, Traditional alloy design
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
*Note:

Click to view abstract

Abstract:The intricate microstructure serves as the cornerstone for the composition/processing-structure-property (CPSP) connection in multiphase alloys. Traditional alloy design methods often overlook microstructural details, which diminishes the reliability and effectiveness of the outcomes. This study introduces an improved alloy design algorithm that integrates authentic microstructural information to establish precise CPSP relationships. The approach utilizes a deep-learning framework based on a variational autoencoder to map real microstructural data to a latent space, enabling the prediction of composition, processing steps, and material properties from the latent space vector. By integrating this deep learning model with a specific sampling strategy in the latent space, a novel, microstructure-centered algorithm for multiphase alloy design is developed. This algorithm is demonstrated through the design of a unified dual-phase steel, and the results are assessed at three performance levels. Moreover, an exploration into the latent vector space of the model highlights its seamless interpolation ability and its rich material information content. Notably, the current configuration of the latent space is particularly advantageous for alloy design, offering an exhaustive representation of microstructure, composition, processing, and property variations essential for multiphase alloys.

[CV-95] A Learnable Color Correction Matrix for RAW Reconstruction BMVC2024

Link: https://arxiv.org/abs/2409.02497
Authors: Anqi Liu,Shiyi Mu,Shugong Xu
Keywords-EN: human visual system, Autonomous driving algorithms, model input due, employ sRGB images, Autonomous driving
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Note: Accepted by BMVC2024

Click to view abstract

Abstract:Autonomous driving algorithms usually employ sRGB images as model input due to their compatibility with the human visual system. However, visually pleasing sRGB images are possibly sub-optimal for downstream tasks when compared to RAW images. The availability of RAW images is constrained by the difficulties in collecting real-world driving data and the associated challenges of annotation. To address this limitation and support research in RAW-domain driving perception, we design a novel and ultra-lightweight RAW reconstruction method. The proposed model introduces a learnable color correction matrix (CCM), which uses only a single convolutional layer to approximate the complex inverse image signal processor (ISP). Experimental results demonstrate that simulated RAW (simRAW) images generated by our method provide performance improvements equivalent to those produced by more complex inverse ISP methods when pretraining RAW-domain object detectors, which highlights the effectiveness and practicality of our approach.
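
A minimal sketch of the core idea, assuming the learnable CCM can be realized as a single 1x1 convolution from sRGB to simulated RAW (the class name and training details here are our illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class LearnableCCM(nn.Module):
    """A 3x3 color matrix + bias, applied per pixel via a 1x1 convolution."""
    def __init__(self):
        super().__init__()
        self.ccm = nn.Conv2d(3, 3, kernel_size=1, bias=True)

    def forward(self, srgb: torch.Tensor) -> torch.Tensor:
        return self.ccm(srgb)

model = LearnableCCM()
srgb = torch.rand(4, 3, 256, 256)   # a batch of sRGB images in [0, 1]
sim_raw = model(srgb)               # simulated RAW, trainable with e.g. an L1 loss
print(sim_raw.shape)                # torch.Size([4, 3, 256, 256])
```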

[CV-96] FrameCorr: Adaptive Autoencoder-based Neural Compression for Video Reconstruction in Resource and Timing Constrained Network Settings

Link: https://arxiv.org/abs/2409.02453
Authors: John Li,Shehab Sarar Ahmed,Deepak Nair
Keywords-EN: poses challenges due, Internet of Things, nearby servers poses, servers poses challenges, varying timing constraints
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Multimedia (cs.MM)
*Note:

Click to view abstract

Abstract:Despite the growing adoption of video processing via Internet of Things (IoT) devices due to their cost-effectiveness, transmitting captured data to nearby servers poses challenges due to varying timing constraints and scarcity of network bandwidth. Existing video compression methods face difficulties in recovering compressed data when incomplete data is provided. Here, we introduce FrameCorr, a deep-learning based solution that utilizes previously received data to predict the missing segments of a frame, enabling the reconstruction of a frame from partially received data.

[CV-97] QID2: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI2024

Link: https://arxiv.org/abs/2409.02309
Authors: Zijian Chen,Jueqi Wang,Archana Venkataraman
Keywords-EN: angular resolution DWI, low angular resolution, angular resolution, angular resolution acquisition, resolution DWI data
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Note: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work

Click to view abstract

Abstract:We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID^2, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID^2 on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID^2 to reconstruct the missing high angular resolution samples. We compare QID^2 with two state-of-the-art GAN models. Our results demonstrate that QID^2 not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID^2 in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.

[CV-98] What makes a face looks like a hat: Decoupling low-level and high-level Visual Properties with Image Triplets ECCV2024

Link: https://arxiv.org/abs/2409.02241
Authors: Maytus Piriyajitakonkij,Sirawaj Itthipuripat,Ian Ballard,Ioannis Pappas
Keywords-EN: visual decision making, low-level features, decision making, strong influence, high neural predictivity
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
*Note: Accepted at Workshop on Human-inspired Computer Vision @ ECCV2024

Click to view abstract

Abstract:In visual decision making, high-level features, such as object categories, have a strong influence on choice. However, the impact of low-level features on behavior is less understood partly due to the high correlation between high- and low-level features in the stimuli presented (e.g., objects of the same category are more likely to share low-level features). To disentangle these effects, we propose a method that de-correlates low- and high-level visual properties in a novel set of stimuli. Our method uses two Convolutional Neural Networks (CNNs) as candidate models of the ventral visual stream: the CORnet-S that has high neural predictivity in high-level, IT-like responses and the VGG-16 that has high neural predictivity in low-level responses. Triplets (root, image1, image2) of stimuli are parametrized by the level of low- and high-level similarity of images extracted from the different layers. These stimuli are then used in a decision-making task where participants are tasked to choose the most similar-to-the-root image. We found that different networks show differing abilities to predict the effects of low-versus-high-level similarity: while CORnet-S outperforms VGG-16 in explaining human choices based on high-level similarity, VGG-16 outperforms CORnet-S in explaining human choices based on low-level similarity. Using Brain-Score, we observed that the behavioral prediction abilities of different layers of these networks qualitatively corresponded to their ability to explain neural activity at different levels of the visual hierarchy. In summary, our algorithm for stimulus set generation enables the study of how different representations in the visual stream affect high-level cognitive behaviors.
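
As a rough illustration of how low- versus high-level similarity can be measured with such networks (our sketch, not the authors' pipeline), one can compare activations from an early and a late layer of VGG-16:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from torchvision.models.feature_extraction import create_feature_extractor

# weights=None keeps the sketch download-free; use pretrained weights in practice.
model = vgg16(weights=None).eval()
extractor = create_feature_extractor(
    model, return_nodes={"features.4": "low", "features.28": "high"}
)

def layer_similarity(img_a, img_b):
    """Cosine similarity of early- and late-layer activations for two images."""
    with torch.no_grad():
        fa, fb = extractor(img_a), extractor(img_b)
    return {k: F.cosine_similarity(fa[k].flatten(1), fb[k].flatten(1)).item()
            for k in fa}

a, b = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
print(layer_similarity(a, b))  # e.g. {'low': ..., 'high': ...}
```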

Machine Learning

[LG-0] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling

Link: https://arxiv.org/abs/2409.02908
Authors: Kaiwen Zheng,Yongxin Chen,Hanzi Mao,Ming-Yu Liu,Jun Zhu,Qinsheng Zhang
Keywords-EN: language modeling tasks, popular research topic, discrete diffusion models, Masked diffusion models, diffusion models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Note: 40 pages

Click to view abstract

Abstract:Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs’ original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20× speedup. In addition, our investigation challenges previous claims that MDMs can surpass ARMs in generative perplexity. We identify, for the first time, an underlying numerical issue, even with the 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that the numerical issue lowers the effective temperature both theoretically and empirically, leading to unfair assessments of MDMs’ generation results in the previous literature.
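
The numerical issue described above stems from categorical sampling at limited precision. The toy sketch below (our illustration, not the paper's sampler) shows Gumbel-max categorical sampling and how a float64 uniform draw very close to 1 collapses to exactly 1.0 in float32, which corrupts the Gumbel noise:

```python
import numpy as np

def gumbel_categorical(logits, dtype, n=200_000, seed=0):
    """Sample n category draws via the Gumbel-max trick at a given precision."""
    rng = np.random.default_rng(seed)
    u = rng.random((n, logits.size)).astype(dtype)
    gumbel = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    counts = np.bincount(np.argmax(logits + gumbel, axis=1),
                         minlength=logits.size)
    return counts / n

logits = np.array([0.0, 1.0, 2.0])
print("float32:", gumbel_categorical(logits, np.float32))
print("float64:", gumbel_categorical(logits, np.float64))
print("softmax:", np.exp(logits) / np.exp(logits).sum())

# The precision hazard: a uniform draw this close to 1 rounds to exactly 1.0
# in float32, sending -log(-log(u)) to infinity and distorting the samples.
print(np.float32(1 - 1e-9) == np.float32(1.0))  # True
```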

[LG-1] Topological Methods in Machine Learning: A Tutorial for Practitioners

Link: https://arxiv.org/abs/2409.02901
Authors: Baris Coskunuzer,Cüneyt Gürcan Akçora
Keywords-EN: complex data structures, analyze complex data, traditional machine learning, machine learning methods, Topological Machine Learning
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG); Algebraic Topology (math.AT)
*Note: 54 pages, 35 figures

Click to view abstract

Abstract:Topological Machine Learning (TML) is an emerging field that leverages techniques from algebraic topology to analyze complex data structures in ways that traditional machine learning methods may not capture. This tutorial provides a comprehensive introduction to two key TML techniques, persistent homology and the Mapper algorithm, with an emphasis on practical applications. Persistent homology captures multi-scale topological features such as clusters, loops, and voids, while the Mapper algorithm creates an interpretable graph summarizing high-dimensional data. To enhance accessibility, we adopt a data-centric approach, enabling readers to gain hands-on experience applying these techniques to relevant tasks. We provide step-by-step explanations, implementations, hands-on examples, and case studies to demonstrate how these tools can be applied to real-world problems. The goal is to equip researchers and practitioners with the knowledge and resources to incorporate TML into their work, revealing insights often hidden from conventional machine learning methods. The tutorial code is available at this https URL
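
In the tutorial's hands-on spirit, here is a short sketch (ours, assuming the ripser package is installed via `pip install ripser`) that computes persistent homology of a noisy circle and recovers its single prominent H1 loop:

```python
import numpy as np
from ripser import ripser

# Sample 200 noisy points on the unit circle.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))

diagrams = ripser(points)["dgms"]   # [H0 diagram, H1 diagram]
h1 = diagrams[1]
lifetimes = h1[:, 1] - h1[:, 0]
# The circle's loop shows up as one H1 feature with a much longer lifetime.
print("most persistent H1 lifetime:", lifetimes.max())
```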

[LG-2] Benchmarking Spurious Bias in Few-Shot Image Classifiers ECCV2024

Link: https://arxiv.org/abs/2409.02882
Authors: Guangtao Zheng,Wenqian Ye,Aidong Zhang
Keywords-EN: spurious bias, spurious, few-shot classifiers, Few-shot, bias
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Note: Accepted to ECCV 2024

Click to view abstract

Abstract:Few-shot image classifiers are designed to recognize and classify new data with minimal supervision and limited data but often show reliance on spurious correlations between classes and spurious attributes, known as spurious bias. Spurious correlations commonly hold in certain samples and few-shot classifiers can suffer from spurious bias induced from them. There is an absence of an automatic benchmarking system to assess the robustness of few-shot classifiers against spurious bias. In this paper, we propose a systematic and rigorous benchmark framework, termed FewSTAB, to fairly demonstrate and quantify varied degrees of robustness of few-shot classifiers to spurious bias. FewSTAB creates few-shot evaluation tasks with biased attributes so that using them for predictions can demonstrate poor performance. To construct these tasks, we propose attribute-based sample selection strategies based on a pre-trained vision-language model, eliminating the need for manual dataset curation. This allows FewSTAB to automatically benchmark spurious bias using any existing test data. FewSTAB offers evaluation results in a new dimension along with a new design guideline for building robust classifiers. Moreover, it can benchmark spurious bias in varied degrees and enable designs for varied degrees of robustness. Its effectiveness is demonstrated through experiments on ten few-shot learning methods across three datasets. We hope our framework can inspire new designs of robust few-shot classifiers. Our code is available at this https URL.

[LG-3] Configurable Foundation Models: Building LLMs from a Modular Perspective

Link: https://arxiv.org/abs/2409.02877
Authors: Chaojun Xiao,Zhengyan Zhang,Chenyang Song,Dazhi Jiang,Feng Yao,Xu Han,Xiaozhi Wang,Shuo Wang,Yufei Huang,Guanyu Lin,Yingfa Chen,Weilin Zhao,Yuge Tu,Zexuan Zhong,Ao Zhang,Chenglei Si,Khai Hao Moo,Chenyang Zhao,Huimin Chen,Yankai Lin,Zhiyuan Liu,Jingbo Shang,Maosong Sun
Keywords-EN: abilities increasingly cumbersome, recently unveiled challenges, unveiled challenges tied, continual scalability due, limited computation resources
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.

[LG-4] Hybrid Imitation-Learning Motion Planner for Urban Driving

Link: https://arxiv.org/abs/2409.02871
Authors: Cristian Gariboldi,Matteo Corno,Beng Jin
Keywords-EN: open source datasets, nuPlan and Argoverse, release of open, open source, source datasets
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:With the release of open source datasets such as nuPlan and Argoverse, research on learning-based planners has expanded considerably in recent years. Existing systems have shown excellent capabilities in imitating human driver behaviour, but they struggle to guarantee safe closed-loop driving. Conversely, optimization-based planners offer greater security in short-term planning scenarios. To confront this challenge, in this paper we propose a novel hybrid motion planner that integrates both learning-based and optimization-based techniques. Initially, a multilayer perceptron (MLP) generates a human-like trajectory, which is then refined by an optimization-based component. This component not only minimizes tracking errors but also computes a trajectory that is both kinematically feasible and collision-free with obstacles and road boundaries. Our model effectively balances safety and human-likeness, mitigating the trade-off inherent in these objectives. We validate our approach through simulation experiments and further demonstrate its efficacy by deploying it in real-world self-driving vehicles.

[LG-5] Look Into the LITE in Deep Learning for Time Series Classification

Link: https://arxiv.org/abs/2409.02869
Authors: Ali Ismail-Fawaz,Maxime Devanne,Stefano Berretti,Jonathan Weber,Germain Forestier
Keywords-EN: Deep learning models, Deep learning, multivariate time series, Time Series, powerful solution
Categories: Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Deep learning models have been shown to be a powerful solution for Time Series Classification (TSC). State-of-the-art architectures, while producing promising results on the UCR and the UEA archives, present a high number of trainable parameters. This can lead to long training with high CO2 emissions and power consumption, and a possible increase in the number of FLoating-point Operations Per Second (FLOPS). In this paper, we present a new architecture for TSC, the Light Inception with boosTing tEchnique (LITE), with only 2.34% of the number of parameters of the state-of-the-art InceptionTime model, while preserving performance. This architecture, with only 9,814 trainable parameters due to the usage of DepthWise Separable Convolutions (DWSC), is boosted by three techniques: multiplexing, custom filters, and dilated convolution. The LITE architecture, trained on the UCR, is 2.78 times faster than InceptionTime and consumes 2.79 times less CO2 and power. To evaluate the performance of the proposed architecture on multivariate time series data, we adapt LITE to handle multivariate time series and call this version LITEMV. To bring theory into application, we also conducted experiments using LITEMV on multivariate time series representing human rehabilitation movements, showing that LITEMV not only is the most efficient model but also the best performing for this application on the Kimore dataset, a skeleton-based human rehabilitation exercises dataset. Moreover, to address the interpretability of LITEMV, we present a study using Class Activation Maps to understand the classification decisions taken by the model during evaluation.
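
A minimal PyTorch rendition (ours; the channel and kernel sizes are illustrative) of the DepthWise Separable Convolution block that gives LITE its small parameter count:

```python
import torch
import torch.nn as nn

class DWSConv1d(nn.Module):
    def __init__(self, channels, out_channels, kernel_size):
        super().__init__()
        # Depthwise: one filter per input channel (groups=channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding="same", groups=channels)
        # Pointwise: a 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv1d(channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

block = DWSConv1d(channels=32, out_channels=64, kernel_size=9)
x = torch.randn(8, 32, 500)                        # batch of series embeddings
print(block(x).shape)                              # torch.Size([8, 64, 500])
print(sum(p.numel() for p in block.parameters()))  # far fewer than a full conv
```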

[LG-6] Building a Scalable Effective and Steerable Search and Ranking Platform

Link: https://arxiv.org/abs/2409.02856
Authors: Marjan Celikik,Jacek Wasilewski,Ana Peleteiro Ramallo,Alexey Kurennoy,Evgeny Labzin,Danilo Ascione,Tural Gurbanov,Géraud Le Falher,Andrii Dzhoha,Ian Harris
Keywords-EN: vast product selections, current session intent, offer vast product, platforms offer vast, product selections
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have near real-time scalable and adaptable personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches, which overlook near real-time and heterogeneous customer signals, which results in a less personalized and relevant experience. Moreover, related customer experiences are served by completely different systems, which increases complexity, maintenance, and inconsistent experiences. In this paper, we present a personalized, adaptable near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers which can learn complex behavior patterns directly from customer action sequences while being able to incorporate temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority when compared to existing systems, both in terms of customer experience as well as in net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.

[LG-7] Oops I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

Link: https://arxiv.org/abs/2409.02850
Authors: Raphael Lafargue,Luke Smith,Franck Vermet,Mathias Löwe,Ian Reid,Vincent Gripon,Jack Valmadre
Keywords-EN: few-shot learning, computing confidence intervals, predominant method, computing confidence, multiple tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Note:

Click to view abstract

Abstract:The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e., allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at this https URL
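
A toy numpy sketch of the phenomenon (our illustration, not the paper's experiment): with replacement one can keep drawing tasks from the same fixed pool, shrinking the reported CI without any new data, while disjoint tasks are bounded by the pool size:

```python
import numpy as np

rng = np.random.default_rng(0)
pool = rng.normal(loc=0.8, scale=0.1, size=1000)   # per-sample "accuracies"

def ci_half_width(task_means):
    """Normal-approximation 95% CI half-width over task means."""
    return 1.96 * task_means.std(ddof=1) / np.sqrt(len(task_means))

# 100 tasks of 25 samples, with replacement: samples repeat across tasks,
# and nothing stops us from drawing even more tasks to shrink the CI.
with_rep = np.array([rng.choice(pool, 25, replace=True).mean()
                     for _ in range(100)])
# 40 disjoint tasks of 25 samples: limited by the size of the pool.
without_rep = rng.permutation(pool)[: 40 * 25].reshape(40, 25).mean(axis=1)

print("CI half-width, with replacement   :", ci_half_width(with_rep))
print("CI half-width, without replacement:", ci_half_width(without_rep))
```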

[LG-8] SNNAX – Spiking Neural Networks in JAX

Link: https://arxiv.org/abs/2409.02842
Authors: Jamie Lohoff,Jan Finkbeiner,Emre Neftci
Keywords-EN: Spiking Neural Networks, Spiking Neural, Neural Networks, prototype biologically inspired, biologically inspired models
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Spiking Neural Network (SNN) simulators are essential tools to prototype biologically inspired models and neuromorphic hardware architectures and predict their performance. For such a tool, ease of use and flexibility are critical, but so is simulation speed, especially given the complexity inherent in simulating SNNs. Here, we present SNNAX, a JAX-based framework for simulating and training such models with PyTorch-like intuitiveness and JAX-like execution speed. SNNAX models are easily extended and customized to fit the desired model specifications and target neuromorphic hardware. Additionally, SNNAX offers key features for optimizing the training and deployment of SNNs such as flexible automatic differentiation and just-in-time compilation. We evaluate and compare SNNAX to other commonly used machine learning (ML) frameworks used for programming SNNs. We provide key performance metrics, best practices, documented examples for simulating SNNs in SNNAX, and implement several benchmarks used in the literature.

[LG-9] Exploring Sentiment Dynamics and Predictive Behaviors in Cryptocurrency Discussions by Few-Shot Learning with Large Language Models

Link: https://arxiv.org/abs/2409.02836
Authors: Moein Shahiki Tash,Zahra Ahani,Mohim Tash,Olga Kolesnikova,Grigori Sidorov
Keywords-EN: leveraging advanced natural, language processing techniques, Regret Detection behaviors, advanced natural language, natural language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:This study performs analysis of Predictive statements, Hope speech, and Regret Detection behaviors within cryptocurrency-related discussions, leveraging advanced natural language processing techniques. We introduce a novel classification scheme named “Prediction statements,” categorizing comments into Predictive Incremental, Predictive Decremental, Predictive Neutral, or Non-Predictive categories. Employing GPT-4o, a cutting-edge large language model, we explore sentiment dynamics across five prominent cryptocurrencies: Cardano, Binance, Matic, Fantom, and Ripple. Our analysis reveals distinct patterns in predictive sentiments, with Matic demonstrating a notably higher propensity for optimistic predictions. Additionally, we investigate hope and regret sentiments, uncovering nuanced interplay between these emotions and predictive behaviors. Despite encountering limitations related to data volume and resource availability, our study reports valuable discoveries concerning investor behavior and sentiment trends within the cryptocurrency market, informing strategic decision-making and future research endeavors.

[LG-10] Obsidian: Cooperative State-Space Exploration for Performant Inference on Secure ML Accelerators

Link: https://arxiv.org/abs/2409.02817
Authors: Sarbartha Banerjee,Shijia Wei,Prakash Ramrakhyani,Mohit Tiwari
Keywords-EN: Trusted execution environments, Trusted execution, model, Trusted, state space
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Trusted execution environments (TEEs) for machine learning accelerators are indispensable in secure and efficient ML inference. Optimizing workloads through state-space exploration for the accelerator architectures improves performance and energy consumption. However, such explorations are expensive and slow due to the large search space. Current research has to use fast analytical models that forego critical hardware details and cross-layer opportunities unique to the hardware security primitives. While cycle-accurate models can theoretically reach better designs, their high runtime cost restricts them to a smaller state space. We present Obsidian, an optimization framework for finding the optimal mapping from ML kernels to a secure ML accelerator. Obsidian addresses the above challenge by exploring the state space using analytical and cycle-accurate models cooperatively. The two main exploration components include: (1) A secure accelerator analytical model, that includes the effect of secure hardware while traversing the large mapping state space and produces the best m model mappings; (2) A compiler profiling step on a cycle-accurate model, that captures runtime bottlenecks to further improve execution runtime, energy and resource utilization and find the optimal model mapping. We compare our results to a baseline secure accelerator, comprising the state-of-the-art security schemes obtained from guardnn [33] and sesame [11]. The analytical model reduces the inference latency by 20.5% for a cloud and 8.4% for an edge deployment with an energy improvement of 24% and 19% respectively. The cycle-accurate model further reduces the latency by 9.1% for a cloud and 12.2% for an edge with an energy improvement of 13.8% and 13.1%.

[LG-11] Boosting Certificate Robustness for Time Series Classification with Efficient Self-Ensemble

Link: https://arxiv.org/abs/2409.02802
Authors: Chang Dong,Zhengyang Li,Liangwei Zheng,Weitong Chen,Wei Emma Zhang
Keywords-EN: garnered significant attention, time series domain, time series, significant attention, garnered significant
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*Note: 6 figures, 4 tables, 10 pages

Click to view abstract

Abstract:Recently, the issue of adversarial robustness in the time series domain has garnered significant attention. However, the available defense mechanisms remain limited, with adversarial training being the predominant approach, though it does not provide theoretical guarantees. Randomized Smoothing has emerged as a standout method due to its ability to certify a provable lower bound on the robustness radius under ℓ_p-ball attacks. Recognizing its success, research in the time series domain has started focusing on these aspects. However, existing research predominantly focuses on time series forecasting, or on non-ℓ_p robustness via statistical feature augmentation for time series classification (TSC). Our review found that Randomized Smoothing performs modestly in TSC, struggling to provide effective assurances on datasets with poor robustness. Therefore, we propose a self-ensemble method to enhance the lower bound of the probability confidence of predicted labels by reducing the variance of classification margins, thereby certifying a larger radius. This approach also addresses the computational overhead issue of Deep Ensemble (DE) while remaining competitive and, in some cases, outperforming it in terms of robustness. Both theoretical analysis and experimental results validate the effectiveness of our method, demonstrating superior performance in robustness testing compared to baseline approaches.
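
For context, here is a generic sketch of the randomized-smoothing prediction step that underlies this line of work (our illustration, not the authors' self-ensemble): classify many noisy copies of the input and take a majority vote:

```python
import numpy as np

def smoothed_predict(classify, x, sigma=0.1, n=100, seed=0):
    """classify: function mapping a batch (n, T) -> integer class labels (n,)."""
    rng = np.random.default_rng(seed)
    noisy = x[None, :] + sigma * rng.normal(size=(n, x.size))
    votes = np.bincount(classify(noisy))
    return votes.argmax(), votes.max() / n   # label and its vote fraction

# Toy "classifier": label 1 if the series mean is positive.
toy_classifier = lambda batch: (batch.mean(axis=1) > 0).astype(int)
x = np.sin(np.linspace(0, 3, 128)) * 0.2
print(smoothed_predict(toy_classifier, x, sigma=0.5))
```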

[LG-12] UnLearning from Experience to Avoid Spurious Correlations

Link: https://arxiv.org/abs/2409.02792
Authors: Jeff Mitchell,Jesús Martínez del Rincón,Niall McLaughlin
Keywords-EN: deep neural networks, spurious correlations, networks can achieve, student model, deep neural
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Note: 10 pages

Click to view abstract

Abstract:While deep neural networks can achieve state-of-the-art performance in many tasks, these models are more fragile than they appear. They are prone to learning spurious correlations in their training data, leading to surprising failure cases. In this paper, we propose a new approach that addresses the issue of spurious correlations: UnLearning from Experience (ULE). Our method is based on using two classification models trained in parallel: student and teacher models. Both models receive the same batches of training data. The student model is trained with no constraints and pursues the spurious correlations in the data. The teacher model is trained to solve the same classification problem while avoiding the mistakes of the student model. As training is done in parallel, the better the student model learns the spurious correlations, the more robust the teacher model becomes. The teacher model uses the gradient of the student’s output with respect to its input to unlearn mistakes made by the student. We show that our method is effective on the Waterbirds, CelebA, Spawrious and UrbanCars datasets.

[LG-13] Unifying Causal Representation Learning with the Invariance Principle

Link: https://arxiv.org/abs/2409.02772
Authors: Dingling Yao,Dario Rancati,Riccardo Cadei,Marco Fumero,Francesco Locatello
Keywords-EN: causal downstream tasks, recovering latent causal, solve causal downstream, downstream tasks, robust classification
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Note: 36 pages

Click to view abstract

Abstract:Causal representation learning aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. The folklore is that these different settings are important, as they are often linked to different rungs of Pearl’s causal hierarchy, although not all neatly fit. Our main contribution is to show that many existing causal representation learning approaches methodologically align the representation to known data symmetries. Identification of the variables is guided by equivalence classes across different data pockets that are not necessarily causal. This result suggests important implications, allowing us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariances relevant to our application. It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causality assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.

[LG-14] Tractable Offline Learning of Regular Decision Processes

Link: https://arxiv.org/abs/2409.02747
Authors: Ahana Deb,Roberto Cipollone,Anders Jonsson,Alessandro Ronca,Mohammad Sadegh Talebi
Keywords-EN: Regular Decision Processes, called Regular Decision, Decision Processes, studies offline Reinforcement, Regular Decision
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*Note: To appear in EWRL 2024

Click to view abstract

Abstract:This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs). In RDPs, the unknown dependency of future observations and rewards on the past interactions can be captured by some hidden finite-state automaton. For this reason, many RDP algorithms first reconstruct this unknown dependency using automata learning techniques. In this paper, we show that it is possible to overcome two strong limitations of previous offline RL algorithms for RDPs, notably RegORL. This can be accomplished via the introduction of two original techniques: the development of a new pseudometric based on formal languages, which removes a problematic dependency on L_\infty^p-distinguishability parameters, and the adoption of Count-Min-Sketch (CMS) instead of naive counting. The former reduces the number of samples required in environments that are characterized by a low complexity in language-theoretic terms. The latter alleviates the memory requirements for long planning horizons. We derive the PAC sample complexity bounds associated with each of these techniques, and we validate the approach experimentally.

[LG-15] Complete and Efficient Covariants for 3D Point Configurations with Application to Learning Molecular Quantum Properties

Link: https://arxiv.org/abs/2409.02730
Authors: Hartmut Maennel,Oliver T. Unke,Klaus-Robert Müller
Keywords-EN: modeling physical properties, machine learning, desirable to incorporate, modeling physical, molecules with machine
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*Note:

Click to view abstract

Abstract:When modeling physical properties of molecules with machine learning, it is desirable to incorporate SO(3)-covariance. While such models based on low body order features are not complete, we formulate and prove general completeness properties for higher order methods, and show that 6k-5 of these features are enough for up to k atoms. We also find that the Clebsch–Gordan operations commonly used in these methods can be replaced by matrix multiplications without sacrificing completeness, lowering the scaling from O(l^6) to O(l^3) in the degree of the features. We apply this to quantum chemistry, but the proposed methods are generally applicable for problems involving 3D point configurations.

[LG-16] Task-Oriented Communication for Graph Data: A Graph Information Bottleneck Approach

Link: https://arxiv.org/abs/2409.02728
Authors: Shujing Li,Yanhu Wang,Shuaishuai Guo,Chenyuan Feng
Keywords-EN: involves large networks, nodes and edges, fields like knowledge, involves large, Graph
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Signal Processing (eess.SP)
*Note:

Click to view abstract

Abstract:Graph data, essential in fields like knowledge representation and social networks, often involves large networks with many nodes and edges. Transmitting these graphs can be highly inefficient due to their size and redundancy for specific tasks. This paper introduces a method to extract a smaller, task-focused subgraph that maintains key information while reducing communication overhead. Our approach utilizes graph neural networks (GNNs) and the graph information bottleneck (GIB) principle to create a compact, informative, and robust graph representation suitable for transmission. The challenge lies in the irregular structure of graph data, making GIB optimization complex. We address this by deriving a tractable variational upper bound for the objective function. Additionally, we propose the VQ-GIB mechanism, integrating vector quantization (VQ) to convert subgraph representations into a discrete codebook sequence, compatible with existing digital communication systems. Our experiments show that this GIB-based method significantly lowers communication costs while preserving essential task-related information. The approach demonstrates robust performance across various communication channels, suitable for both continuous and discrete systems.
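
A small numpy sketch of the vector-quantization step as described (our illustration): map each continuous subgraph embedding to its nearest codebook entry so that only discrete indices need to be transmitted:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))    # 16 codewords of dimension 8
embeddings = rng.normal(size=(5, 8))   # 5 subgraph embeddings to quantize

# Nearest codeword per embedding (squared Euclidean distance).
d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = d.argmin(axis=1)             # discrete symbols to transmit
quantized = codebook[indices]          # receiver-side reconstruction
print(indices)
```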

[LG-17] Recoverable Anonymization for Pose Estimation: A Privacy-Enhancing Approach

Link: https://arxiv.org/abs/2409.02715
Authors: Wenjun Huang,Yang Ni,Arghavan Rezvani,SungHeon Jeong,Hanning Chen,Yezi Liu,Fei Wen,Mohsen Imani
Keywords-EN: Human pose estimation, Human pose, Human, HPE, SPI
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Note:

Click to view abstract

Abstract:Human pose estimation (HPE) is crucial for various applications. However, deploying HPE algorithms in surveillance contexts raises significant privacy concerns due to the potential leakage of sensitive personal information (SPI) such as facial features and ethnicity. Existing privacy-enhancing methods often compromise either privacy or performance, or they require costly additional modalities. We propose a novel privacy-enhancing system that generates privacy-enhanced portraits while maintaining high HPE performance. Our key innovations include the reversible recovery of SPI for authorized personnel and the preservation of contextual information. By jointly optimizing a privacy-enhancing module, a privacy recovery module, and a pose estimator, our system ensures robust privacy protection, efficient SPI recovery, and high-performance HPE. Experimental results demonstrate the system's robust performance in privacy enhancement, SPI recovery, and HPE.

[LG-18] MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning WACV2025

Link: https://arxiv.org/abs/2409.02714
Authors: Jiarui Sun,M. Ugur Akcal,Wei Zhang,Girish Chowdhary
Keywords-EN: poses significant challenges, visual Reinforcement Learning, extracting informative state, observations poses significant, visual Reinforcement
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Note: WACV 2025

Click to view abstract

Abstract:In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges on sample efficiency, primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods such as contrastive-based approaches have made strides in improving sample efficiency but fall short in modeling the nuanced evolution of states. To address this, we introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically, we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking, coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations, which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency, demonstrating the effectiveness of our method. Our code is released at this https URL.

[LG-19] A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations

链接: https://arxiv.org/abs/2409.02712
作者: Nidhi Kowtal,Tejas Deshpande,Raviraj Joshi
关键词-EN: language pairs faces, language pairs, English-Marathi language pairs, pairs faces significant, Machine translation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at I2CT 2024

点击查看摘要

Abstract:Machine translation in low-resource language pairs faces significant challenges due to the scarcity of parallel corpora and linguistic resources. This study focuses on the case of English-Marathi language pairs, where existing datasets are notably noisy, impeding the performance of machine translation models. To mitigate the impact of data quality issues, we propose a data filtering approach based on cross-lingual sentence representations. Our methodology leverages a multilingual SBERT model to filter out problematic translations in the training data. Specifically, we employ an IndicSBERT similarity model to assess the semantic equivalence between original and translated sentences, allowing us to retain linguistically correct translations while discarding instances with substantial deviations. The results demonstrate a significant improvement in translation quality over the baseline post-filtering with IndicSBERT. This illustrates how cross-lingual sentence representations can reduce errors in machine translation scenarios with limited resources. By integrating multilingual sentence BERT models into the translation pipeline, this research contributes to advancing machine translation techniques in low-resource environments. The proposed method not only addresses the challenges in English-Marathi language pairs but also provides a valuable framework for enhancing translation quality in other low-resource language translation tasks.
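The filtering step reduces to scoring each (English, Marathi) pair by cross-lingual embedding similarity and dropping pairs below a threshold. A hedged sketch using the sentence-transformers library follows; the checkpoint name and the 0.7 threshold are assumptions, not values from the paper:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Checkpoint name is an assumption; the paper uses an IndicSBERT model.
model = SentenceTransformer("l3cube-pune/indic-sentence-similarity-sbert")

def filter_parallel_corpus(src_sents, tgt_sents, threshold=0.7):
    """Keep (source, translation) pairs whose cross-lingual cosine
    similarity clears the threshold; drop likely mistranslations."""
    src_emb = model.encode(src_sents, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True)
    sims = util.cos_sim(src_emb, tgt_emb).diagonal()
    return [(s, t) for s, t, sim in zip(src_sents, tgt_sents, sims)
            if sim.item() >= threshold]
```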

[LG-20] Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit

链接: https://arxiv.org/abs/2409.02708
作者: Chaozhi Zhang,Lin Liu,Xiaoqun Zhang
关键词-EN: practical success typically, success typically relies, Data scarcity poses, modern machine learning, meta learning
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.
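Meta-SP's exact iterative updates are not given in the abstract; as a hedged illustration of the underlying structural assumption, a shared invariant subspace can be estimated from per-task coefficient estimates by a truncated SVD:

```python
import numpy as np

def estimate_shared_subspace(coef_matrix, rank):
    """coef_matrix: (T, p) per-task coefficient estimates.
    Returns an orthonormal basis (p, rank) of the shared subspace,
    taken from the top right-singular vectors."""
    _, _, vt = np.linalg.svd(coef_matrix, full_matrices=False)
    return vt[:rank].T

# Toy check: 8 tasks whose coefficients share a 2-dim subspace of R^10
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]
coefs = rng.normal(size=(8, 2)) @ basis.T + 0.01 * rng.normal(size=(8, 10))
B_hat = estimate_shared_subspace(coefs, rank=2)
```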

[LG-21] Decision Transformer for Enhancing Neural Local Search on the Job Shop Scheduling Problem

链接: https://arxiv.org/abs/2409.02697
作者: Constantin Waubert de Puiseau,Fabian Wolz,Merlin Montag,Jannik Peters,Hasan Tercan,Tobias Meisen
关键词-EN: shop scheduling problem, job shop scheduling, scheduling problem, industry for decades, job shop
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: currently under review for IEEE Transactions on Cybernetics

点击查看摘要

Abstract:The job shop scheduling problem (JSSP) and its solution algorithms have been of enduring interest in both academia and industry for decades. In recent years, machine learning (ML) is playing an increasingly important role in advancing existing and building new heuristic solutions for the JSSP, aiming to find better solutions in shorter computation times. In this paper we build on top of a state-of-the-art deep reinforcement learning (DRL) agent, called Neural Local Search (NLS), which can efficiently and effectively control a large local neighborhood search on the JSSP. In particular, we develop a method for training the decision transformer (DT) algorithm on search trajectories taken by a trained NLS agent to further improve upon the learned decision-making sequences. Our experiments show that the DT successfully learns local search strategies that are different and, in many cases, more effective than those of the NLS agent itself. In terms of the tradeoff between solution quality and acceptable computational time needed for the search, the DT is particularly superior in application scenarios where longer computational times are acceptable. In this case, it makes up for the longer inference times required per search step, which are caused by the larger neural network architecture, through better quality decisions per step. Thereby, the DT achieves state-of-the-art results for solving the JSSP with ML-enhanced search.

[LG-22] Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs

链接: https://arxiv.org/abs/2409.02686
作者: Ruoyu Wang,Xiaoxuan Li,Lina Yao
关键词-EN: Large Language Models, Large Language, recent studies reveal, Language Models, demonstrated remarkable efficiency
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions, but recent studies reveal that these models often fail to achieve satisfactory results on questions involving reasoning, such as mathematics or physics questions. This phenomenon is usually attributed to the uncertainty regarding whether these models could genuinely comprehend the knowledge embedded in the text or merely learn to replicate the token distribution without a true understanding of the content. In this paper, we delve into this problem and aim to enhance the reasoning capabilities of LLMs. First, we investigate if the model has genuine reasoning capabilities by visualizing the text generation process at the attention and representation level. Then, we formulate the reasoning process of LLMs into a causal framework, which provides a formal explanation of the problems we observe in the visualization. Finally, building upon this causal framework, we propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model’s reasoning capabilities by encouraging the model to extract the general problem-solving skills and apply these skills to different questions. Experiments show that our method outperforms the baseline consistently across multiple benchmarks, and with only 1.2M tunable parameters, we achieve better or comparable results to other fine-tuning methods. This demonstrates the effectiveness and efficiency of our method in improving the overall accuracy and reliability of LLMs.

[LG-23] Neural Networks with LSTM and GRU in Modeling Active Fires in the Amazon

链接: https://arxiv.org/abs/2409.02681
作者: Ramon Tavares
关键词-EN: Gated Recurrent Unit, fire spots detected, detected fire spots, historical time series, time series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 16 pages, in Portuguese language, 24 figures

点击查看摘要

Abstract:This study presents a comprehensive methodology for modeling and forecasting the historical time series of fire spots detected by the AQUA_M-T satellite in the Amazon, Brazil. The approach utilizes a mixed Recurrent Neural Network (RNN) model, combining Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures to predict monthly accumulations of daily detected fire spots. A summary of the data revealed a consistent seasonality over time, with annual maximum and minimum fire spot values tending to repeat at the same periods each year. The primary objective is to verify whether the forecasts capture this inherent seasonality through rigorous statistical analysis. The methodology involved careful data preparation, model configuration, and training using cross-validation with two seeds, ensuring that the data generalizes well to the test and validation sets, and confirming the convergence of the model parameters. The results indicate that the mixed LSTM and GRU model offers improved accuracy in forecasting 12 months ahead, demonstrating its effectiveness in capturing complex temporal patterns and modeling the observed time series. This research significantly contributes to the application of deep learning techniques in environmental monitoring, specifically in fire spot forecasting. In addition to improving forecast accuracy, the proposed approach highlights the potential for adaptation to other time series forecasting challenges, opening new avenues for research and development in machine learning and natural phenomenon prediction. Keywords: Time Series Forecasting, Recurrent Neural Networks, Deep Learning.
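The abstract does not specify how the LSTM and GRU components are combined; one plausible reading, a stacked hybrid recurrent network, is sketched below in PyTorch (layer sizes and the stacking order are assumptions):

```python
import torch
import torch.nn as nn

class LSTMGRUForecaster(nn.Module):
    """Minimal mixed recurrent model: an LSTM layer feeding a GRU layer,
    then a linear head forecasting the next value."""
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):           # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        h, _ = self.gru(h)
        return self.head(h[:, -1])  # forecast from the last time step

model = LSTMGRUForecaster()
y_hat = model(torch.randn(4, 24, 1))  # 4 series, 24 monthly steps
```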

[LG-24] Independence Constrained Disentangled Representation Learning from Epistemological Perspective

链接: https://arxiv.org/abs/2409.02672
作者: Ruoyu Wang,Lina Yao
关键词-EN: Disentangled Representation Learning, identifies semantically meaningful, Representation Learning aims, Disentangled Representation, Representation Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Disentangled Representation Learning aims to improve the explainability of deep learning methods by training a data encoder that identifies semantically meaningful latent variables in the data generation process. Nevertheless, there is no consensus regarding a universally accepted definition for the objective of disentangled representation learning. In particular, there is a considerable amount of discourse regarding whether the latent variables should be mutually independent. In this paper, we first investigate these arguments on the interrelationships between latent variables by establishing a conceptual bridge between Epistemology and Disentangled Representation Learning. Then, inspired by these interdisciplinary concepts, we introduce a two-level latent space framework to provide a general solution to the prior arguments on this issue. Finally, we propose a novel method for disentangled representation learning by employing an integration of mutual information constraint and independence constraint within the Generative Adversarial Network (GAN) framework. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches in both quantitative and qualitative evaluations. The method exhibits strong performance across multiple commonly used metrics and demonstrates a great capability in disentangling various semantic factors, leading to an improved quality of controllable generation, which consequently benefits the explainability of the algorithm.

[LG-25] Causality-Aware Transformer Networks for Robotic Navigation

链接: https://arxiv.org/abs/2409.02669
作者: Ruoyu Wang,Yao Liu,Yuanjiang Cao,Lina Yao
关键词-EN: developing versatile Embodied, garnered growing interest, Recent advances, machine learning algorithms, Causal Understanding Module
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain reveals opportunities for improvement. First, the direct adoption of RNNs and Transformers often overlooks the specific differences between Embodied AI and traditional sequential data modelling, potentially limiting its performance in Embodied AI tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by initially exploring the unique differences between Embodied AI tasks and other sequential data tasks through the lens of Causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for Embodied AI. By leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module to enhance the model’s Environmental Understanding capability. Meanwhile, our method is devoid of task-specific inductive biases and can be trained in an End-to-End manner, which enhances the method’s generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performances across a spectrum of settings, tasks and simulation environments. Extensive ablation studies reveal that the performance gains can be attributed to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both Reinforcement Learning and Supervised Learning settings.

[LG-26] Learning-Based Error Detection System for Advanced Vehicle Instrument Cluster Rendering

链接: https://arxiv.org/abs/2409.02647
作者: Cornelius Bürkle,Fabian Oboril,Kay-Ulrich Scholl
关键词-EN: expanding digital display, digital display options, automotive industry, expanding digital, Cyclic Redundancy Checks
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: 9 pages

点击查看摘要

Abstract:The automotive industry is currently expanding digital display options with every new model that comes onto the market. This entails not just an expansion in dimensions, resolution, and customization choices, but also the capability to employ novel display effects like overlays while assembling the content of the display cluster. Unfortunately, this raises the need for appropriate monitoring systems that can detect rendering errors and apply appropriate countermeasures when required. Classical solutions such as Cyclic Redundancy Checks (CRC) will soon no longer be viable, as any sort of alpha blending, warping, or scaling of content can cause unwanted CRC violations. Therefore, we propose a novel monitoring approach to verify correctness of displayed content using telltales (e.g. warning signs) as an example. It uses a learning-based approach to separate “good” telltales, i.e. those that a human driver will understand correctly, and “corrupted” telltales, i.e. those that will not be visible or perceived correctly. As a result, it possesses inherent resilience against individual pixel errors and implicitly supports changing backgrounds, overlay, or scaling effects. This is underlined by our experimental study where all “corrupted” test patterns were correctly classified, while no false alarms were triggered.
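The fragility of CRC monitoring under modern rendering effects is easy to demonstrate: any alpha blending perturbs the raw bytes, so a byte-level checksum flags a perfectly legible telltale as corrupted. A minimal illustration (toy icon, not the paper's test patterns):

```python
import zlib
import numpy as np

telltale = np.zeros((32, 32, 3), dtype=np.uint8)
telltale[8:24, 8:24] = (255, 160, 0)           # toy warning icon
ref_crc = zlib.crc32(telltale.tobytes())       # reference checksum

# Render the icon alpha-blended over a dark background at 90% opacity
background = np.full_like(telltale, 40)
blended = (0.9 * telltale + 0.1 * background).astype(np.uint8)

# The icon is still perfectly legible, yet the byte-level CRC fails
assert zlib.crc32(blended.tobytes()) != ref_crc
```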

[LG-27] AdvSecureNet: A Python Toolkit for Adversarial Machine Learning

链接: https://arxiv.org/abs/2409.02629
作者: Melih Catal,Manuel Günther
关键词-EN: Machine learning models, adversarial machine learning, models are vulnerable, Machine learning, learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models are vulnerable to adversarial attacks. Several tools have been developed to research these vulnerabilities, but they often lack comprehensive features and flexibility. We introduce AdvSecureNet, a PyTorch based toolkit for adversarial machine learning that is the first to natively support multi-GPU setups for attacks, defenses, and evaluation. It is the first toolkit that supports both CLI and API interfaces and external YAML configuration files to enhance versatility and reproducibility. The toolkit includes multiple attacks, defenses and evaluation metrics. Rigorous software engineering practices are followed to ensure high code quality and maintainability. The project is available as an open-source project on GitHub at this https URL and installable via PyPI.

[LG-28] (Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models

链接: https://arxiv.org/abs/2409.02628
作者: Andreas Kirsch
关键词-EN: detection tasks, Epistemic uncertainty, epistemic uncertainty collapse, crucial for safety-critical, safety-critical applications
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 10 pages

点击查看摘要

Abstract:Epistemic uncertainty is crucial for safety-critical applications and out-of-distribution detection tasks. Yet, we uncover a paradoxical phenomenon in deep learning models: an epistemic uncertainty collapse as model complexity increases, challenging the assumption that larger models invariably offer better uncertainty quantification. We propose that this stems from implicit ensembling within large models. To support this hypothesis, we demonstrate epistemic uncertainty collapse empirically across various architectures, from explicit ensembles of ensembles and simple MLPs to state-of-the-art vision models, including ResNets and Vision Transformers – for the latter, we examine implicit ensemble extraction and decompose larger models into diverse sub-models, recovering epistemic uncertainty. We provide theoretical justification for these phenomena and explore their implications for uncertainty estimation.

[LG-29] Hypothesizing Missing Causal Variables with LLMs

链接: https://arxiv.org/abs/2409.02604
作者: Ivaxi Sheth,Sahar Abdelnabi,Mario Fritz
关键词-EN: human intellectual advances, iterative assumption refinement, data evaluation, Scientific discovery, intellectual advances
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Code - this https URL

点击查看摘要

Abstract:Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establish the relationship between the cause and the effect. Motivated by the scientific discovery process, in this work, we formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph. We design a benchmark with varying difficulty levels and knowledge assumptions about the causal graph. With the growing interest in using Large Language Models (LLMs) to assist in scientific discovery, we benchmark open-source and closed models on our testbed. We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect. In contrast, they underperform in hypothesizing the cause and effect variables themselves. We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.

[LG-30] A Fashion Item Recommendation Model in Hyperbolic Space CVPR2024

链接: https://arxiv.org/abs/2409.02599
作者: Ryotaro Shimizu,Yu Wang,Masanari Kimura,Yuki Hirakawa,Takashi Wada,Yuki Saito,Julian McAuley
关键词-EN: fashion item recommendation, incorporates hyperbolic geometry, propose a fashion, geometry into user, item recommendation model
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: This work was presented at the CVFAD Workshop at CVPR 2024

点击查看摘要

Abstract:In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users’ purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
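The abstract does not state which hyperbolic model is used; the Poincaré ball is a common choice, and its geodesic distance, which would replace the Euclidean user-item distance in the loss, can be computed as below. Treat this as a hedged illustration rather than the paper's exact formulation:

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance on the Poincare ball (a standard hyperbolic
    model; the paper's exact formulation may differ)."""
    sq = ((u - v) ** 2).sum(-1)
    un = (u ** 2).sum(-1).clamp(max=1 - eps)   # keep points inside the ball
    vn = (v ** 2).sum(-1).clamp(max=1 - eps)
    x = 1 + 2 * sq / ((1 - un) * (1 - vn))
    return torch.acosh(x.clamp(min=1 + eps))

user = torch.tensor([0.1, 0.2])
item = torch.tensor([0.3, -0.1])
print(poincare_distance(user, item))
```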

[LG-31] An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

链接: https://arxiv.org/abs/2409.02596
作者: Ryan Whetten,Titouan Parcollet,Adel Moumen,Marco Dinarelli,Yannick Estève
关键词-EN: Self-Supervised Learning, including speech processing, Learning, SSL, including speech
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted at the IEEE Spoken Language Technology Workshop 2024

点击查看摘要

Abstract:Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.

[LG-32] Multiview Random Vector Functional Link Network for Predicting DNA-Binding Proteins

链接: https://arxiv.org/abs/2409.02588
作者: A. Quadir,M. Sajid,M. Tanveer
关键词-EN: critical task due, identification of DNA-binding, critical task, task due, significant impact
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:The identification of DNA-binding proteins (DBPs) is a critical task due to their significant impact on various biological activities. Understanding the mechanisms underlying protein-DNA interactions is essential for elucidating various life activities. In recent years, machine learning-based models have been prominently utilized for DBP prediction. In this paper, to predict DBPs, we propose a novel framework termed a multiview random vector functional link (MvRVFL) network, which fuses neural network architecture with multiview learning. The proposed MvRVFL model combines the benefits of late and early fusion, allowing for distinct regularization parameters across different views while leveraging a closed-form solution to determine unknown parameters efficiently. The primal objective function incorporates a coupling term aimed at minimizing a composite of errors stemming from all views. From each of the three protein views of the DBP datasets, we extract five features. These features are then fused together by incorporating a hidden feature during the model training process. The performance of the proposed MvRVFL model on the DBP dataset surpasses that of baseline models, demonstrating its superior effectiveness. Furthermore, we extend our assessment to the UCI, KEEL, AwA, and Corel5k datasets, to establish the practicality of the proposed models. The consistency error bound, the generalization error bound, and empirical findings, coupled with rigorous statistical analyses, confirm the superior generalization capabilities of the MvRVFL model compared to the baseline models.

[LG-33] BMI Prediction from Handwritten English Characters Using a Convolutional Neural Network

链接: https://arxiv.org/abs/2409.02584
作者: N. T. Diba,N. Akter,S. A. H. Chowdhury,J. E. Giti
关键词-EN: Body Mass Index, person Body Mass, Mass Index, Body Mass, BMI
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A person’s Body Mass Index, or BMI, is the most widely used parameter for assessing their health. BMI is a crucial predictor of potential diseases that may arise at higher body fat levels because it is correlated with body fat. Conversely, a community’s or an individual’s nutritional status can be determined using the BMI. Although deep learning models are used in several studies to estimate BMI from face photos and other data, no previous research established a clear connection between deep learning techniques for handwriting analysis and BMI prediction. This article addresses this research gap with a deep learning approach to estimating BMI from handwritten characters by developing a convolutional neural network (CNN). A dataset of lowercase English handwriting samples from 48 people was collected for the BMI prediction task. The proposed CNN-based approach achieves a commendable accuracy of 99.92%. Performance comparison with other popular CNN architectures reveals that AlexNet and InceptionV3 achieve the second and third-best performance, with accuracies of 99.69% and 99.53%, respectively.

[LG-34] Advancing Cyber Incident Timeline Analysis Through Rule Based AI and Large Language Models

链接: https://arxiv.org/abs/2409.02572
作者: Fatma Yasmine Loumachi,Mohamed Chahine Ghanem
关键词-EN: Timeline Forensics, Timeline Analysis, correlate events resulting, analysing temporal digital, chronological timeline
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Timeline Analysis (TA) is a key part of Timeline Forensics (TF) in Digital Forensics (DF), focusing primarily on examining and analysing temporal digital artefacts such as timestamps, derived from event logs, file metadata, and other related data to correlate events resulting from cyber incidents and reconstruct their chronological timeline. Traditional tools often struggle to efficiently process the vast volume and variety of data acquired during DF investigations and Incident Response (IR) processes. This paper presents a novel framework, GenDFIR, that combines Rule-Based Artificial Intelligence (R-BAI) algorithms with Large Language Models (LLMs) to advance and automate the TA process. Our approach consists of two main stages: (1) we use R-BAI to identify and select anomalous digital artefacts based on predefined rules; (2) the selected artefacts are then converted into embeddings for processing by an LLM with the help of a Retrieval-Augmented Generation (RAG) agent. The LLM consequently leverages its capabilities to perform automated TA on the artefacts and predict potential incident scenarios. To validate our framework, we evaluate GenDFIR’s performance, efficiency, and reliability using various metrics across synthetic cyber incident simulation scenarios. This paper presents a proof of concept, where the findings demonstrate the significant potential of integrating R-BAI and LLMs for TA. This novel approach highlights the power of Generative AI (GenAI), specifically LLMs, and opens new avenues for advanced threat detection and incident reconstruction, representing a significant step forward in the field.

[LG-35] Low-Resolution Object Recognition with Cross-Resolution Relational Contrastive Distillation

链接: https://arxiv.org/abs/2409.02555
作者: Kangkai Zhang,Shiming Ge,Ruixin Shi,Dan Zeng
关键词-EN: challenging task due, Recognizing objects, challenging task, task due, lack of informative
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: This paper is accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

点击查看摘要

Abstract:Recognizing objects in low-resolution images is a challenging task due to the lack of informative details. Recent studies have shown that knowledge distillation approaches can effectively transfer knowledge from a high-resolution teacher model to a low-resolution student model by aligning cross-resolution representations. However, these approaches still face limitations in adapting to the situation where the recognized objects exhibit significant representation discrepancies between training and testing images. In this study, we propose a cross-resolution relational contrastive distillation approach to facilitate low-resolution object recognition. Our approach enables the student model to mimic the behavior of a well-trained teacher model which delivers high accuracy in identifying high-resolution objects. To extract sufficient knowledge, the student learning is supervised with contrastive relational distillation loss, which preserves the similarities in various relational structures in contrastive representation space. In this manner, the capability of recovering missing details of familiar low-resolution objects can be effectively enhanced, leading to a better knowledge transfer. Extensive experiments on low-resolution object classification and low-resolution face recognition clearly demonstrate the effectiveness and adaptability of our approach.

[LG-36] Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models

链接: https://arxiv.org/abs/2409.02530
作者: Chih-Yuan Li,Jun-Ting Wu,Chan Hsu,Ming-Yen Lin,Yihuang Kang
关键词-EN: Glomerular Filtration Rate, estimated Glomerular Filtration, Filtration Rate, Glomerular Filtration, estimated Glomerular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This preprint version includes corrections of typographical errors related to numerical values in Table 2, which were present in the version published at the BDH workshop in MIPR 2024. These corrections do not affect the overall conclusions of the study

点击查看摘要

Abstract:The estimated Glomerular Filtration Rate (eGFR) is an essential indicator of kidney function in clinical practice. Although traditional equations and Machine Learning (ML) models using clinical and laboratory data can estimate eGFR, accurately predicting future eGFR levels remains a significant challenge for nephrologists and ML researchers. Recent advances demonstrate that Large Language Models (LLMs) and Large Multimodal Models (LMMs) can serve as robust foundation models for diverse applications. This study investigates the potential of LMMs to predict future eGFR levels with a dataset consisting of laboratory and clinical values from 50 patients. By integrating various prompting techniques and ensembles of LMMs, our findings suggest that these models, when combined with precise prompts and visual representations of eGFR trajectories, offer predictive performance comparable to existing ML models. This research extends the application of foundation models and suggests avenues for future studies to harness these models in addressing complex medical forecasting challenges.

[LG-37] Sample what you can’t compress

链接: https://arxiv.org/abs/2409.02529
作者: Vighnesh Birodkar,Gabriel Barcik,James Lyon,Sergey Ioffe,David Minnen,Joshua V. Dillon
关键词-EN: produce blurry results, learned image representations, learned image, produce blurry, blurry results
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate the efficacy of jointly learning a continuous encoder and decoder under a diffusion-based loss. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach “Sample what you can’t compress”, or SWYCC for short.

[LG-38] Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems

链接: https://arxiv.org/abs/2409.02517
作者: Jeongmin Liu,Eunwoo Song
关键词-EN: achieved proficient waveform, proficient waveform generation, degraded synthetic quality, diverse voices, achieved proficient
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 4 pages, 4 figures, for demo samples, see this https URL

点击查看摘要

Abstract:While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
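The core augmentation is simple: before each training step, pass the input acoustic features through a randomly drawn linear smoothing filter. A hedged sketch using a moving-average kernel of random width follows; the paper's exact filter family and width range are not specified in the abstract, so both are assumptions:

```python
import numpy as np

def smooth_features(mel, max_width=5):
    """Randomly smooth acoustic features along time with a moving-average
    filter of random width (the kernel family and width range are
    assumptions; the paper only specifies 'linear smoothing filters').
    mel: (time, n_mels) feature matrix."""
    width = np.random.randint(1, max_width + 1)
    if width == 1:
        return mel                      # identity filter: no smoothing
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda ch: np.convolve(ch, kernel, mode="same"), 0, mel)

mel = np.random.randn(200, 80)          # 200 frames, 80 mel bins
augmented = smooth_features(mel)
```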

[LG-39] Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal

链接: https://arxiv.org/abs/2409.02512
作者: Jifeng Hu,Li Shen,Sili Huang,Zhejian Yang,Hechang Chen,Lichao Sun,Yi Chang,Dacheng Tao
关键词-EN: Artificial neural networks, shown remarkable superiority, Artificial neural, neural networks, superiority in gaming
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks’ datasets are usually static. However, in real-world applications, such as robotic control of reinforcement learning (RL), the tasks are changing, and new tasks arise in a sequential order. This situation poses the new challenge of plasticity-stability trade-off for training an agent who can adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train the CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show CoD can achieve a promising plasticity-stability trade-off and outperform existing diffusion-based methods and other representative baselines on most tasks.
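The rehearsal mechanism amounts to retaining a small slice of each finished task's data and replaying it alongside the current task. A minimal buffer sketch follows; the retained fraction is an assumption, as the paper only says "a small portion":

```python
import random

class RehearsalBuffer:
    """Minimal experience-rehearsal buffer: keep a small fraction of each
    finished task's dataset and mix it into later training batches."""
    def __init__(self, keep_frac=0.05):
        self.keep_frac = keep_frac
        self.store = []

    def add_task(self, dataset):
        # Retain a random subset of the task's samples
        k = max(1, min(len(dataset), int(self.keep_frac * len(dataset))))
        self.store.extend(random.sample(list(dataset), k))

    def replay_batch(self, batch_size):
        if not self.store:
            return []
        return random.sample(self.store, min(batch_size, len(self.store)))
```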

[LG-40] CoAst: Validation-Free Contribution Assessment for Federated Learning based on Cross-Round Valuation

链接: https://arxiv.org/abs/2409.02495
作者: Hao Wu,Likun Zhang,Shucheng Li,Fengyuan Xu,Sheng Zhong
关键词-EN: federated learning, data held, model performance, participant, validation data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the federated learning (FL) process, since the data held by each participant is different, it is necessary to figure out which participant has a higher contribution to the model performance. Effective contribution assessment can help motivate data owners to participate in the FL training. Research works in this field can be divided into two directions based on whether a validation dataset is required. Validation-based methods need to use representative validation data to measure the model accuracy, which is difficult to obtain in practical FL scenarios. Existing validation-free methods assess the contribution based on the parameters and gradients of local models and the global model in a single training round, which is easily compromised by the stochasticity of model training. In this work, we propose CoAst, a practical method to assess the FL participants’ contribution without access to any validation data. The core idea of CoAst involves two aspects: one is to count only the most important part of the model parameters through weight quantization, and the other is a cross-round valuation based on the similarity between the current local parameters and the global parameter updates in several subsequent communication rounds. Extensive experiments show that CoAst has comparable assessment reliability to existing validation-based methods and outperforms existing validation-free methods.
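A hedged sketch of the two ingredients named above: a participant's update is first reduced to its most important coordinates via a coarse quantization, then scored by its similarity to the global updates of subsequent rounds. The top-k fraction, the sign-based quantization, and the averaging are illustrative assumptions:

```python
import torch

def contribution_score(local_update, global_updates, top_frac=0.01):
    """Score a participant's flattened parameter update by its alignment
    with the global updates of the next few communication rounds."""
    k = max(1, int(top_frac * local_update.numel()))
    idx = local_update.abs().topk(k).indices      # most important weights
    masked = torch.zeros_like(local_update)
    masked[idx] = local_update[idx].sign()        # coarse quantization
    score = 0.0
    for g in global_updates:                      # subsequent rounds
        score += torch.cosine_similarity(masked, g, dim=0).item()
    return score / len(global_updates)

local = torch.randn(10_000)
future_globals = [torch.randn(10_000) for _ in range(3)]
print(contribution_score(local, future_globals))
```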

[LG-41] Reliable Deep Diffusion Tensor Estimation: Rethinking the Power of Data-Driven Optimization Routine

链接: https://arxiv.org/abs/2409.02492
作者: Jialong Li,Zhicheng Zhang,Yunwei Chen,Qiqi Lu,Ye Wu,Xiaoming Liu,QianJin Feng,Yanqiu Feng,Xinyuan Zhang
关键词-EN: holds significant importance, Diffusion tensor imaging, holds significant, neuroscience research, diffusion tensor field
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Diffusion tensor imaging (DTI) holds significant importance in clinical diagnosis and neuroscience research. However, conventional model-based fitting methods often suffer from sensitivity to noise, leading to decreased accuracy in estimating DTI parameters. While traditional data-driven deep learning methods have shown potential in terms of accuracy and efficiency, their limited generalization to out-of-training-distribution data impedes their broader application due to the diverse scan protocols used across centers, scanners, and studies. This work aims to tackle these challenges and promote the use of DTI by introducing a data-driven optimization-based method termed DoDTI. DoDTI combines the weighted linear least squares fitting algorithm and regularization by denoising technique. The former fits DW images from diverse acquisition settings into diffusion tensor field, while the latter applies a deep learning-based denoiser to regularize the diffusion tensor field instead of the DW images, which is free from the limitation of fixed-channel assignment of the network. The optimization object is solved using the alternating direction method of multipliers and then unrolled to construct a deep neural network, leveraging a data-driven strategy to learn network parameters. Extensive validation experiments are conducted utilizing both internally simulated datasets and externally obtained in-vivo datasets. The results, encompassing both qualitative and quantitative analyses, showcase that the proposed method attains state-of-the-art performance in DTI parameter estimation. Notably, it demonstrates superior generalization, accuracy, and efficiency, rendering it highly reliable for widespread application in the field.
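The fitting half of DoDTI is the standard weighted linear least-squares (WLLS) solution of the log-linear tensor model ln S = ln S0 - b g^T D g; a plain NumPy version is sketched below. The signal-squared weights follow the classic WLLS choice; the learned denoiser and ADMM unrolling are not shown:

```python
import numpy as np

def wlls_tensor_fit(signals, bvals, bvecs):
    """Weighted linear least-squares fit of the log-linear DTI model.
    signals: (N,) DW signals, bvals: (N,), bvecs: (N, 3) unit gradients.
    Returns ln S0 and the 6 unique diffusion tensor elements."""
    g = bvecs
    B = -bvals[:, None] * np.column_stack([
        g[:, 0]**2, g[:, 1]**2, g[:, 2]**2,
        2 * g[:, 0] * g[:, 1], 2 * g[:, 0] * g[:, 2], 2 * g[:, 1] * g[:, 2]])
    A = np.column_stack([np.ones(len(bvals)), B])   # design matrix
    y = np.log(np.clip(signals, 1e-6, None))
    W = np.diag(signals ** 2)                       # classic WLLS weighting
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0], theta[1:]
```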

[LG-42] Adversarial Attacks on Machine Learning-Aided Visualizations

链接: https://arxiv.org/abs/2409.02485
作者: Takanori Fujiwara,Kostiantyn Kucher,Junpeng Wang,Rafael M. Martins,Andreas Kerren,Anders Ynnerman
关键词-EN: high societal impact, machine learning, techniques to generate, societal impact, field is rapidly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This is the author’s version of the article that has been accepted by the Journal of Visualization

点击查看摘要

Abstract:Research in ML4VIS investigates how to use machine learning (ML) techniques to generate visualizations, and the field is rapidly growing with high societal impact. However, as with any computational pipeline that employs ML processes, ML4VIS approaches are susceptible to a range of ML-specific adversarial attacks. These attacks can manipulate visualization generations, causing analysts to be tricked and their judgments to be impaired. Due to a lack of synthesis from both visualization and ML perspectives, this security aspect is largely overlooked by the current ML4VIS literature. To bridge this gap, we investigate the potential vulnerabilities of ML-aided visualizations from adversarial attacks using a holistic lens of both visualization and ML perspectives. We first identify the attack surface (i.e., attack entry points) that is unique in ML-aided visualizations. We then exemplify five different adversarial attacks. These examples highlight the range of possible attacks when considering the attack surface and multiple different adversary capabilities. Our results show that adversaries can induce various attacks, such as creating arbitrary and deceptive visualizations, by systematically identifying input attributes that are influential in ML inferences. Based on our observations of the attack surface characteristics and the attack examples, we underline the importance of comprehensive studies of security issues and defense mechanisms as a call of urgency for the ML4VIS community.

[LG-43] Volumetric Surfaces: Representing Fuzzy Geometries with Multiple Meshes

链接: https://arxiv.org/abs/2409.02482
作者: Stefano Esposito,Anpei Chen,Christian Reiser,Samuel Rota Bulò,Lorenzo Porzi,Katja Schwarz,Christian Richardt,Michael Zollhöfer,Peter Kontschieder,Andreas Geiger
关键词-EN: High-quality real-time view, High-quality real-time, High-quality, real-time view synthesis, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-quality real-time view synthesis methods are based on volume rendering, splatting, or surface rendering. While surface-based methods generally are the fastest, they cannot faithfully model fuzzy geometry like hair. In turn, alpha-blending techniques excel at representing fuzzy materials but require an unbounded number of samples per ray (P1). Further overheads are induced by empty space skipping in volume rendering (P2) and sorting input primitives in splatting (P3). These problems are exacerbated on low-performance graphics hardware, e.g. on mobile devices. We present a novel representation for real-time view synthesis where the (P1) number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. We achieve this by representing objects as semi-transparent multi-layer meshes, rendered in fixed layer order from outermost to innermost. We model mesh layers as SDF shells with optimal spacing learned during training. After baking, we fit UV textures to the corresponding meshes. We show that our method can represent challenging fuzzy objects while achieving higher frame rates than volume-based and splatting-based methods on low-end and mobile devices.

[LG-44] ForeCal: Random Forest-based Calibration for DNNs

链接: https://arxiv.org/abs/2409.02446
作者: Dhruv Nigam
关键词-EN: Deep neural network, higher ROC AUC, true event likelihoods, Deep neural, higher ROC
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural network (DNN) based classifiers do extremely well in discriminating between observations, resulting in higher ROC AUC and accuracy metrics, but their outputs are often miscalibrated with respect to true event likelihoods. Post-hoc calibration algorithms are often used to calibrate the outputs of these classifiers. Methods like Isotonic regression, Platt scaling, and Temperature scaling have been shown to be effective in some cases but are limited by their parametric assumptions and/or their inability to capture complex non-linear relationships. We propose ForeCal - a novel post-hoc calibration algorithm based on Random forests. ForeCal exploits two unique properties of Random forests: the ability to enforce weak monotonicity and range-preservation. It is more powerful in achieving calibration than current state-of-the-art methods, is non-parametric, and can incorporate exogenous information as features to learn a better calibration function. Through experiments on 43 diverse datasets from the UCI ML repository, we show that ForeCal outperforms existing methods in terms of Expected Calibration Error (ECE) with minimal impact on the discriminative power of the base DNN as measured by AUC.
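ForeCal's monotonicity and range-preservation constraints are not reproduced here, but the basic idea, fitting a random forest that maps held-out DNN scores to observed labels and evaluating with ECE, can be sketched with scikit-learn. The synthetic miscalibration and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
scores = rng.uniform(size=5000)                   # held-out DNN outputs
# Synthetic miscalibration: true P(y=1 | score s) is s^2, not s
labels = (rng.uniform(size=5000) < scores**2).astype(float)

calib = RandomForestRegressor(n_estimators=200, min_samples_leaf=50)
calib.fit(scores.reshape(-1, 1), labels)
calibrated = calib.predict(scores.reshape(-1, 1))

def ece(p, y, bins=10):
    """Expected Calibration Error over equal-width probability bins."""
    edges = np.linspace(0, 1, bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (p >= lo) & (p < hi)
        if m.any():
            err += m.mean() * abs(p[m].mean() - y[m].mean())
    return err

print(ece(scores, labels), ece(calibrated, labels))  # ECE should drop
```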

[LG-45] Adversarial Learning for Neural PDE Solvers with Sparse Data

链接: https://arxiv.org/abs/2409.02431
作者: Yunpeng Gong,Yongjie Hou,Zhenzhong Wang,Zexin Lin,Min Jiang
关键词-EN: partial differential equations, face challenges related, made significant progress, Neural network solvers, differential equations
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural network solvers for partial differential equations (PDEs) have made significant progress, yet they continue to face challenges related to data scarcity and model robustness. Traditional data augmentation methods, which leverage symmetry or invariance, impose strong assumptions on physical systems that often do not hold in dynamic and complex real-world applications. To address this research gap, this study introduces a universal learning strategy for neural network PDEs, named Systematic Model Augmentation for Robust Training (SMART). By focusing on challenging and improving the model’s weaknesses, SMART reduces generalization error during training under data-scarce conditions, leading to significant improvements in prediction accuracy across various PDE scenarios. The effectiveness of the proposed method is demonstrated through both theoretical analysis and extensive experimentation. The code will be available.

[LG-46] Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

链接: https://arxiv.org/abs/2409.02428
作者: Guanwen Xie,Jingzehua Xu,Yiyuan Yang,Shuai Zhang
关键词-EN: Leveraging large language, large language models, demonstrates significant potential, Leveraging large, functions demonstrates significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to be effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ the reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrects the reward code with only one round of feedback for each requirement, effectively preventing irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables the acquisition of different reward functions within the Pareto solution set without weight search. Even in the case where a weight is 100 times off, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts utilizing GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.

[LG-47] Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

链接: https://arxiv.org/abs/2409.02426
作者: Peng Wang,Huijie Zhang,Zekai Zhang,Siyi Chen,Yi Ma,Qing Qu
关键词-EN: Recent empirical studies, Recent empirical, diffusion models, image data, image
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 39 pages, 9 figures

点击查看摘要

Abstract:Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.

[LG-48] Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning

链接: https://arxiv.org/abs/2409.02425
作者: Shuaishuai Huang,Haowei Yang,You Yao,Xueting Lin,Yuming Tu
关键词-EN: accurately capturing users’, capturing users’ evolving, Adaptive Interest Network, critical research area, users’ evolving interests
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In personalized recommendation systems, accurately capturing users’ evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users’ interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users’ interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.

[LG-49] Relative-Translation Invariant Wasserstein Distance

链接: https://arxiv.org/abs/2409.02416
作者: Binshuai Wang,Qiwei Di,Ming Yin,Mengdi Wang,Quanquan Gu,Peng Wei
关键词-EN: relative-translation invariant Wasserstein, invariant Wasserstein distances, optimal transport model, distribution shift, distance
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a new family of distances, relative-translation invariant Wasserstein distances (RW_p), for measuring the similarity of two probability distributions under distribution shift. Generalizing it from the classical optimal transport model, we show that RW_p distances are also real distance metrics defined on the quotient set \mathcal{P}_p(\mathbb{R}^n)/\sim and invariant to distribution translations. When p=2, the RW_2 distance enjoys more exciting properties, including decomposability of the optimal transport model, translation-invariance of the RW_2 distance, and a Pythagorean relationship between RW_2 and the classical quadratic Wasserstein distance (W_2). Based on these properties, we show that a distribution shift, measured by W_2 distance, can be explained in the bias-variance perspective. In addition, we propose a variant of the Sinkhorn algorithm, named RW_2 Sinkhorn algorithm, for efficiently calculating RW_2 distance, coupling solutions, as well as W_2 distance. We also provide the analysis of numerical stability and time complexity for the proposed algorithm. Finally, we validate the RW_2 distance metric and the algorithm performance with three experiments. We conduct one numerical validation for the RW_2 Sinkhorn algorithm and show two real-world applications demonstrating the effectiveness of using RW_2 under distribution shift: digits recognition and similar thunderstorm detection. The experimental results report that our proposed algorithm significantly improves the computational efficiency of Sinkhorn in certain practical applications, and the RW_2 distance is robust to distribution translations compared with baselines.
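The Pythagorean relationship referenced above admits a compact statement. The following is a hedged reading consistent with the classical decomposition of W_2 under mean shifts, not a formula quoted from the paper:

```latex
% Hedged reading: the W_2 distance decomposes into a translation (bias)
% part and a translation-invariant (shape) part, with m_\mu the mean of \mu.
W_2^2(\mu, \nu) = \lVert m_\mu - m_\nu \rVert_2^2 + RW_2^2(\mu, \nu),
\qquad m_\mu = \mathbb{E}_{X \sim \mu}[X].
```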

[LG-50] Abstractive Text Summarization: State of the Art Challenges and Improvements

链接: https://arxiv.org/abs/2409.02413
作者: Hassan Shakil,Ahmad Farooq,Jugal Kalita
关键词-EN: Specifically focusing, prospective research directions, opposed to extractive, survey presents, summarization
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 Tables, 7 Figures

点击查看摘要

Abstract:Specifically focusing on the landscape of abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview, delving into state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, limitations and charts out future improvements - providing researchers an extensive overview to advance abstractive summarization research. We provide vital comparison tables across techniques categorized - offering insights into model complexity, scalability and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.

[LG-51] Adaptive Class Emergence Training: Enhancing Neural Network Stability and Generalization through Progressive Target Evolution

Link: https://arxiv.org/abs/2409.02410
Authors: Jaouad Dabounou
Keywords-EN: Recent advancements, deep neural networks, artificial intelligence, one-hot encoded vectors, advancements in artificial
Subjects: Machine Learning (cs.LG)
*Comments: 15 pages, 9 figures, 2 tables

Click to view abstract

Abstract:Recent advancements in artificial intelligence, particularly deep neural networks, have pushed the boundaries of what is achievable in complex tasks. Traditional methods for training neural networks in classification problems often rely on static target outputs, such as one-hot encoded vectors, which can lead to unstable optimization and difficulties in handling non-linearities within data. In this paper, we propose a novel training methodology that progressively evolves the target outputs from a null vector to one-hot encoded vectors throughout the training process. This gradual transition allows the network to adapt more smoothly to the increasing complexity of the classification task, maintaining an equilibrium state that reduces the risk of overfitting and enhances generalization. Our approach, inspired by concepts from structural equilibrium in finite element analysis, has been validated through extensive experiments on both synthetic and real-world datasets. The results demonstrate that our method achieves faster convergence, improved accuracy, and better generalization, especially in scenarios with high data complexity and noise. This progressive training framework offers a robust alternative to classical methods, opening new perspectives for more efficient and stable neural network training.
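
As an illustration of the progressive-target idea described in the abstract, the following sketch linearly interpolates the training targets from a null vector to one-hot vectors over the course of training. The schedule and loss are our own simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                 # soft targets, so a regression-style loss

x = torch.randn(256, 20)
y = torch.randint(0, 5, (256,))
one_hot = torch.eye(5)[y]

epochs = 50
for epoch in range(epochs):
    alpha = (epoch + 1) / epochs       # linear 0 -> 1 schedule
    target = alpha * one_hot           # null vector gradually becomes one-hot
    opt.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()
    opt.step()
```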

[LG-52] Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation

Link: https://arxiv.org/abs/2409.02404
Authors: Shiming Ge,Bochao Liu,Pengju Wang,Yong Li,Dan Zeng
Keywords-EN: privacy leakage risk, synthetic data, data, practical deployment, proved successful
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments: This paper is accepted by IEEE Transactions on Image Processing (TIP)

Click to view abstract

Abstract:While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is to use models as a bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, the discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, the generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data, which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few are fed into the teacher ensemble to query labels via differentially private aggregation, while most are embedded by the trained VAE for reconstructing synthetic data. Finally, semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on the few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control the query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.
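
The differentially private label-querying step described above can be illustrated with a small noisy-voting routine in the spirit of PATE-style aggregation; the function below is a hypothetical sketch, not the paper's code.

```python
import numpy as np

def dp_aggregate(teacher_preds, num_classes, noise_scale=2.0, rng=None):
    """teacher_preds: array of class votes from the teacher ensemble for one
    synthetic sample. Laplace noise on the vote histogram yields a
    differentially private label."""
    rng = rng or np.random.default_rng()
    votes = np.bincount(teacher_preds, minlength=num_classes).astype(float)
    votes += rng.laplace(0.0, noise_scale, size=num_classes)
    return int(np.argmax(votes))

preds = np.array([2, 2, 1, 2, 0, 2, 2, 1])  # votes from 8 teachers
print(dp_aggregate(preds, num_classes=3))
```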

[LG-53] Building Math Agents with Multi-Turn Iterative Preference Learning

Link: https://arxiv.org/abs/2409.02392
Authors: Wei Xiong,Chengshuai Shi,Jiaming Shen,Aviv Rosenberg,Zhen Qin,Daniele Calandriello,Misha Khalman,Rishabh Joshi,Bilal Piot,Mohammad Saleh,Chi Jin,Tong Zhang,Tianqi Liu
Keywords-EN: mathematical problem-solving capabilities, direct preference learning, large language models', Recent studies, integrating external tools
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: A multi-turn direct preference learning framework for tool-integrated reasoning tasks

Click to view abstract

Abstract:Recent studies have shown that large language models’ (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms were originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model’s performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
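
A trajectory-level direct preference loss of the kind described above might look as follows; this is a simplified sketch of multi-turn DPO in which per-token log-probabilities are summed over whole trajectories and tool/interpreter outputs are masked out. It should not be taken as the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def multi_turn_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        mask_w, mask_l, beta=0.1):
    """logp_*: (batch, seq) per-token log-probs under the policy for the
    chosen (w) and rejected (l) trajectories; ref_logp_*: the same under the
    frozen reference model; mask_*: 1 for model-generated tokens, 0 for
    prompt and tool/interpreter output tokens."""
    pi_w = (logp_w * mask_w).sum(-1)
    pi_l = (logp_l * mask_l).sum(-1)
    ref_w = (ref_logp_w * mask_w).sum(-1)
    ref_l = (ref_logp_l * mask_l).sum(-1)
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin).mean()
```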

[LG-54] Gaussian Rate-Distortion-Perception Coding and Entropy-Constrained Scalar Quantization

Link: https://arxiv.org/abs/2409.02388
Authors: Li Xie,Liangyan Li,Jun Chen,Lei Yu,Zhongshan Zhang
Keywords-EN: limited common randomness, Kullback-Leibler divergence-based perception, quadratic Gaussian, distance-based perception measure, divergence-based perception measure
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This paper investigates the best known bounds on the quadratic Gaussian distortion-rate-perception function with limited common randomness for the Kullback-Leibler divergence-based perception measure, as well as their counterparts for the squared Wasserstein-2 distance-based perception measure, recently established by Xie et al. These bounds are shown to be nondegenerate in the sense that they cannot be deduced from each other via a refined version of Talagrand’s transportation inequality. On the other hand, an improved lower bound is established when the perception measure is given by the squared Wasserstein-2 distance. In addition, it is revealed by exploiting the connection between rate-distortion-perception coding and entropy-constrained scalar quantization that all the aforementioned bounds are generally not tight in the weak perception constraint regime.

[LG-55] Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing

Link: https://arxiv.org/abs/2409.02374
Authors: Siyi Chen,Huijie Zhang,Minzhe Guo,Yifu Lu,Peng Wang,Qing Qu
Keywords-EN: LOCO Edit, powerful class, class of generative, Recently, Edit
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from intriguing observations: among a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness in the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with nice properties: homogeneity, transferability, composability, and linearity. These properties of LOCO Edit benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive empirical experiments demonstrate the effectiveness and efficiency of LOCO Edit. The codes will be released at this https URL.
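
The core mechanics, extracting an edit direction from the singular vectors of the PMP's Jacobian, can be sketched on a toy denoiser as below; in the actual method the Jacobian is taken of the diffusion model's posterior mean predictor at a chosen noise level, so everything here is illustrative.

```python
import torch

# Stand-in for the posterior mean predictor at a fixed noise level.
denoiser = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 16))

x = torch.randn(16)
J = torch.autograd.functional.jacobian(denoiser, x)  # (16, 16)
U, S, Vh = torch.linalg.svd(J)
edit_direction = Vh[0]               # top right-singular vector: the dominant
x_edited = x + 0.5 * edit_direction  # semantic direction in input space
```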

[LG-56] Optimal Neural Network Approximation for High-Dimensional Continuous Functions

Link: https://arxiv.org/abs/2409.02363
Authors: Ayan Maiti,Michelle Michelle,Haizhao Yang
Keywords-EN: Shen Yang Zhang, Yang Zhang, Shen Yang, authors of Shen, special activation function
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Recently, the authors of Shen, Yang, and Zhang (JMLR, 2022) developed a neural network with width 36d(2d+1) and depth 11, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in C([a,b]^d). That is, the constructed network only requires a fixed number of neurons to approximate a d-variate continuous function on a d-dimensional hypercube with arbitrary accuracy. Their network uses \mathcal{O}(d^2) fixed neurons. One natural question to address is whether we can reduce the number of these neurons in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, our analysis shows that there is a neural network generated by the elementary universal activation function with only 366d + 365 fixed, intrinsic (non-repeated) neurons that attains this super approximation property. Furthermore, we present a family of continuous functions that requires at least width d, and therefore at least d intrinsic neurons, to achieve arbitrary accuracy in its approximation. This shows that the requirement of \mathcal{O}(d) intrinsic neurons is optimal in the sense that it grows linearly with the input dimension d, unlike some approximation methods where parameters may grow exponentially with d.

[LG-57] Understanding the Role of Functional Diversity in Weight-Ensembling with Ingredient Selection and Multidimensional Scaling ICML2024

Link: https://arxiv.org/abs/2409.02347
Authors: Alex Rojas,David Alvarez-Melis
Keywords-EN: multiple neural networks, parameters of multiple, multiple neural, neural networks, networks are directly
Subjects: Machine Learning (cs.LG)
*Comments: Published at the ICML 2024 (Vienna, Austria) Workshop on Foundation Models in the Wild

Click to view abstract

Abstract:Weight-ensembles are formed when the parameters of multiple neural networks are directly averaged into a single model. They have demonstrated generalization capability in-distribution (ID) and out-of-distribution (OOD) which is not completely understood, though they are thought to successfully exploit the functional diversity allotted by each distinct model. Given a collection of models, it is also unclear which combination leads to the optimal weight-ensemble; the SOTA is a linear-time ``greedy'' method. We introduce two novel weight-ensembling approaches to study the link between performance dynamics and how each method decides to apply the functionally diverse components, akin to diversity-encouragement in the prediction-ensemble literature. We develop a visualization tool to explain how each algorithm explores various domains defined via pairwise distances, to further investigate selection and the algorithms' convergence. Empirical analyses shed perspectives that reinforce how high diversity enhances weight-ensembling while qualifying the extent to which diversity alone improves accuracy. We also demonstrate that sampling positionally distinct models can contribute just as meaningfully to improvements in a weight-ensemble.
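
For reference, the linear-time greedy baseline mentioned above can be sketched as follows (in the spirit of greedy model soups; the helper names are ours, not the paper's):

```python
import copy

def average_state_dicts(state_dicts):
    """Uniformly average a list of PyTorch state_dicts into one."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for sd in state_dicts[1:]:
            avg[key] = avg[key] + sd[key]
        avg[key] = avg[key] / len(state_dicts)
    return avg

def greedy_weight_ensemble(models, evaluate):
    """models: list of state_dicts sorted by validation score (best first);
    evaluate: callable mapping a state_dict to validation accuracy."""
    soup = [models[0]]
    best = evaluate(average_state_dicts(soup))
    for sd in models[1:]:
        score = evaluate(average_state_dicts(soup + [sd]))
        if score >= best:            # keep the ingredient only if it helps
            soup.append(sd)
            best = score
    return average_state_dicts(soup)
```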

[LG-58] Robust Federated Finetuning of Foundation Models via Alternating Minimization of LoRA ICML2024

Link: https://arxiv.org/abs/2409.02346
Authors: Shuangyi Chen,Yue Ju,Hardik Dalal,Zhongwen Zhu,Ashish Khisti
Keywords-EN: innovative training strategy, significantly lowering, memory demands, innovative training, training strategy
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: Presented at ES-FOMO-II@ICML2024

Click to view abstract

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has risen as an innovative training strategy that updates only a select few model parameters, significantly lowering both computational and memory demands. PEFT also helps to decrease data transfer in federated learning settings, where communication depends on the size of updates. In this work, we explore the constraints of previous studies that integrate a well-known PEFT method named LoRA with federated fine-tuning, and then introduce RoLoRA, a robust federated fine-tuning framework that utilizes an alternating minimization approach for LoRA, providing greater robustness against decreasing fine-tuning parameters and increasing data heterogeneity. Our results indicate that RoLoRA not only preserves the communication benefits but also substantially enhances robustness and effectiveness in multiple federated fine-tuning scenarios.
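
A minimal sketch of the alternating-minimization idea, as we read it from the abstract: in each federated round only one LoRA factor is unfrozen and trained, alternating between A and B. The module below is illustrative, not the authors' implementation.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)               # frozen pretrained weights
        self.A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

def set_phase(layer, train_A):
    layer.A.requires_grad_(train_A)               # one factor at a time
    layer.B.requires_grad_(not train_A)

layer = LoRALinear(32, 32)
for rnd in range(4):                              # federated rounds (server side)
    set_phase(layer, train_A=(rnd % 2 == 0))
    # ... broadcast, run local client updates on the unfrozen factor, aggregate
```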

[LG-59] NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval

Link: https://arxiv.org/abs/2409.02343
Authors: Sepanta Zeighami,Zac Wellmer,Aditya Parameswaran
Keywords-EN: Nearest Neighbor search, Nearest Neighbor, Retrieval-Augmented Generation, Neighbor search, dense vector embeddings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:k-Nearest Neighbor search on dense vector embeddings (k-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently, but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of k-NN retrieval. We present a thorough theoretical and experimental study of NUDGE's non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE provides 3.3x and 4.3x higher increase in accuracy and runs 200x and 3x faster, respectively, over fine-tuning the pre-trained model and training adaptors.
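
A rough sketch of the core idea, per our reading of the abstract: nudge each record embedding toward the queries that should retrieve it while bounding the step so the change stays modest. The gradient step below is illustrative; the paper solves constrained variants exactly.

```python
import torch
import torch.nn.functional as F

def nudge_step(data_emb, query_emb, positive_idx, lr=0.05, max_shift=0.1):
    """data_emb: (n, d) record embeddings; query_emb: (q, d) training queries;
    positive_idx: (q,) index of the record each query should retrieve."""
    emb = data_emb.detach().clone().requires_grad_(True)
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(emb, dim=-1).T
    loss = F.cross_entropy(sims / 0.05, positive_idx)   # retrieval objective
    loss.backward()
    with torch.no_grad():
        # bound the per-coordinate change to keep the shift modest
        delta = (-lr * emb.grad).clamp(-max_shift, max_shift)
        return data_emb + delta

data = torch.randn(100, 32)
queries = torch.randn(10, 32)
targets = torch.randint(0, 100, (10,))
data = nudge_step(data, queries, targets)
```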

[LG-60] Data-driven 2D stationary quantum droplets and wave propagations in the amended GP equation with two potentials via deep neural networks learning

Link: https://arxiv.org/abs/2409.02339
Authors: Jin Song,Zhenya Yan
Keywords-EN: amended Gross-Pitaevskii equation, stationary quantum droplets, quantum droplets, solve two-dimensional, amended Gross-Pitaevskii
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Pattern Formation and Solitons (nlin.PS); Computational Physics (physics.comp-ph); Optics (physics.optics)
*Comments: 17 pages, 12 figures (Proc. R. Soc. A, accepted for publication). arXiv admin note: text overlap with arXiv:2409.01124

Click to view abstract

Abstract:In this paper, we develop a systematic deep learning approach to solve two-dimensional (2D) stationary quantum droplets (QDs) and investigate their wave propagation in the 2D amended Gross-Pitaevskii equation with Lee-Huang-Yang correction and two kinds of potentials. First, we use the initial-value iterative neural network (IINN) algorithm to obtain 2D stationary quantum droplets of the stationary equations. Then the learned stationary QDs are used as the initial-value conditions for physics-informed neural networks (PINNs) to explore their evolutions in some space-time region. In particular, we consider two types of potentials, one is the 2D quadruple-well Gaussian potential and the other is the PT-symmetric HO-Gaussian potential, which lead to spontaneous symmetry breaking and the generation of multi-component QDs. The used deep learning method can also be applied to study wave propagations of other nonlinear physical models.

[LG-61] Double Machine Learning at Scale to Predict Causal Impact of Customer Actions KDD ECML

Link: https://arxiv.org/abs/2409.02332
Authors: Sushant More,Priya Kotwal,Sujith Chappidi,Dinesh Mandalapu,Chris Khawand
Keywords-EN: long-term investment decisions, Causal Impact, inform both short, customer actions, industry to inform
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP); Methodology (stat.ME)
*Comments: 16 pages, 11 figures. Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) 2023, Turin, Italy

Click to view abstract

Abstract:Causal Impact (CI) estimates of customer actions are broadly used across the industry to inform both short- and long-term investment decisions of various types. In this paper, we apply the double machine learning (DML) methodology to estimate the CI values across 100s of customer actions of business interest and 100s of millions of customers. We operationalize DML through a causal ML library based on Spark with a flexible, JSON-driven model configuration approach to estimate CI at scale (i.e., across hundreds of actions and millions of customers). We outline the DML methodology and implementation, and the associated benefits over the traditional potential-outcomes-based CI model. We show population-level as well as customer-level CI values along with confidence intervals. The validation metrics show a 2.2% gain over the baseline methods and a 2.5X gain in computational time. Our contribution is to advance the scalable application of CI, while also providing an interface that allows faster experimentation, cross-platform support, the ability to onboard new use cases, and improved accessibility of the underlying code for partner teams.
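
For intuition, the partialling-out DML estimator underlying this approach can be sketched in a few lines with cross-fitting; this is a textbook illustration with scikit-learn, whereas the paper's system runs DML on Spark at industrial scale.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def dml_causal_impact(X, treatment, outcome, n_folds=2):
    """Partialling-out DML: residualize treatment and outcome on covariates
    with cross-fitted nuisance models, then regress residual on residual."""
    res_t = np.zeros_like(treatment, dtype=float)
    res_y = np.zeros_like(outcome, dtype=float)
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(X):
        m_t = GradientBoostingRegressor().fit(X[train], treatment[train])
        m_y = GradientBoostingRegressor().fit(X[train], outcome[train])
        res_t[test] = treatment[test] - m_t.predict(X[test])  # residualize T
        res_y[test] = outcome[test] - m_y.predict(X[test])    # residualize Y
    return (res_t @ res_y) / (res_t @ res_t)

X = np.random.randn(2000, 5)
t = (X[:, 0] + np.random.randn(2000) > 0).astype(float)
y = 2.0 * t + X[:, 1] + np.random.randn(2000)
print(dml_causal_impact(X, t, y))  # should be close to the true effect 2.0
```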

[LG-62] TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model ICML2024

Link: https://arxiv.org/abs/2409.02322
Authors: Defu Cao,Wen Ye,Yizhou Zhang,Yan Liu
Keywords-EN: Large Language Models, building foundation models, time series, foundation models, recent advances
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 23 Pages, 6 Figures, 11 Tables. First present at ICML 2024 Workshop on Foundation Models in the Wild

Click to view abstract

Abstract:With recent advances in building foundation models for texts and video data, there is a surge of interest in foundation models for time series. A family of models have been developed, utilizing a temporal auto-regressive generative Transformer architecture, whose effectiveness has been proven in Large Language Models. While the empirical results are promising, almost all existing time series foundation models have only been tested on well-curated ``benchmark'' datasets very similar to texts. However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the uni-directional nature of temporally auto-regressive decoding limits the incorporation of domain knowledge, such as physical laws expressed as partial differential equations (PDEs). To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that employs a denoising diffusion paradigm instead of temporal auto-regressive generation. TimeDiT leverages the Transformer architecture to capture temporal dependencies and employs diffusion processes to generate high-quality candidate samples without imposing stringent assumptions on the target distribution via novel masking schemes and a channel alignment strategy. Furthermore, we propose a finetuning-free model editing strategy that allows the seamless integration of external knowledge during the sampling process without updating any model parameters. Extensive experiments conducted on a variety of tasks such as forecasting, imputation, and anomaly detection, demonstrate the effectiveness of TimeDiT.

[LG-63] On the Benefits of Memory for Modeling Time-Dependent PDEs

Link: https://arxiv.org/abs/2409.02313
Authors: Ricardo Buitrago Ruiz,Tanya Marwah,Albert Gu,Andrej Risteski
Keywords-EN: partial differential equations, traditional numerical methods, solving partial differential, Data-driven techniques, differential equations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Data-driven techniques have emerged as a promising alternative to traditional numerical methods for solving partial differential equations (PDEs). These techniques frequently offer a better trade-off between computational cost and accuracy for many PDE families of interest. For time-dependent PDEs, existing methodologies typically treat PDEs as Markovian systems, i.e., the evolution of the system only depends on the ``current state’', and not the past states. However, distortion of the input signals – e.g., due to discretization or low-pass filtering – can render the evolution of the distorted signals non-Markovian. In this work, motivated by the Mori-Zwanzig theory of model reduction, we investigate the impact of architectures with memory for modeling PDEs: that is, when past states are explicitly used to predict the future. We introduce Memory Neural Operator (MemNO), a network based on the recent SSM architectures and Fourier Neural Operator (FNO). We empirically demonstrate on a variety of PDE families of interest that when the input is given on a low-resolution grid, MemNO significantly outperforms the baselines without memory, achieving more than 6 times less error on unseen PDEs. Via a combination of theory and experiments, we show that the effect of memory is particularly significant when the solution of the PDE has high frequency Fourier components (e.g., low-viscosity fluid dynamics), and it also increases robustness to observation noise.

[LG-64] A Lesion-aware Edge-based Graph Neural Network for Predicting Language Ability in Patients with Post-stroke Aphasia MICCAI2024

Link: https://arxiv.org/abs/2409.02303
Authors: Zijian Chen,Maria Varkanitsa,Prakash Ishwar,Janusz Konrad,Margrit Betke,Swathi Kiran,Archana Venkataraman
Keywords-EN: graph neural network, lesion-aware graph neural, Human Connectome Project, neural network, resting-state fMRI
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*Comments: Accepted at MICCAI 2024 International Workshop on Machine Learning in Clinical Neuroimaging (MLCN)

Click to view abstract

Abstract:We propose a lesion-aware graph neural network (LEGNet) to predict language ability from resting-state fMRI (rs-fMRI) connectivity in patients with post-stroke aphasia. Our model integrates three components: an edge-based learning module that encodes functional connectivity between brain regions, a lesion encoding module, and a subgraph learning module that leverages functional similarities for prediction. We use synthetic data derived from the Human Connectome Project (HCP) for hyperparameter tuning and model pretraining. We then evaluate the performance using repeated 10-fold cross-validation on an in-house neuroimaging dataset of post-stroke aphasia. Our results demonstrate that LEGNet outperforms baseline deep learning methods in predicting language ability. LEGNet also exhibits superior generalization ability when tested on a second in-house dataset that was acquired under a slightly different neuroimaging protocol. Taken together, the results of this study highlight the potential of LEGNet in effectively learning the relationships between rs-fMRI connectivity and language ability in a patient cohort with brain lesions for improved post-stroke aphasia evaluation.

[LG-65] K-Origins: Better Colour Quantification for Neural Networks

Link: https://arxiv.org/abs/2409.02281
Authors: Lewis Mason,Mark Martinez
Keywords-EN: neural network layer, network layer designed, textbf, layer designed, improve image-based network
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 16 pages, 13 figures, 1 table

Click to view abstract

Abstract:K-Origins is a neural network layer designed to improve image-based network performances when learning colour, or intensities, is beneficial. Over 250 encoder-decoder convolutional networks are trained and tested on 16-bit synthetic data, demonstrating that K-Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal-to-noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K-Origins generates output features from the input features, \mathbf{X}, by the equation \mathbf{Y}_k = \mathbf{X} - \mathbf{J}\cdot w_k for each trainable parameter w_k, where \mathbf{J} is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K-Origins, we can achieve better semantic network performance.
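
The layer equation above translates almost directly into code; a possible PyTorch rendering (our sketch, with k illustrative origins applied channel-wise) is:

```python
import torch

class KOrigins(torch.nn.Module):
    def __init__(self, k=4):
        super().__init__()
        # k trainable scalar origins w_k
        self.w = torch.nn.Parameter(torch.linspace(0.0, 1.0, k))

    def forward(self, x):
        # x: (batch, C, H, W); J is a matrix of ones, so J * w_k just
        # subtracts the scalar w_k from every element: Y_k = X - w_k
        shifted = [x - wk for wk in self.w]
        return torch.cat(shifted, dim=1)   # (batch, C*k, H, W)

layer = KOrigins(k=4)
print(layer(torch.rand(2, 1, 8, 8)).shape)  # torch.Size([2, 4, 8, 8])
```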

[LG-66] Reinforcement Learning-enabled Satellite Constellation Reconfiguration and Retasking for Mission-Critical Applications

Link: https://arxiv.org/abs/2409.02270
Authors: Hassan El Alami,Danda B. Rawat
Keywords-EN: reduced operational costs, increasing user demands, rapidly advancing due, satellite constellation applications, user demands
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*Comments: Accepted for publication in the IEEE Military Communications Conference (IEEE MILCOM 2024)

Click to view abstract

Abstract:The development of satellite constellation applications is rapidly advancing due to increasing user demands, reduced operational costs, and technological advancements. However, a significant gap in the existing literature concerns reconfiguration and retasking issues within satellite constellations, which is the primary focus of our research. In this work, we critically assess the impact of satellite failures on constellation performance and the associated task requirements. To facilitate this analysis, we introduce a system modeling approach for GPS satellite constellations, enabling an investigation into performance dynamics and task distribution strategies, particularly in scenarios where satellite failures occur during mission-critical operations. Additionally, we introduce reinforcement learning (RL) techniques, specifically Q-learning, Policy Gradient, Deep Q-Network (DQN), and Proximal Policy Optimization (PPO), for managing satellite constellations, addressing the challenges posed by reconfiguration and retasking following satellite failures. Our results demonstrate that DQN and PPO achieve effective outcomes in terms of average rewards, task completion rates, and response times.
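
Of the RL techniques named above, tabular Q-learning is the simplest to sketch; the toy update loop below shows the core temporal-difference rule, with the environment stubbed out since the paper's setting is a full constellation simulator.

```python
import numpy as np

n_states, n_actions = 10, 4     # toy placeholders for constellation states/retasking actions
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    """Toy environment stub: returns (next_state, reward)."""
    return rng.integers(n_states), rng.normal()

s = 0
for _ in range(10000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # TD update
    s = s_next
```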

[LG-67] LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement

Link: https://arxiv.org/abs/2409.02266
Authors: Arnav Jain,Jasmer Singh Sanjotra,Harshvardhan Choudhary,Krish Agrawal,Rupal Shah,Rohan Jha,M. Sajid,Amir Hussain,M. Tanveer
Keywords-EN: propose long short, long short term, short term memory, term memory speech, memory speech enhancement
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:In this paper, we propose long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 1.32 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at this https URL.

[LG-68] Optimal L-Systems for Stochastic L-system Inference Problems

Link: https://arxiv.org/abs/2409.02259
Authors: Ali Lotfi,Ian McQuillan
Keywords-EN: optimal stochastic L-system, stochastic L-system capable, stochastic L-system, stochastic Lindenmayer-system, L-system
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
*Comments:

Click to view abstract

Abstract:This paper presents two novel theorems that address two open problems in stochastic Lindenmayer-system (L-system) inference, specifically focusing on the construction of an optimal stochastic L-system capable of generating a given sequence of strings. The first theorem delineates a method for crafting a stochastic L-system that maximizes the likelihood of producing a given sequence of words through a singular derivation. Furthermore, the second theorem determines the stochastic L-system with the highest probability of producing a given sequence of words with multiple possible derivations. From these, we introduce an algorithm to infer an optimal stochastic L-system from a given sequence. This algorithm incorporates sophisticated optimization techniques, such as interior point methods, ensuring the production of a stochastically optimal L-system suitable for generating the given sequence. This allows for the use of stochastic L-systems as models for machine learning, using only positive data for training.

[LG-69] MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs

Link: https://arxiv.org/abs/2409.02257
Authors: Saeid Asgari Taghanaki,Aliasgahr Khani,Amir Khasahmadi
Keywords-EN: Existing benchmarks, challenging evaluation frameworks, large language models, increasingly struggle, large language
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order reasoning in LLMs. By incorporating questions with multiple correct answers across diverse domains, MMLU-Pro+ tests LLMs’ ability to engage in complex reasoning and resist simplistic problem-solving strategies. Our results show that MMLU-Pro+ maintains MMLU-Pro’s difficulty while providing a more rigorous test of model discrimination, particularly in multi-correct answer scenarios. We introduce novel metrics like shortcut selection ratio and correct pair identification ratio, offering deeper insights into model behavior and anchoring bias. Evaluations of five state-of-the-art LLMs reveal significant performance gaps, highlighting variations in reasoning abilities and bias susceptibility. We release the dataset and evaluation codes at this https URL.

[LG-70] NoiseAttack: An Evasive Sample-Specific Multi-Targeted Backdoor Attack Through White Gaussian Noise

Link: https://arxiv.org/abs/2409.02251
Authors: Abdullah Arafat Miah,Kaan Icer,Resit Sendag,Yu Bi
Keywords-EN: deep learning development, Backdoor attacks pose, learning development, backdoor attack, pose a significant
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Backdoor attacks pose a significant threat when using third-party data for deep learning development. In these attacks, data can be manipulated to cause a trained model to behave improperly when a specific trigger pattern is applied, providing the adversary with unauthorized advantages. While most existing works focus on designing visible or invisible trigger patterns to poison the victim class, they typically result in a single targeted class upon the success of the backdoor attack, meaning that the victim class can only be converted to another class based on the adversary's predefined value. In this paper, we address this issue by introducing a novel sample-specific multi-targeted backdoor attack, namely NoiseAttack. Specifically, we adopt White Gaussian Noise (WGN) with various Power Spectral Densities (PSD) as our underlying triggers, coupled with a unique training strategy to execute the backdoor attack. This work is the first of its kind to launch a vision backdoor attack with the intent to generate multiple targeted classes with minimal input configuration. Furthermore, our extensive experimental results demonstrate that NoiseAttack can achieve a high attack success rate against popular network architectures and datasets, as well as bypass state-of-the-art backdoor detection methods. Our source code and experiments are available at this https URL.

[LG-71] Multi-Agent Reinforcement Learning for Joint Police Patrol and Dispatch

Link: https://arxiv.org/abs/2409.02246
Authors: Matthew Repasky,He Wang,Yao Xie
Keywords-EN: serve emergency incidents, performing preventive patrol, Police patrol units, performing preventive, dispatched to serve
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:Police patrol units need to split their time between performing preventive patrol and being dispatched to serve emergency incidents. In the existing literature, patrol and dispatch decisions are often studied separately. We consider joint optimization of these two decisions to improve police operations efficiency and reduce response time to emergency calls. Methodology/results: We propose a novel method for jointly optimizing multi-agent patrol and dispatch to learn policies yielding rapid response times. Our method treats each patroller as an independent Q-learner (agent) with a shared deep Q-network that represents the state-action values. The dispatching decisions are chosen using mixed-integer programming and value function approximation from combinatorial action spaces. We demonstrate that this heterogeneous multi-agent reinforcement learning approach is capable of learning joint policies that outperform those optimized for patrol or dispatch alone. Managerial Implications: Policies jointly optimized for patrol and dispatch can lead to more effective service while targeting demonstrably flexible objectives, such as those encouraging efficiency and equity in response.

[LG-72] FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation WWW INTERSPEECH2024 FAST

Link: https://arxiv.org/abs/2409.02245
Authors: Takuhiro Kaneko,Hirokazu Kameoka,Kou Tanaka,Yuto Kondo
Keywords-EN: Diffusion-based voice conversion, voice conversion, speaker similarity, VoiceGrad have attracted, attracted interest
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
*Comments: Accepted to Interspeech 2024. Project page: this https URL

Click to view abstract

Abstract:Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of the multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the ability of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at this https URL.

[LG-73] Unforgettable Generalization in Language Models

Link: https://arxiv.org/abs/2409.02228
Authors: Eric Zhang,Leshem Chosen,Jacob Andreas
Keywords-EN: training set, training, forgetting, tasks, LMs
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
*Comments: 18 pages, 9 figures, published in First Conference on Language Modeling 2024

Click to view abstract

Abstract:When language models (LMs) are trained to forget (or ``unlearn'') a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the ``training'' set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the ``forgotten'' task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.

[LG-74] Collaboratively Learning Federated Models from Noisy Decentralized Data

Link: https://arxiv.org/abs/2409.02189
Authors: Haoyuan Li,Mathias Funk,Nezihe Merve Gürel,Aaqib Saeed
Keywords-EN: collaboratively training machine, training machine learning, edge devices, collaboratively training, training machine
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Federated learning (FL) has emerged as a prominent method for collaboratively training machine learning models using local data from edge devices, all while keeping data decentralized. However, accounting for the quality of data contributed by local clients remains a critical challenge in FL, as local data are often susceptible to corruption by various forms of noise and perturbations, which compromise the aggregation process and lead to a subpar global model. In this work, we focus on addressing the problem of noisy data in the input space, an under-explored area compared to label noise. We propose a comprehensive assessment of client input in the gradient space, inspired by the distinct disparity observed between the density of gradient norm distributions of models trained on noisy and clean input data. Based on this observation, we introduce a straightforward yet effective approach to identify clients with low-quality data at the initial stage of FL. Furthermore, we propose a noise-aware FL aggregation method, namely Federated Noise-Sifting (FedNS), which can be used as a plug-in approach in conjunction with widely used FL strategies. Our extensive evaluation on diverse benchmark datasets under different federated settings demonstrates the efficacy of FedNS. Our method effortlessly integrates with existing FL strategies, enhancing the global model’s performance by up to 13.68% in IID and 15.85% in non-IID settings when learning from noisy decentralized data.
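
The gradient-norm screening idea can be sketched as a simple outlier filter over per-client gradient norms; this is an illustrative simplification of FedNS, not the authors' exact scoring rule.

```python
import numpy as np

def sift_clients(grad_norms, z_thresh=2.0):
    """grad_norms: per-client gradient norms collected in the first rounds.
    Returns aggregation weights that suppress outlier (suspected noisy)
    clients."""
    g = np.asarray(grad_norms, dtype=float)
    z = np.abs(g - g.mean()) / (g.std() + 1e-8)   # standardized deviation
    weights = np.where(z < z_thresh, 1.0, 0.0)    # drop suspected noisy clients
    return weights / weights.sum()

print(sift_clients([1.1, 0.9, 1.0, 4.2, 1.05]))  # 4th client suppressed
```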

[LG-75] Optimal Power Grid Operations with Foundation Models

Link: https://arxiv.org/abs/2409.02148
Authors: Alban Puech,Jonas Weiss,Thomas Brunschwiler,Hendrik F. Hamann
Keywords-EN: renewable energy sources, integrating numerous distributed, demands integrating numerous, energy transition, renewable energy
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:The energy transition, crucial for tackling the climate crisis, demands integrating numerous distributed, renewable energy sources into existing grids. Along with climate change and consumer behavioral changes, this leads to changes and variability in generation and load patterns, introducing significant complexity and uncertainty into grid planning and operations. While the industry has already started to exploit AI to overcome computational challenges of established grid simulation tools, we propose the use of AI Foundation Models (FMs) and advances in Graph Neural Networks to efficiently exploit poorly available grid data for different downstream tasks, enhancing grid operations. For capturing the grid’s underlying physics, we believe that building a self-supervised model learning the power flow dynamics is a critical first step towards developing an FM for the power grid. We show how this approach may close the gap between the industry needs and current grid analysis capabilities, to bring the industry closer to optimal grid operation and planning.

[LG-76] Brain-Inspired Online Adaptation for Remote Sensing with Spiking Neural Network

Link: https://arxiv.org/abs/2409.02146
Authors: Dexin Duan,Peilin liu,Fei Wen
Keywords-EN: unmanned aerial vehicles, On-device computing, remote sensing, deep network-based perception, aerial vehicles
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*Comments:

Click to view abstract

Abstract:On-device computing, or edge computing, is becoming increasingly important for remote sensing, particularly in applications like deep network-based perception on on-orbit satellites and unmanned aerial vehicles (UAVs). In these scenarios, two brain-like capabilities are crucial for remote sensing models: (1) high energy efficiency, allowing the model to operate on edge devices with limited computing resources, and (2) online adaptation, enabling the model to quickly adapt to environmental variations, weather changes, and sensor drift. This work addresses these needs by proposing an online adaptation framework based on spiking neural networks (SNNs) for remote sensing. Starting with a pretrained SNN model, we design an efficient, unsupervised online adaptation algorithm, which adopts an approximation of the BPTT algorithm and only involves forward-in-time computation that significantly reduces the computational complexity of SNN adaptation learning. In addition, we propose an adaptive activation scaling scheme to boost online SNN adaptation performance, particularly in low time-steps. Furthermore, for the more challenging remote sensing detection task, we propose a confidence-based instance weighting scheme, which substantially improves adaptation performance in the detection task. To our knowledge, this work is the first to address the online adaptation of SNNs. Extensive experiments on seven benchmark datasets across classification, segmentation, and detection tasks demonstrate that our proposed method significantly outperforms existing domain adaptation and domain generalization approaches under varying weather conditions. The proposed method enables energy-efficient and fast online adaptation on edge devices, and has much potential in applications such as remote perception on on-orbit satellites and UAVs.

[LG-77] A Multimodal Object-level Contrast Learning Method for Cancer Survival Risk Prediction

Link: https://arxiv.org/abs/2409.02145
Authors: Zekang Yang,Hong Liu,Xiangdong Wang
Keywords-EN: Computer-aided cancer survival, survival risk, survival risk prediction, cancer survival risk, Computer-aided cancer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Computer-aided cancer survival risk prediction plays an important role in the timely treatment of patients. This is a challenging weakly supervised ordinal regression task associated with multiple clinical factors such as pathological images, genomic data, etc. In this paper, we propose a new training method, multimodal object-level contrast learning, for cancer survival risk prediction. First, we construct contrast learning pairs based on the survival risk relationship among the samples in the training sample set. Then we introduce the object-level contrast learning method to train the survival risk predictor. We further extend it to the multimodal scenario by applying cross-modal contrast. Considering the heterogeneity of pathological images and genomics data, we construct a multimodal survival risk predictor employing attention-based and self-normalizing neural networks, respectively. Finally, the survival risk predictor trained by our proposed method outperforms state-of-the-art methods on two public multimodal cancer datasets for survival risk prediction.

[LG-78] Efficient and Scalable Estimation of Tool Representations in Vector Space

Link: https://arxiv.org/abs/2409.02141
Authors: Suhong Moon,Siddharth Jha,Lutfi Eren Erdogan,Sehoon Kim,Woosang Lim,Kurt Keutzer,Amir Gholami
Keywords-EN: execute complex tasks, external information sources, Recent advancements, tool retrieval, complex tasks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments:

Click to view abstract

Abstract:Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window of LLMs presents challenges when a large number of tools are available, necessitating efficient methods to manage prompt length and maintain accuracy. Existing approaches, such as fine-tuning LLMs or leveraging their reasoning capabilities, either require frequent retraining or incur significant latency overhead. A more efficient solution involves training smaller models to retrieve the most relevant tools for a given query, although this requires high quality, domain-specific data. To address these challenges, we present a novel framework for generating synthetic data for tool retrieval applications and an efficient data-driven tool retrieval strategy using small encoder models. Empowered by LLMs, we create ToolBank, a new tool retrieval dataset that reflects real human user usages. For tool retrieval methodologies, we propose novel approaches: (1) Tool2Vec: usage-driven tool embedding generation for tool retrieval, (2) ToolRefiner: a staged retrieval method that iteratively improves the quality of retrieved tools, and (3) MLC: framing tool retrieval as a multi-label classification problem. With these new methods, we achieve improvements of up to 27.28 in Recall@K on the ToolBench dataset and 30.5 in Recall@K on ToolBank. Additionally, we present further experimental results to rigorously validate our methods. Our code is available at this https URL

[LG-79] Self-Supervised Learning for Identifying Defects in Sewer Footage ICML2024

Link: https://arxiv.org/abs/2409.02140
Authors: Daniel Otero,Rafael Mateus
Keywords-EN: expensive modern investments, modern investments requiring, investments requiring time-intensive, requiring time-intensive manual, Sewerage infrastructure
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Poster at the LatinX in AI Workshop @ ICML 2024

Click to view abstract

Abstract:Sewerage infrastructure is among the most expensive modern investments requiring time-intensive manual inspections by qualified personnel. Our study addresses the need for automated solutions without relying on large amounts of labeled data. We propose a novel application of Self-Supervised Learning (SSL) for sewer inspection that offers a scalable and cost-effective solution for defect detection. We achieve competitive results with a model that is at least 5 times smaller than other approaches found in the literature and obtain competitive performance with 10% of the available data when training with a larger architecture. Our findings highlight the potential of SSL to revolutionize sewer maintenance in resource-limited settings.

[LG-80] The Role of Transformer Models in Advancing Blockchain Technology: A Systematic Review

Link: https://arxiv.org/abs/2409.02139
Authors: Tianxu Liu,Yanbin Wang,Jianguo Sun,Ye Tian,Yanyu Huang,Tao Xue,Peiyue Li,Yiwei Liu
Keywords-EN: architectures, have shown unprecedented, technology rapidly evolves, scalability grows, Transformer models, shown unprecedented potential, deep learning architectures, have
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Click to view abstract

Abstract:As blockchain technology rapidly evolves, the demand for enhanced efficiency, security, and scalability grows. Transformer models, as powerful deep learning architectures, have shown unprecedented potential in addressing various blockchain challenges. However, a systematic review of Transformer applications in blockchain is lacking. This paper aims to fill this research gap by surveying over 200 relevant papers, comprehensively reviewing practical cases and research progress of Transformers in blockchain applications. Our survey covers key areas including anomaly detection, smart contract security analysis, cryptocurrency prediction and trend analysis, and code summary generation. To clearly articulate the advancements of Transformers across various blockchain domains, we adopt a domain-oriented classification system, organizing and introducing representative methods based on major challenges in current blockchain research. For each research domain, we first introduce its background and objectives, then review previous representative methods and analyze their limitations, and finally introduce the advancements brought by Transformer models. Furthermore, we explore the challenges of utilizing Transformers, such as data privacy, model complexity, and real-time processing requirements. Finally, this article proposes future research directions, emphasizing the importance of exploring the Transformer architecture in depth to adapt it to specific blockchain applications, and discusses its potential role in promoting the development of blockchain technology. This review aims to provide new perspectives and a research foundation for the integrated development of blockchain technology and machine learning, supporting further innovation and application expansion of blockchain technology.

[LG-81] A Financial Time Series Denoiser Based on Diffusion Model

Link: https://arxiv.org/abs/2409.02138
Authors: Zhuohan Wang,Carmine Ventre
Keywords-EN: ultimately decision making, posing significant challenges, Financial time series, accurate data interpretation, exhibit low
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
*Comments:

Click to view abstract

Abstract:Financial time series often exhibit low signal-to-noise ratio, posing significant challenges for accurate data interpretation and prediction and ultimately decision making. Generative models have gained attention as powerful tools for simulating and predicting intricate data patterns, with the diffusion model emerging as a particularly effective method. This paper introduces a novel approach utilizing the diffusion model as a denoiser for financial time series in order to improve data predictability and trading performance. By leveraging the forward and reverse processes of the conditional diffusion model to add and remove noise progressively, we reconstruct original data from noisy inputs. Our extensive experiments demonstrate that diffusion model-based denoised time series significantly enhance the performance on downstream future return classification tasks. Moreover, trading signals derived from the denoised data yield more profitable trades with fewer transactions, thereby minimizing transaction costs and increasing overall trading efficiency. Finally, we show that by using classifiers trained on denoised time series, we can recognize the noising state of the market and obtain excess return.

[LG-82] Reward Augmentation in Reinforcement Learning for Testing Distributed Systems

Link: https://arxiv.org/abs/2409.02137
Authors: Andrea Borgarelli,Constantin Enea,Rupak Majumdar,Srinidhi Nagendra
Keywords-EN: popular internet services, popular distributed protocol, distributed protocol implementations, downtimes in popular, popular internet
Subjects: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*Comments:

Click to view abstract

Abstract:Bugs in popular distributed protocol implementations have been the source of many downtimes in popular internet services. We describe a randomized testing approach for distributed protocol implementations based on reinforcement learning. Since the natural reward structure is very sparse, the key to successful exploration in reinforcement learning is reward augmentation. We show two different techniques that build on one another. First, we provide a decaying exploration bonus based on the discovery of new states – the reward decays as the same state is visited multiple times. The exploration bonus captures the intuition from coverage-guided fuzzing of prioritizing new coverage points; in contrast to other schemes, we show that taking the maximum of the bonus and the Q-value leads to more effective exploration. Second, we provide waypoints to the algorithm as a sequence of predicates that capture interesting semantic scenarios. Waypoints exploit designer insight about the protocol and guide the exploration to ``interesting’’ parts of the state space. Our reward structure ensures that new episodes can reliably get to deep interesting states even without execution caching. We have implemented our algorithm in Go. Our evaluation on three large benchmarks (RedisRaft, Etcd, and RSL) shows that our algorithm can significantly outperform baseline approaches in terms of coverage and bug finding.
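
Both reward-augmentation ingredients are easy to sketch in isolation; the snippet below shows a decaying new-state bonus combined with Q-values via max, plus ordered waypoint predicates. It is an abstraction of the description above, not the authors' Go implementation.

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def exploration_bonus(state_key):
    """Decaying bonus: large on first visit, shrinking on repeats."""
    visit_counts[state_key] += 1
    return 1.0 / visit_counts[state_key]

def augmented_value(q_value, state_key):
    # Taking the max (rather than the sum) keeps the bonus from being
    # washed out once Q-values grow, per the description above.
    return max(q_value, exploration_bonus(state_key))

def waypoint_reward(state, waypoints, progress):
    """waypoints: ordered list of predicates over protocol states;
    progress: index of the next waypoint to reach."""
    if progress < len(waypoints) and waypoints[progress](state):
        return 1.0, progress + 1              # reached the next waypoint
    return 0.0, progress
```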

[LG-83] Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data

Link: https://arxiv.org/abs/2409.02136
Authors: Mohammadreza Ghaffarzadeh-Esfahani,Mahdi Ghaffarzadeh-Esfahani,Arian Salahi-Niri,Hossein Toreyhi,Zahra Atf,Amirali Mohsenzadeh-Kermani,Mahshad Sarikhani,Zohreh Tajabadi,Fatemeh Shojaeian,Mohammad Hassan Bagheri,Aydin Feyzi,Mohammadamin Tarighatpayma,Narges Gazmeh,Fateme Heydari,Hossein Afshar,Amirreza Allahgholipour,Farid Alimardani,Ameneh Salehi,Naghmeh Asadimanesh,Mohammad Amin Khalafi,Hadis Shabanipour,Ali Moradi,Sajjad Hossein Zadeh,Omid Yazdani,Romina Esbati,Moozhan Maleki,Danial Samiei Nasr,Amirali Soheili,Hossein Majlesi,Saba Shahsavan,Alireza Soheilipour,Nooshin Goudarzi,Erfan Taherifard,Hamidreza Hatamabadi,Jamil S Samaan,Thomas Savage,Ankit Sakhuja,Ali Soroush,Girish Nadkarni,Ilad Alavi Darazam,Mohamad Amin Pourhoseingholi,Seyed Amir Ahmad Safavi-Naini
Keywords-EN: large language models, CML models, study aimed, aimed to evaluate, evaluate and compare
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: Code is available at: this https URL and this https URL . The datasets are available from the corresponding author on reasonable request (sdamirsa@ymail.com)

Click to view abstract

Abstract:Background: This study aimed to evaluate and compare the performance of classical machine learning models (CMLs) and large language models (LLMs) in predicting mortality associated with COVID-19 by utilizing a high-dimensional tabular dataset. Materials and Methods: We analyzed data from 9,134 COVID-19 patients collected across four hospitals. Seven CML models, including XGBoost and random forest (RF), were trained and evaluated. The structured data was converted into text for zero-shot classification by eight LLMs, including GPT-4 and Mistral-7b. Additionally, Mistral-7b was fine-tuned using the QLoRA approach to enhance its predictive capabilities. Results: Among the CML models, XGBoost and RF achieved the highest accuracy, with F1 scores of 0.87 for internal validation and 0.83 for external validation. In the LLM category, GPT-4 was the top performer with an F1 score of 0.43. Fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, resulting in an F1 score of 0.74, which was stable during external validation. Conclusion: While LLMs show moderate performance in zero-shot classification, fine-tuning can significantly enhance their effectiveness, potentially aligning them closer to CML models. However, CMLs still outperform LLMs in high-dimensional tabular data tasks.
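
As a hedged illustration of the "structured data converted into text" step, here is a minimal way one might serialize a tabular record into a zero-shot prompt; the field names and prompt wording are invented for the example and are not the study's actual template.

```python
def row_to_prompt(row: dict) -> str:
    """Serialize one tabular patient record into a zero-shot classification prompt."""
    features = "; ".join(f"{k}: {v}" for k, v in row.items())
    return (
        "Given the following COVID-19 patient record, answer only "
        f"'deceased' or 'survived'.\nRecord: {features}\nAnswer:"
    )

# Hypothetical record; real inputs would carry the dataset's high-dimensional features.
print(row_to_prompt({"age": 71, "oxygen_saturation": 88, "diabetes": "yes"}))
```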

[LG-84] Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling

链接: https://arxiv.org/abs/2409.02135
作者: Yuma Ichikawa,Yamato Arai
关键词-EN: learn problem-specific heuristics, manually crafted heuristics, automatically learn problem-specific, problem-specific heuristics, crafted heuristics
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:Learning-based methods have gained attention as general-purpose solvers because they can automatically learn problem-specific heuristics, reducing the need for manually crafted heuristics. However, these methods often face challenges with scalability. To address these issues, the improved Sampling algorithm for Combinatorial Optimization (iSCO) using discrete Langevin dynamics has been proposed, demonstrating better performance than several learning-based solvers. This study proposes a different approach that integrates gradient-based update through continuous relaxation, combined with Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function from a simple convex form, where half-integral solutions dominate, to the original objective function, where the variables are restricted to 0 or 1. Furthermore, we incorporate parallel run communication leveraging GPUs, enhancing exploration capabilities and accelerating convergence. Numerical experiments demonstrate that our approach is a competitive general-purpose solver, achieving comparable performance to iSCO across various benchmark problems. Notably, our method exhibits superior trade-offs between speed and solution quality for large-scale instances compared to iSCO, commercial solvers, and specialized algorithms.
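
A toy sketch of the core mechanism — gradient updates on a continuous relaxation with an annealed binarization penalty — is shown below. The random quadratic objective, the penalty form x(1-x), and the linear schedule are illustrative assumptions, not the paper's exact QQA formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)); W = (W + W.T) / 2  # random symmetric QUBO matrix
x = np.full(32, 0.5)                                  # relaxed variables in [0, 1]

for step in range(2000):
    lam = 5.0 * step / 2000                 # anneal the binarization penalty upward
    # gradient of 0.5 * x^T W x + lam * sum_i x_i (1 - x_i)
    grad = W @ x + lam * (1.0 - 2.0 * x)
    x = np.clip(x - 0.05 * grad, 0.0, 1.0)

solution = (x > 0.5).astype(int)            # round the nearly-binary relaxation
```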

[LG-85] Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

链接: https://arxiv.org/abs/2409.02134
作者: Samer Francy,Raghubir Singh
关键词-EN: image classification tasks, work evaluates, image classification, classification tasks, compression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work evaluates compression techniques on ConvNeXt models in image classification tasks using the CIFAR-10 dataset. Structured pruning, unstructured pruning, and dynamic quantization methods are evaluated to reduce model size and computational complexity while maintaining accuracy. The experiments, conducted on cloud-based platforms and an edge device, assess the performance of these techniques. Results show significant reductions in model size, with up to 75% reduction achieved using structured pruning techniques. Additionally, dynamic quantization achieves a reduction of up to 95% in the number of parameters. Fine-tuned models exhibit improved compression performance, indicating the benefits of pre-training in conjunction with compression techniques. Unstructured pruning methods reveal trends in accuracy and compression, with limited reductions in computational complexity. The combination of OTOV3 pruning and dynamic quantization further enhances compression performance, resulting in an 89.7% reduction in size, a 95% reduction in the number of parameters and MACs, and a 3.8% increase in accuracy. The deployment of the final compressed model on an edge device demonstrates high accuracy (92.5%) and low inference time (20 ms), validating the effectiveness of compression techniques for real-world edge computing applications.
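
The two main compression steps can be reproduced with standard PyTorch utilities; a minimal sketch follows, assuming a torchvision ConvNeXt-Tiny and a 50% pruning ratio (both assumptions — the paper evaluates several settings and additionally uses OTOV3).

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import convnext_tiny

model = convnext_tiny(num_classes=10)  # CIFAR-10 has 10 classes

# Structured pruning: drop 50% of output channels of each conv by L2 norm.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.5, n=2, dim=0)
        prune.remove(module, "weight")  # bake the mask into the weights

# Dynamic quantization: int8 weights for the linear layers at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```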

[LG-86] From Predictive Importance to Causality: Which Machine Learning Model Reflects Reality?

链接: https://arxiv.org/abs/2409.02130
作者: Muhammad Arbab Arshad,Pallavi Kandanur,Saurabh Sonawani
关键词-EN: Ames Housing Dataset, analyzes the Ames, Dataset using CatBoost, Ames Housing, Housing Dataset
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study analyzes the Ames Housing Dataset using CatBoost and LightGBM models to explore feature importance and causal relationships in housing price prediction. We examine the correlation between SHAP values and EconML predictions, achieving high accuracy in price forecasting. Our analysis reveals a moderate Spearman rank correlation of 0.48 between SHAP-based feature importance and causally significant features, highlighting the complexity of aligning predictive modeling with causal understanding in housing market analysis. Through extensive causal analysis, including heterogeneity exploration and policy tree interpretation, we provide insights into how specific features like porches impact housing prices across various scenarios. This work underscores the need for integrated approaches that combine predictive power with causal insights in real estate valuation, offering valuable guidance for stakeholders in the industry.
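
The headline comparison — Spearman correlation between mean |SHAP| importance and causal effect estimates — is easy to sketch. Below, synthetic data stands in for the Ames Housing Dataset and `causal_effects` for the EconML estimates; only the mechanics are meant to match the abstract.

```python
import numpy as np
import shap
from scipy.stats import spearmanr
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

model = CatBoostRegressor(iterations=200, verbose=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
shap_importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP| per feature

causal_effects = rng.uniform(size=8)                # placeholder causal ranking
rho, _ = spearmanr(shap_importance, causal_effects)
print(f"Spearman rank correlation: {rho:.2f}")      # the paper reports ~0.48
```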

[LG-87] The Application of Artificial Neural Network Model to Predicting the Acid Mine Drainage from Long-Term Lab Scale Kinetic Test

链接: https://arxiv.org/abs/2409.02128
作者: Muhammad Sonny Abfertiawan,Muchammad Daniyal Kautsar,Faiz Hasan,Yoseph Palinggi,Kris Pranoto
关键词-EN: Acid mine drainage, coal mining industry, lab-scale kinetic tests, lab-scale kinetic, common environmental problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The 7th Environmental Technology and Management Conference (ETMC 2023)

点击查看摘要

Abstract:Acid mine drainage (AMD) is one of the common environmental problems in the coal mining industry, formed by the oxidation of sulfide minerals in the overburden or waste rock. The prediction of acid generation through AMD is important for overburden management and planning the post-mining land use. One of the methods used to predict AMD is a lab-scale kinetic test to determine the rate of acid formation over time using representative samples in the field. However, this test requires a lengthy procedure and a large amount of chemical reagents, leading to inefficient cost. On the other hand, there is potential for machine learning to learn the pattern behind the lab-scale kinetic test data. This study describes an approach to use artificial neural network (ANN) modeling to predict the result from lab-scale kinetic tests. Various ANN models are used based on 83 weeks of lab-scale kinetic test experiments with 100% potential acid-forming rock. The models cover the monitoring of pH, ORP, conductivity, TDS, sulfate, and heavy metals (Fe and Mn). The overall Nash-Sutcliffe Efficiency (NSE) obtained in this study was 0.99 on training and validation data, indicating a strong correlation and accurate prediction compared to the actual lab-scale kinetic test data. This shows the ANN's ability to learn patterns, trends, and seasonality from past data for accurate forecasting, thereby highlighting its significant contribution to solving AMD problems. This research is also expected to establish the foundation for a new approach to predicting AMD that is time-efficient, accurate, and cost-effective in future applications.
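
The Nash-Sutcliffe Efficiency used to evaluate the ANN is a standard metric; a minimal implementation:

```python
import numpy as np

def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - SSE / variance of observations; 1.0 is a perfect fit."""
    observed, simulated = np.asarray(observed), np.asarray(simulated)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum(
        (observed - observed.mean()) ** 2
    )

# e.g. weekly pH predictions against lab-scale kinetic test measurements
print(nash_sutcliffe_efficiency([6.1, 5.8, 5.2], [6.0, 5.9, 5.1]))
```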

[LG-88] Enabling Trustworthy Federated Learning in Industrial IoT: Bridging the Gap Between Interpretability and Robustness

链接: https://arxiv.org/abs/2409.02127
作者: Senthil Kumar Jagatheesaperumal,Mohamed Rahouti,Ali Alfatemi,Nasir Ghani,Vu Khanh Quy,Abdellah Chehri
关键词-EN: Federated Learning, keeping data localized, machine learning, allowing collaborative model, collaborative model training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Federated Learning (FL) represents a paradigm shift in machine learning, allowing collaborative model training while keeping data localized. This approach is particularly pertinent in the Industrial Internet of Things (IIoT) context, where data privacy, security, and efficient utilization of distributed resources are paramount. The essence of FL in IIoT lies in its ability to learn from diverse, distributed data sources without requiring central data storage, thus enhancing privacy and reducing communication overheads. However, despite its potential, several challenges impede the widespread adoption of FL in IIoT, notably in ensuring interpretability and robustness. This article focuses on enabling trustworthy FL in IIoT by bridging the gap between interpretability and robustness, which is crucial for enhancing trust, improving decision-making, and ensuring compliance with regulations. Moreover, the design strategies summarized in this article ensure that FL systems in IIoT are transparent and reliable, vital in industrial settings where decisions have significant safety and economic impacts. The case studies in the IIoT environment driven by trustworthy FL models are provided, wherein the practical insights of trustworthy communications between IIoT systems and their end users are highlighted.

[LG-89] Detecting Homeomorphic 3-manifolds via Graph Neural Networks

链接: https://arxiv.org/abs/2409.02126
作者: Craig Lawrie,Lorenzo Mansi
关键词-EN: superconformal field theories, quantum field theories, supersymmetric quantum field, field theories, Neural Network techniques
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 9 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Motivated by the enumeration of the BPS spectra of certain 3d \mathcal{N}=2 supersymmetric quantum field theories, obtained from the compactification of 6d superconformal field theories on three-manifolds, we study the homeomorphism problem for a class of graph-manifolds using Graph Neural Network techniques. Utilizing the JSJ decomposition, a unique representation via a plumbing graph is extracted from a graph-manifold. Homeomorphic graph-manifolds are related via a sequence of von Neumann moves on this graph; the algorithmic application of these moves can determine if two graphs correspond to homeomorphic graph-manifolds in super-polynomial time. However, by employing Graph Neural Networks (GNNs), the same problem can be addressed, at the cost of accuracy, in polynomial time. We build a dataset composed of pairs of plumbing graphs, together with a hidden label encoding whether the pair is homeomorphic. We train and benchmark a variety of network architectures within a supervised learning setting by testing different combinations of two convolutional layers (GEN, GCN, GAT, NNConv), followed by an aggregation layer and a classification layer. We discuss the strengths and weaknesses of the different GNNs for this homeomorphism problem.
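
A minimal PyTorch Geometric sketch of the pairwise setup — embed each plumbing graph with a shared GNN encoder, then classify the concatenated embeddings as homeomorphic or not. The two-layer GCN and hidden sizes are assumptions; the paper benchmarks GEN, GCN, GAT, and NNConv variants.

```python
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

class PlumbingGraphEncoder(torch.nn.Module):
    def __init__(self, in_dim=1, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        return global_mean_pool(h, batch)  # one embedding per graph

class PairClassifier(torch.nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = PlumbingGraphEncoder(hidden=hidden)
        self.head = torch.nn.Linear(2 * hidden, 2)  # homeomorphic or not

    def forward(self, g1, g2):
        z1 = self.encoder(g1.x, g1.edge_index, g1.batch)
        z2 = self.encoder(g2.x, g2.edge_index, g2.batch)
        return self.head(torch.cat([z1, z2], dim=-1))
```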

[LG-90] TrajWeaver: Trajectory Recovery with State Propagation Diffusion Model

链接: https://arxiv.org/abs/2409.02124
作者: Jinming Wang,Hai Wang,Hongkai Wen,Geyong Min,Man Luo
关键词-EN: large amount, vehicles and goods, proliferation of location-aware, goods flow, location-aware devices
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: First submission, extended to 10 pages include ref

点击查看摘要

Abstract:With the proliferation of location-aware devices, large amounts of trajectories have been generated as agents such as people, vehicles and goods flow around the urban environment. These raw trajectories, typically collected from various sources such as GPS in cars, personal mobile devices, and public transport, are often sparse and fragmented due to limited sampling rates, infrastructure coverage and data loss. In this context, trajectory recovery aims to reconstruct such sparse raw trajectories into their dense and continuous counterparts, so that fine-grained movement of agents across space and time can be captured faithfully. Existing trajectory recovery approaches typically rely on the prior knowledge of travel mode or motion patterns, and often fail in densely populated urban areas where accurate maps are absent. In this paper, we present a new recovery framework called TrajWeaver based on probabilistic diffusion models, which is able to recover dense and refined trajectories from the sparse raw ones, conditioned on various auxiliary features such as Areas of Interest along the way, user identity and waybill information. The core of TrajWeaver is a novel State Propagation Diffusion Model (SPDM), which introduces a new state propagation mechanism on top of the standard diffusion models, so that knowledge computed in earlier diffusion steps can be reused later, improving the recovery performance while reducing the number of steps needed. Extensive experiments show that the proposed TrajWeaver can recover from raw trajectories of various lengths, sparsity levels and heterogeneous travel modes, and outperform the state-of-the-art baselines significantly in recovery accuracy. Our code is available at: https://anonymous.4open.science/r/TrajWeaver/

[LG-91] PuYun: Medium-Range Global Weather Forecasting Using Large Kernel Attention Convolutional Networks

链接: https://arxiv.org/abs/2409.02123
作者: Shengchen Zhu,Yiming Chen,Peiying Yu,Xiang Qu,Yuxiao Zhou,Yiming Ma,Zhizhan Zhao,Yukai Liu,Hao Mi,Bin Wang
关键词-EN: mitigating weather-related impacts, weather-related impacts, essential for understanding, understanding and mitigating, mitigating weather-related
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Accurate weather forecasting is essential for understanding and mitigating weather-related impacts. In this paper, we present PuYun, an autoregressive cascade model that leverages large kernel attention convolutional networks. The model’s design inherently supports extended weather prediction horizons while broadening the effective receptive field. The integration of large kernel attention mechanisms within the convolutional layers enhances the model’s capacity to capture fine-grained spatial details, thereby improving its predictive accuracy for meteorological phenomena. We introduce PuYun, comprising PuYun-Short for 0-5 day forecasts and PuYun-Medium for 5-10 day predictions. This approach enhances the accuracy of 10-day weather forecasting. Through evaluation, we demonstrate that PuYun-Short alone surpasses the performance of both GraphCast and FuXi-Short in generating accurate 10-day forecasts. Specifically, on the 10th day, PuYun-Short reduces the RMSE for Z500 to 720 m^2/s^2, compared to 732 m^2/s^2 for GraphCast and 740 m^2/s^2 for FuXi-Short. Additionally, the RMSE for T2M is reduced to 2.60 K, compared to 2.63 K for GraphCast and 2.65 K for FuXi-Short. Furthermore, when employing a cascaded approach by integrating PuYun-Short and PuYun-Medium, our method achieves superior results compared to the combined performance of FuXi-Short and FuXi-Medium. On the 10th day, the RMSE for Z500 is further reduced to 638 m^2/s^2, compared to 641 m^2/s^2 for FuXi. These findings underscore the effectiveness of our model ensemble in advancing medium-range weather prediction. Our training code and model will be open-sourced.

[LG-92] Deep Knowledge-Infusion For Explainable Depression Detection

链接: https://arxiv.org/abs/2409.02122
作者: Sumit Dalal,Sarika Jain,Mayank Dave
关键词-EN: Discovering individuals depression, Discovering individuals, increasingly important, social media, depression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 13 pages, 2 figures

点击查看摘要

Abstract:Discovering individuals’ depression on social media has become increasingly important. Researchers employed ML/DL or lexicon-based methods for automated depression detection. Lexicon-based methods, explainable and easy to implement, match words from user posts in a depression dictionary without considering contexts. While the DL models can leverage contextual information, their black-box nature limits their adoption in the domain. Though surrogate models like LIME and SHAP can produce explanations for DL models, the explanations are suitable for the developer and of limited use to the end user. We propose a Knowledge-infused Neural Network (KiNN) incorporating domain-specific knowledge from the DepressionFeature ontology (DFO) in a neural network to endow the model with user-level explainability regarding concepts and processes the clinician understands. Further, commonsense knowledge from the Commonsense Transformer (COMET) trained on ATOMIC is also infused to consider the generic emotional aspects of user posts in depression detection. The model is evaluated on three expertly curated datasets related to depression. We observed the model to have a statistically significant (p < 0.1) boost in performance over the best domain-specific model, MentalBERT, across CLEF e-Risk (25% MCC increase, 12% F1 increase). A similar trend is observed across the PRIMATE dataset, where the proposed model performed better than MentalBERT (2.5% MCC increase, 19% F1 increase). The observations confirm the generated explanations to be informative for MHPs compared to post hoc model explanations. Results demonstrated that the user-level explainability of KiNN also surpasses the performance of baseline models and can provide explanations where other baselines fall short. Infusing the domain and commonsense knowledge in KiNN enhances the ability of models like GPT-3.5 to generate application-relevant explanations.

[LG-93] CoRA: Optimizing Low-Rank Adaptation with Common Subspace of Large Language Models

链接: https://arxiv.org/abs/2409.02119
作者: Xiaojun Xiao,Sen Shen,Qiming Bao,Hongfei Rong,Kairui Liu,Zhongsheng Wang,Jiamou Liu
关键词-EN: large language models, fine-tuning large language, conserving computational resources, fine-tuning large models, constraints is crucial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:In fine-tuning large language models (LLMs), conserving computational resources while maintaining effectiveness and improving outcomes within the same computational constraints is crucial. The Low-Rank Adaptation (LoRA) strategy balances efficiency and performance in fine-tuning large models by reducing the number of trainable parameters and computational costs. However, current advancements in LoRA might be focused on its fine-tuning methodologies, with not as much exploration as might be expected into further compression of LoRA. Since most of LoRA’s parameters might still be superfluous, this may lead to unnecessary wastage of computational resources. In this paper, we propose CoRA: leveraging shared knowledge to optimize LoRA training by substituting its matrix B with a common subspace from large models. Our two-fold method includes (1) freezing the substitute matrix B to halve parameters while training matrix A for specific tasks and (2) using the substitute matrix B as an enhanced initial state for the original matrix B, achieving improved results with the same parameters. Our experiments show that the first approach achieves the same efficacy as the original LoRA fine-tuning while being more efficient than halving parameters. At the same time, the second approach has some improvements compared to LoRA’s original fine-tuning performance. They generally attest to the effectiveness of our work.
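
A minimal sketch of approach (1) as described in the abstract: a LoRA-style layer whose matrix B is a shared, frozen subspace while only the task-specific matrix A trains. Dimensions and initialization are illustrative assumptions.

```python
import torch

class CoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, shared_B: torch.Tensor):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained weight
        rank = shared_B.shape[1]
        self.B = torch.nn.Parameter(shared_B, requires_grad=False)        # common subspace, (out, rank)
        self.A = torch.nn.Parameter(torch.zeros(rank, base.in_features))  # task-specific, trainable

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T  # W x + B A x

layer = CoRALinear(torch.nn.Linear(768, 768), shared_B=torch.randn(768, 8))
```

Freezing B leaves only A trainable, halving the adapter's parameter count relative to standard LoRA, which is what approach (1) exploits.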

[LG-94] TSO: Self-Training with Scaled Preference Optimization

链接: https://arxiv.org/abs/2409.02118
作者: Kaihui Chen,Hao Yi,Qingyang Li,Tianyu Qi,Yulan Hu,Fuzheng Zhang,Yong Liu
关键词-EN: Enhancing the conformity, ongoing research challenge, Preference, large language models, Direct Preference Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Enhancing the conformity of large language models (LLMs) to human preferences remains an ongoing research challenge. Recently, offline approaches such as Direct Preference Optimization (DPO) have gained prominence as attractive options, offering improvements that are simple, efficient, and stable without interactions with reward models. However, these offline preference optimization methods highly rely on the quality of pairwise preference samples. Meanwhile, numerous iterative methods require additional training of reward models to select positive and negative samples from the model’s own generated responses for preference learning. Furthermore, as LLMs’ capabilities advance, it is quite challenging to continuously construct high-quality positive and negative preference instances from the model’s outputs due to the lack of diversity. To tackle these challenges, we propose TSO, or Self-Training with Scaled Preference Optimization, a framework for preference optimization that conducts self-training preference learning without training an additional reward model. TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses. Furthermore, TSO introduces corrections for model preference errors through human and AI feedback. Finally, TSO adopts iterative and dual clip reward strategies to update the reference model and its responses, adaptively adjusting preference data and balancing the optimization process. Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks, providing practical insight into preference data construction and model training strategies in the alignment domain.

[LG-95] Deep Neural Implicit Representation of Accessibility for Multi-Axis Manufacturing

链接: https://arxiv.org/abs/2409.02115
作者: George P. Harabin,Morad Behandish,Amir Mirzendehdel
关键词-EN: collision measure field, collision measure, moving objects, stationary objects, unified with fixtures
类目: Graphics (cs.GR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Special Issue on symposium on Solid and Physical Modeling (SPM 2023)

点击查看摘要

Abstract:One of the main concerns in design and process planning for multi-axis additive and subtractive manufacturing is collision avoidance between moving objects (e.g., tool assemblies) and stationary objects (e.g., a part unified with fixtures). The collision measure for various pairs of relative rigid translations and rotations between the two pointsets can be conceptualized by a compactly supported scalar field over the 6D non-Euclidean configuration space. Explicit representation and computation of this field is costly in both time and space. If we fix O(m) sparsely sampled rotations (e.g., tool orientations), computation of the collision measure field as a convolution of indicator functions of the 3D pointsets over a uniform grid (i.e., voxelized geometry) of resolution O(n^3) via fast Fourier transforms (FFTs) scales as O(mn^3 \log n) in time and O(mn^3) in space. In this paper, we develop an implicit representation of the collision measure field via deep neural networks (DNNs). We show that our approach is able to accurately interpolate the collision measure from a sparse sampling of rotations, and can represent the collision measure field with a small memory footprint. Moreover, we show that this representation can be efficiently updated through fine-tuning to more efficiently train the network on multi-resolution data, as well as accommodate incremental changes to the geometry (such as might occur in iterative processes such as topology optimization of the part subject to CNC tool accessibility constraints).
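
The FFT baseline the abstract refers to can be sketched directly: for one fixed rotation, the collision measure over all translations is a cross-correlation of the two voxelized indicator functions. `rotate` below is a placeholder for an actual voxel-rotation routine.

```python
import numpy as np

def collision_measure(part: np.ndarray, tool: np.ndarray) -> np.ndarray:
    """Overlap volume for every relative translation, via FFT cross-correlation.

    This is the O(n^3 log n)-per-rotation baseline that the paper's implicit
    DNN representation is designed to replace.
    """
    shape = [p + q - 1 for p, q in zip(part.shape, tool.shape)]
    # cross-correlation = convolution with the reflected tool indicator
    f = np.fft.rfftn(part, shape) * np.fft.rfftn(tool[::-1, ::-1, ::-1], shape)
    return np.fft.irfftn(f, shape)

# one field per sampled rotation R (rotate() is hypothetical):
# fields = [collision_measure(part, rotate(tool, R)) for R in rotations]
```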

[LG-96] Tiny-Toxic-Detector: A compact transformer-based model for toxic content detection

链接: https://arxiv.org/abs/2409.02114
作者: Michiel Kamphuis
关键词-EN: toxic content detection, compact transformer-based model, transformer-based model designed, compact transformer-based, designed for toxic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: 6 pages

点击查看摘要

Abstract:This paper presents Tiny-toxic-detector, a compact transformer-based model designed for toxic content detection. Despite having only 2.1 million parameters, Tiny-toxic-detector achieves competitive performance on benchmark datasets, with 90.97% accuracy on ToxiGen and 86.98% accuracy on the Jigsaw dataset, rivaling models over 50 times its size. This efficiency enables deployment in resource-constrained environments, addressing the need for effective content moderation tools that balance performance with computational efficiency. The model architecture features 4 transformer encoder layers, each with 2 attention heads, an embedding dimension of 64, and a feedforward dimension of 128. Trained on both public and private datasets, Tiny-toxic-detector demonstrates the potential of efficient, task-specific models for addressing online toxicity. The paper covers the model architecture, training process, performance benchmarks, and limitations, underscoring its suitability for applications such as social media monitoring and content moderation. By achieving results comparable to much larger models while significantly reducing computational demands, Tiny-toxic-detector represents progress toward more sustainable and scalable AI-driven content moderation solutions.
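
The stated architecture is small enough to write down directly; a sketch with standard PyTorch modules follows. The vocabulary size and mean pooling are assumptions — with a ~30k vocabulary, the embedding table plus 4 encoder layers lands near the stated 2.1 million parameters.

```python
import torch
from torch import nn

class TinyToxicDetector(nn.Module):
    def __init__(self, vocab_size=30522):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)  # embedding dimension 64
        layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=2, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(64, 1)         # toxic / non-toxic logit

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.classifier(h.mean(dim=1))      # mean-pool over tokens

logits = TinyToxicDetector()(torch.randint(0, 30522, (2, 128)))
```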

[LG-97] Toward Large-scale Spiking Neural Networks: A Comprehensive Survey and Future Directions

链接: https://arxiv.org/abs/2409.02111
作者: Yangfan Hu,Qian Zheng,Guoqi Li,Huajin Tang,Gang Pan
关键词-EN: revolutionized artificial intelligence, achieving remarkable progress, natural language processing, spiking neural networks, deep spiking neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has revolutionized artificial intelligence (AI), achieving remarkable progress in fields such as computer vision, speech recognition, and natural language processing. Moreover, the recent success of large language models (LLMs) has fueled a surge in research on large-scale neural networks. However, the escalating demand for computing resources and energy consumption has prompted the search for energy-efficient alternatives. Inspired by the human brain, spiking neural networks (SNNs) promise energy-efficient computation with event-driven spikes. To provide future directions toward building energy-efficient large SNN models, we present a survey of existing methods for developing deep spiking neural networks, with a focus on emerging Spiking Transformers. Our main contributions are as follows: (1) an overview of learning methods for deep spiking neural networks, categorized by ANN-to-SNN conversion and direct training with surrogate gradients; (2) an overview of network architectures for deep spiking neural networks, categorized by deep convolutional neural networks (DCNNs) and Transformer architecture; and (3) a comprehensive comparison of state-of-the-art deep SNNs with a focus on emerging Spiking Transformers. We then further discuss and outline future directions toward large-scale SNNs.

[LG-98] Regional data-driven weather modeling with a global stretched-grid

链接: https://arxiv.org/abs/2409.02891
作者: Thomas Nils Nipen,Håvard Homleid Haugen,Magnus Sikora Ingstad,Even Marius Nordhagen,Aram Farhad Shafiq Salihi,Paulina Tedesco,Ivar Ambjørn Seierstad,Jørn Kristiansen,Simon Lang,Mihai Alexe,Jesper Dramsch,Baudouin Raoult,Gert Mertes,Matthew Chantry
关键词-EN: Artificial Intelligence Forecasting, Intelligence Forecasting System, weather forecasting applications, regional weather forecasting, applications is presented
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A data-driven model (DDM) suitable for regional weather forecasting applications is presented. The model extends the Artificial Intelligence Forecasting System by introducing a stretched-grid architecture that dedicates higher resolution over a regional area of interest and maintains a lower resolution elsewhere on the globe. The model is based on graph neural networks, which naturally affords arbitrary multi-resolution grid configurations. The model is applied to short-range weather prediction for the Nordics, producing forecasts at 2.5 km spatial and 6 h temporal resolution. The model is pre-trained on 43 years of global ERA5 data at 31 km resolution and is further refined using 3.3 years of 2.5 km resolution operational analyses from the MetCoOp Ensemble Prediction System (MEPS). The performance of the model is evaluated using surface observations from measurement stations across Norway and is compared to short-range weather forecasts from MEPS. The DDM outperforms both the control run and the ensemble mean of MEPS for 2 m temperature. The model also produces competitive precipitation and wind speed forecasts, but is shown to underestimate extreme events.

[LG-99] Regularized Multi-output Gaussian Convolution Process with Domain Adaptation

链接: https://arxiv.org/abs/2409.02778
作者: Wang Xinming,Wang Chao,Song Xuan,Kirby Levi,Wu Jianguo
关键词-EN: Multi-output Gaussian process, Multi-output Gaussian, attracting increasing attention, model multiple outputs, transfer learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Multi-output Gaussian process (MGP) has been attracting increasing attention as a transfer learning method to model multiple outputs. Despite its high flexibility and generality, MGP still faces two critical challenges when applied to transfer learning. The first one is negative transfer, which occurs when there exists no shared information among the outputs. The second challenge is the input domain inconsistency, which is commonly studied in transfer learning yet not explored in MGP. In this paper, we propose a regularized MGP modeling framework with domain adaptation to overcome these challenges. More specifically, a sparse covariance matrix of MGP is proposed by using convolution process, where penalization terms are added to adaptively select the most informative outputs for knowledge transfer. To deal with the domain inconsistency, a domain adaptation method is proposed by marginalizing inconsistent features and expanding missing features to align the input domains among different outputs. Statistical properties of the proposed method are provided to guarantee the performance practically and asymptotically. The proposed framework outperforms state-of-the-art benchmarks in comprehensive simulation studies and one real case study of a ceramic manufacturing process. The results demonstrate the effectiveness of our method in dealing with both the negative transfer and the domain inconsistency.

[LG-100] Convolutional Neural Networks for Automated Cellular Automaton Classification

链接: https://arxiv.org/abs/2409.02740
作者: Michiel Rollier,Aisling J. Daly,Jan M. Baetens
关键词-EN: elementary cellular automata, cellular automata, non-elementary CAs, convolutional neural network, CAs
类目: Cellular Automata and Lattice Gases (nlin.CG); Machine Learning (cs.LG)
*备注: 19 pages, 12 figures, book chapter

点击查看摘要

Abstract:The emergent dynamics in spacetime diagrams of cellular automata (CAs) is often organised by means of a number of behavioural classes. Whilst classification of elementary CAs is feasible and well-studied, non-elementary CAs are generally too diverse and numerous to exhaustively classify manually. In this chapter we treat the spacetime diagram as a digital image, and implement simple computer vision techniques to perform an automated classification of elementary cellular automata into the five Li-Packard classes. In particular, we present a supervised learning task to a convolutional neural network, in such a way that it may be generalised to non-elementary CAs. If we want to do so, we must divert the algorithm’s focus away from the underlying ‘microscopic’ local updates. We first show that previously developed deep learning approaches have in fact been trained to identify the local update rule, rather than directly focus on the mesoscopic patterns that are associated with the particular behavioural classes. By means of a well-argued neural network design, as well as a number of data augmentation techniques, we then present a convolutional neural network that performs nearly perfectly at identifying the behavioural class, without necessarily first identifying the underlying microscopic dynamics.

[LG-101] How does the brain compute with probabilities?

链接: https://arxiv.org/abs/2409.02709
作者: Ralf M. Haefner,Jeff Beck,Cristina Savin,Mehrdad Salmasi,Xaq Pitkow
关键词-EN: Generative Adversarial Collaboration, represent probability distributions, Adversarial Collaboration, Generative Adversarial, activity represent probability
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 35 pages, 8 figures

点击查看摘要

Abstract:This perspective piece is the result of a Generative Adversarial Collaboration (GAC) tackling the question 'How does neural activity represent probability distributions?'. We have addressed three major obstacles to progress on answering this question: first, we provide a unified language for defining competing hypotheses. Second, we explain the fundamentals of three prominent proposals for probabilistic computations – Probabilistic Population Codes (PPCs), Distributed Distributional Codes (DDCs), and Neural Sampling Codes (NSCs) – and describe similarities and differences in that common language. Third, we review key empirical data previously taken as evidence for at least one of these proposals, and describe how it may or may not be explainable by alternative proposals. Finally, we describe some key challenges in resolving the debate, and propose potential directions to address them through a combination of theory and experiments.

[LG-102] Neural timescales from a computational perspective

链接: https://arxiv.org/abs/2409.02684
作者: Roxana Zeraati,Anna Levina,Jakob H. Macke,Richard Gao
关键词-EN: timescales reflect information, experimental observations suggest, neural timescales reflect, neural timescales, Timescales
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 4 figures, 2 boxes

点击查看摘要

Abstract:Timescales of neural activity are diverse across and within brain areas, and experimental observations suggest that neural timescales reflect information in dynamic environments. However, these observations do not specify how neural timescales are shaped, nor whether particular timescales are necessary for neural computations and brain function. Here, we take a complementary perspective and synthesize three directions where computational methods can distill the broad set of empirical observations into quantitative and testable theories: We review (i) how data analysis methods allow us to capture different timescales of neural dynamics across different recording modalities, (ii) how computational models provide a mechanistic explanation for the emergence of diverse timescales, and (iii) how task-optimized models in machine learning uncover the functional relevance of neural timescales. This integrative computational approach, combined with empirical findings, would provide a more holistic understanding of how neural timescales capture the relationship between brain structure, dynamics, and behavior.

[LG-103] Introduction to Machine Learning

链接: https://arxiv.org/abs/2409.02668
作者: Laurent Younes
关键词-EN: mathematical foundations, book, methods, chapter, book introduces
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: textbook

点击查看摘要

Abstract:This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and probability; it also introduces some measure theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, or neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describes the theory of graphical models, an introduction to variational methods for models with latent variables, and deep-learning based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.

[LG-104] Conformal Prediction in Dynamic Biological Systems

链接: https://arxiv.org/abs/2409.02644
作者: Alberto Portela,Julio R. Banga,Marcos Matabuena
关键词-EN: computational model predictions, process of systematically, systematically determining, determining and characterizing, characterizing the degree
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Uncertainty quantification (UQ) is the process of systematically determining and characterizing the degree of confidence in computational model predictions. In the context of systems biology, especially with dynamic models, UQ is crucial because it addresses the challenges posed by nonlinearity and parameter sensitivity, allowing us to properly understand and extrapolate the behavior of complex biological systems. Here, we focus on dynamic models represented by deterministic nonlinear ordinary differential equations. Many current UQ approaches in this field rely on Bayesian statistical methods. While powerful, these methods often require strong prior specifications and make parametric assumptions that may not always hold in biological systems. Additionally, these methods face challenges in domains where sample sizes are limited, and statistical inference becomes constrained, with computational speed being a bottleneck in large models of biological systems. As an alternative, we propose the use of conformal inference methods, introducing two novel algorithms that, in some instances, offer non-asymptotic guarantees, enhancing robustness and scalability across various applications. We demonstrate the efficacy of our proposed algorithms through several scenarios, highlighting their advantages over traditional Bayesian approaches. The proposed methods show promising results for diverse biological data structures and scenarios, offering a general framework to quantify uncertainty for dynamic models of biological systems.The software for the methodology and the reproduction of the results is available at this https URL.
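
For orientation, the classical split conformal recipe that such methods build on can be written in a few lines; the paper's two algorithms add structure for dynamic ODE models on top of ideas like this.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    """Distribution-free intervals with finite-sample coverage >= 1 - alpha."""
    residuals = np.abs(y_cal - model.predict(X_cal))        # calibration scores
    n = len(residuals)
    q = np.quantile(residuals, np.ceil((n + 1) * (1 - alpha)) / n)
    pred = model.predict(X_new)
    return pred - q, pred + q                               # lower, upper bounds
```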

[LG-105] Demographic parity in regression and classification within the unawareness framework

链接: https://arxiv.org/abs/2409.02471
作者: Vincent Divol(ENSAE Paris),Solenne Gaucher(ENSAE Paris, FAIRPLAY)
关键词-EN: optimal fair regression, fair regression function, fair regression, extending existing results, optimal fair
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the theoretical foundations of fair regression under the constraint of demographic parity within the unawareness framework, where disparate treatment is prohibited, extending existing results where such treatment is permitted. Specifically, we aim to characterize the optimal fair regression function when minimizing the quadratic loss. Our results reveal that this function is given by the solution to a barycenter problem with optimal transport costs. Additionally, we study the connection between optimal fair cost-sensitive classification, and optimal fair regression. We demonstrate that nestedness of the decision sets of the classifiers is both necessary and sufficient to establish a form of equivalence between classification and regression. Under this nestedness assumption, the optimal classifiers can be derived by applying thresholds to the optimal fair regression function; conversely, the optimal fair regression function is characterized by the family of cost-sensitive classifiers.

[LG-106] Transfer-based Adversarial Poisoning Attacks for Online (MIMO-)Deep Receivers

链接: https://arxiv.org/abs/2409.02430
作者: Kunze Wu,Weiheng Jiang,Dusit Niyato,Yinghuan Li,Chuang Luo
关键词-EN: attracted extensive attention, complex channel environments, ensuring reliable communication, deep neural networks, attracted extensive
类目: ignal Processing (eess.SP); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 14 figures

点击查看摘要

Abstract:Recently, the design of wireless receivers using deep neural networks (DNNs), known as deep receivers, has attracted extensive attention for ensuring reliable communication in complex channel environments. To adapt quickly to dynamic channels, online learning has been adopted to update the weights of deep receivers with over-the-air data (e.g., pilots). However, the fragility of neural models and the openness of wireless channels expose these systems to malicious attacks. To this end, understanding these attack methods is essential for robust receiver design. In this paper, we propose a transfer-based adversarial poisoning attack method for online receivers. Without knowledge of the attack target, adversarial perturbations are injected to the pilots, poisoning the online deep receiver and impairing its ability to adapt to dynamic channels and nonlinear effects. In particular, our attack method targets Deep Soft Interference Cancellation (DeepSIC)[1] using online learning. As a classical model-driven deep receiver, DeepSIC incorporates wireless domain knowledge into its architecture. This integration allows it to adapt efficiently to time-varying channels with only a small number of pilots, achieving optimal performance in a multi-input and multi-output (MIMO) scenario. The deep receiver in this scenario has a number of applications in the field of wireless communication, which motivates our study of the attack methods targeting it. Specifically, we demonstrate the effectiveness of our attack in simulations on synthetic linear, synthetic nonlinear, static, and COST 2100 channels. Simulation results indicate that the proposed poisoning attack significantly reduces the performance of online receivers in rapidly changing scenarios.
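
The abstract's pilot-poisoning step resembles a gradient-sign perturbation on the over-the-air pilots; the following is a heavily simplified single-step sketch against a surrogate receiver (the paper's transfer-based attack does not assume access to the true target, and its exact procedure differs).

```python
import torch

def poison_pilots(surrogate_receiver, pilots, labels, loss_fn, epsilon=0.05):
    """Perturb pilot symbols in the direction that increases the surrogate's loss."""
    pilots = pilots.clone().detach().requires_grad_(True)
    loss = loss_fn(surrogate_receiver(pilots), labels)
    loss.backward()
    # bounded perturbation, transferred to the (unknown) online deep receiver
    return (pilots + epsilon * pilots.grad.sign()).detach()
```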

[LG-107] Machine Learning Applications to Computational Plasma Physics and Reduced-Order Plasma Modeling: A Perspective

链接: https://arxiv.org/abs/2409.02349
作者: Farbod Faraji,Maryam Reza
关键词-EN: Machine learning, augmenting domain knowledge, explainable science, plasma physics, broad spectrum
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 42 pages, 20 figures

点击查看摘要

Abstract:Machine learning (ML) provides a broad spectrum of tools and architectures that enable the transformation of data from simulations and experiments into useful and explainable science, thereby augmenting domain knowledge. Furthermore, ML-enhanced numerical modelling can revamp scientific computing for real-world complex engineering systems, creating unique opportunities to examine the operation of the technologies in detail and automate their optimization and control. In recent years, ML applications have seen significant growth across various scientific domains, particularly in fluid mechanics, where ML has shown great promise in enhancing computational modeling of fluid flows. In contrast, ML applications in numerical plasma physics research remain relatively limited in scope and extent. Despite this, the close relationship between fluid mechanics and plasma physics presents a valuable opportunity to create a roadmap for transferring ML advances in fluid flow modeling to computational plasma physics. This Perspective aims to outline such a roadmap. We begin by discussing some general fundamental aspects of ML, including the various categories of ML algorithms and the different types of problems that can be solved with the help of ML. With regard to each problem type, we then present specific examples from the use of ML in computational fluid dynamics, reviewing several insightful prior efforts. We also review recent ML applications in plasma physics for each problem type. The paper discusses promising future directions and development pathways for ML in plasma modelling within the different application areas. Additionally, we point out prominent challenges that must be addressed to realize ML’s full potential in computational plasma physics, including the need for cost-effective high-fidelity simulation tools for extensive data generation.

[LG-108] Optimal sampling for least-squares approximation

链接: https://arxiv.org/abs/2409.02342
作者: Ben Adcock
关键词-EN: Least-squares approximation, important methods, methods for recovering, recovering an unknown, approximation
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Least-squares approximation is one of the most important methods for recovering an unknown function from data. While in many applications the data is fixed, in many others there is substantial freedom to choose where to sample. In this paper, we review recent progress on optimal sampling for (weighted) least-squares approximation in arbitrary linear spaces. We introduce the Christoffel function as a key quantity in the analysis of (weighted) least-squares approximation from random samples, then show how it can be used to construct sampling strategies that possess near-optimal sample complexity: namely, the number of samples scales log-linearly in n, the dimension of the approximation space. We discuss a series of variations, extensions and further topics, and throughout highlight connections to approximation theory, machine learning, information-based complexity and numerical linear algebra. Finally, motivated by various contemporary applications, we consider a generalization of the classical setting where the samples need not be pointwise samples of a scalar-valued function, and the approximation space need not be linear. We show that even in this significantly more general setting suitable generalizations of the Christoffel function still determine the sample complexity. This provides a unified procedure for designing improved sampling strategies for general recovery problems. This article is largely self-contained, and intended to be accessible to nonspecialists.
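
A concrete instance of the Christoffel-function recipe on [-1, 1] with the Legendre basis, as a sketch: sample proportionally to the inverse Christoffel function k(x) = \sum_i \phi_i(x)^2 and reweight by n / k(x). The discrete grid and the 5n sample budget are illustrative choices.

```python
import numpy as np
from numpy.polynomial import legendre

n = 10                                   # dimension of the approximation space
xs = np.linspace(-1.0, 1.0, 10_000)      # fine grid standing in for the domain

phi = np.stack([
    np.sqrt((2 * i + 1) / 2) * legendre.legval(xs, [0] * i + [1])  # orthonormal P_i
    for i in range(n)
])
k = (phi ** 2).sum(axis=0)               # inverse Christoffel function

rng = np.random.default_rng(0)
idx = rng.choice(len(xs), size=5 * n, p=k / k.sum())  # draw x proportional to k(x)
weights = n / k[idx]                     # weights for the weighted least squares
```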

[LG-109] Generative Principal Component Regression via Variational Inference

链接: https://arxiv.org/abs/2409.02327
作者: Austin Talbot,Corey J Keller,David E Carlson,Alex V Kotlar
关键词-EN: manipulate complex systems, modify specific outcomes, complex systems, far-reaching implications, psychiatric disorders
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ability to manipulate complex systems, such as the brain, to modify specific outcomes has far-reaching implications, particularly in the treatment of psychiatric disorders. One approach to designing appropriate manipulations is to target key features of predictive models. While generative latent variable models, such as probabilistic principal component analysis (PPCA), is a powerful tool for identifying targets, they struggle incorporating information relevant to low-variance outcomes into the latent space. When stimulation targets are designed on the latent space in such a scenario, the intervention can be suboptimal with minimal efficacy. To address this problem, we develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces such information is represented in the latent space. The novel objective can be used with linear models, such as PPCA, which we refer to as generative principal component regression (gPCR). We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs. As part of these simulations, we develop a metric for detecting when relevant information is not properly incorporated into the loadings. We then show in two neural datasets related to stress and social behavior in which gPCR dramatically outperforms PCR in predictive performance and that SVAEs exhibit low incorporation of relevant information into the loadings. Overall, this work suggests that our method significantly improves target selection for manipulation using latent variable models over competitor inference schemes.

[LG-110] QID2: An Image-Conditioned Diffusion Model for Q-space Up-sampling of DWI Data MICCAI2024

链接: https://arxiv.org/abs/2409.02309
作者: Zijian Chen,Jueqi Wang,Archana Venkataraman
关键词-EN: angular resolution DWI, low angular resolution, angular resolution, angular resolution acquisition, resolution DWI data
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI. Zijian Chen and Jueqi Wang contributed equally to this work

点击查看摘要

Abstract:We propose an image-conditioned diffusion model to estimate high angular resolution diffusion weighted imaging (DWI) from a low angular resolution acquisition. Our model, which we call QID^2, takes as input a set of low angular resolution DWI data and uses this information to estimate the DWI data associated with a target gradient direction. We leverage a U-Net architecture with cross-attention to preserve the positional information of the reference images, further guiding the target image generation. We train and evaluate QID^2 on single-shell DWI samples curated from the Human Connectome Project (HCP) dataset. Specifically, we sub-sample the HCP gradient directions to produce low angular resolution DWI data and train QID^2 to reconstruct the missing high angular resolution samples. We compare QID^2 with two state-of-the-art GAN models. Our results demonstrate that QID^2 not only achieves higher-quality generated images, but it consistently outperforms the GAN models in downstream tensor estimation across multiple metrics. Taken together, this study highlights the potential of diffusion models, and QID^2 in particular, for q-space up-sampling, thus offering a promising toolkit for clinical and research applications.

[LG-111] SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration

链接: https://arxiv.org/abs/2409.02231
作者: Joseph M. Cavanagh,Kunyang Sun,Andrew Gritsevskiy,Dorian Bagni,Thomas D. Bannister,Teresa Head-Gordon
关键词-EN: Large Language Model, Chemical Language Model, chemical SMILES string, SMILES string data, Language Model
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Here we show that a Large Language Model (LLM) can serve as a foundation model for a Chemical Language Model (CLM) which performs at or above the level of CLMs trained solely on chemical SMILES string data. Using supervised fine-tuning (SFT) and direct preference optimization (DPO) on the open-source Llama LLM, we demonstrate that we can train an LLM to respond to prompts such as generating molecules with properties of interest to drug development. This overall framework allows an LLM to not just be a chatbot client for chemistry and materials tasks, but can be adapted to speak more directly as a CLM which can generate molecules with user-specified properties.

[LG-112] COmoving Computer Acceleration (COCA): N-body simulations in an emulated frame of reference

链接: https://arxiv.org/abs/2409.02154
作者: Deaglan J. Bartlett,Marco Chiarenza,Ludvig Doeser,Florent Leclercq
关键词-EN: based emulation techniques, emulation errors, techniques have emerged, emulation, COmoving Computer Acceleration
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 23 pages, 13 figures

点击查看摘要

Abstract:N-body simulations are computationally expensive, so machine-learning (ML)-based emulation techniques have emerged as a way to increase their speed. Although fast, surrogate models have limited trustworthiness due to potentially substantial emulation errors that current approaches cannot correct for. To alleviate this problem, we introduce COmoving Computer Acceleration (COCA), a hybrid framework interfacing ML with an N-body simulator. The correct physical equations of motion are solved in an emulated frame of reference, so that any emulation error is corrected by design. This approach corresponds to solving for the perturbation of particle trajectories around the machine-learnt solution, which is computationally cheaper than obtaining the full solution, yet is guaranteed to converge to the truth as one increases the number of force evaluations. Although applicable to any ML algorithm and N-body simulator, this approach is assessed in the particular case of particle-mesh cosmological simulations in a frame of reference predicted by a convolutional neural network, where the time dependence is encoded as an additional input parameter to the network. COCA efficiently reduces emulation errors in particle trajectories, requiring far fewer force evaluations than running the corresponding simulation without ML. We obtain accurate final density and velocity fields for a reduced computational budget. We demonstrate that this method shows robustness when applied to examples outside the range of the training data. When compared to the direct emulation of the Lagrangian displacement field using the same training resources, COCA’s ability to correct emulation errors results in more accurate predictions. COCA makes N-body simulations cheaper by skipping unnecessary force evaluations, while still solving the correct equations of motion and correcting for emulation errors made by ML.

[LG-113] Hazardous Asteroids Classification

Link: https://arxiv.org/abs/2409.02150
Authors: Thai Duy Quy,Alvin Buana,Josh Lee,Rakha Asyrofi
Keywords-EN: future impact events, predict future impact, impact events, classify hazardous asteroids, huge impact
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*Comments: 6 pages

Click to view the abstract

Abstract:Hazardous asteroids have long been a concern for humankind, as an asteroid impact on Earth could have a huge effect on society. Monitoring these objects could help predict future impact events, but such efforts are hindered by the large number of objects that pass in the Earth’s vicinity. The aim of this project is to use machine learning and deep learning to accurately classify hazardous asteroids. A total of ten methods, consisting of five machine learning algorithms and five deep learning models, are trained and evaluated to find a suitable model for the task. We experiment on two datasets: one from Kaggle, and one we extracted from NeoWs, a daily-updated RESTful web service from NASA that provides information about near-Earth asteroids. Overall, the models are tested on two datasets with different features to find the most accurate model for the classification.
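
For readers wanting a starting point, a standard tabular baseline for this task looks like the sketch below; the file name and feature columns are illustrative assumptions, not the exact schema of the Kaggle or NeoWs data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("asteroids.csv")  # hypothetical export of the Kaggle/NeoWs data
features = ["absolute_magnitude", "estimated_diameter_max",
            "relative_velocity", "miss_distance"]  # assumed column names
X, y = df[features], df["is_hazardous"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=0)  # hazardous asteroids are rare
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```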

[LG-114] Uncertainty Quantification Using Ensemble Learning and Monte Carlo Sampling for Performance Prediction and Monitoring in Cell Culture Processes

Link: https://arxiv.org/abs/2409.02149
Authors: Thanh Tung Khuat,Robert Bassett,Ellen Otte,Bogdan Gabrys
Keywords-EN: pharmaceutical market due, monoclonal antibodies, gained prominence, market due, high specificity
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Biopharmaceutical products, particularly monoclonal antibodies (mAbs), have gained prominence in the pharmaceutical market due to their high specificity and efficacy. As these products are projected to constitute a substantial portion of global pharmaceutical sales, the application of machine learning models in mAb development and manufacturing is gaining momentum. This paper addresses the critical need for uncertainty quantification in machine learning predictions, particularly in scenarios with limited training data. Leveraging ensemble learning and Monte Carlo simulations, our proposed method generates additional input samples to enhance the robustness of the model in small training datasets. We evaluate the efficacy of our approach through two case studies: predicting antibody concentrations in advance and real-time monitoring of glucose concentrations during bioreactor runs using Raman spectra data. Our findings demonstrate the effectiveness of the proposed method in estimating the uncertainty levels associated with process performance predictions and facilitating real-time decision-making in biopharmaceutical manufacturing. This contribution not only introduces a novel approach for uncertainty quantification but also provides insights into overcoming challenges posed by small training datasets in bioprocess development. The evaluation demonstrates the effectiveness of our method in addressing key challenges related to uncertainty estimation within upstream cell cultivation, illustrating its potential impact on enhancing process control and product quality in the dynamic field of biopharmaceuticals.
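
Our reading of the general recipe, augmenting a small training set with Monte Carlo perturbations of the inputs and using ensemble spread as the uncertainty estimate, can be sketched as follows (synthetic data; not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))          # stand-in for a scarce bioprocess dataset
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=40)

models = []
for seed in range(20):                # ensemble of models on perturbed inputs
    noise = rng.normal(scale=0.05, size=X.shape)   # Monte Carlo input sampling
    models.append(GradientBoostingRegressor(random_state=seed).fit(X + noise, y))

X_new = rng.normal(size=(3, 5))
preds = np.stack([m.predict(X_new) for m in models])   # (n_models, n_points)
print("prediction:", preds.mean(axis=0))
print("uncertainty (std):", preds.std(axis=0))         # wide std = low confidence
```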

[LG-115] CMOB: Large-Scale Cancer Multi-Omics Benchmark with Open Datasets Tasks and Baselines

Link: https://arxiv.org/abs/2409.02143
Authors: Ziwei Yang,Rikuto Kotoge,Zheng Chen,Xihao Piao,Yasuko Matsubara,Yasushi Sakurai
Keywords-EN: offering incredible opportunities, advancing precision medicine, shown great potential, offering incredible, precision medicine
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Machine learning has shown great potential in the field of cancer multi-omics studies, offering incredible opportunities for advancing precision medicine. However, the challenges associated with dataset curation and task formulation pose significant hurdles, especially for researchers lacking a biomedical background. Here, we introduce CMOB, the first large-scale cancer multi-omics benchmark that integrates the TCGA platform, making data resources accessible and usable for machine learning researchers without significant preparation. To date, CMOB includes a collection of 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. CMOB provides well-processed dataset versions to support 20 meaningful tasks in four studies, with a collection of benchmarks. We also integrate CMOB with two complementary resources and various biological tools to explore broader research avenues. All resources are openly accessible with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information for various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through CMOB, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine-learning models for personalized cancer treatments. CMOB is available on GitHub (this https URL).

[LG-116] Recognition of Schrodinger cat state based on CNN

Link: https://arxiv.org/abs/2409.02132
Authors: Tao Zhang,Chaoying Zhao
Keywords-EN: coherent states, cat states, states, Schrodinger cat states, applied convolutional neural
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 6 pages, 5 figures

Click to view the abstract

Abstract:We applied convolutional neural networks to the classification of cat states and coherent states. Initially, we generated datasets of Schrodinger cat states and coherent states from nonlinear processes and preprocessed these datasets. Subsequently, we constructed both LeNet and ResNet network architectures, adjusting parameters such as convolution kernels and strides to optimal values. We then trained both LeNet and ResNet on the training sets. The loss function values indicated that ResNet performs better in classifying cat states and coherent states. Finally, we evaluated the trained models on the test sets, achieving an accuracy of 97.5% for LeNet and 100% for ResNet. Evaluating cat states and coherent states with different α demonstrated a certain degree of generalization capability. The results show that LeNet may mistakenly recognize coherent states without coherent features as cat states, while ResNet provides a feasible solution to the problem of traditional neural networks confusing cat states and coherent states.
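
For reference, the LeNet side of such a classifier takes only a few lines of PyTorch; the 32×32 single-channel input is our assumption, since the abstract does not say how the quantum states are encoded as images:

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    """LeNet-style binary classifier: cat state vs. coherent state."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.AvgPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet()
logits = model(torch.randn(8, 1, 32, 32))  # batch of 8 toy "state images"
print(logits.shape)                        # torch.Size([8, 2])
```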

[LG-117] Machine Learning Framework for High-Resolution Air Temperature Downscaling Using LiDAR-Derived Urban Morphological Features

Link: https://arxiv.org/abs/2409.02120
Authors: Fatemeh Chajaei,Hossein Bagheri
Keywords-EN: requiring computationally intensive, computationally intensive processes, air temperature, urban climate studies, Climate models lack
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Climate models lack the necessary resolution for urban climate studies, requiring computationally intensive processes to estimate high-resolution air temperatures. In contrast, data-driven approaches offer faster and more accurate air temperature downscaling. This study presents a data-driven framework for downscaling air temperature using publicly available outputs from urban climate models, specifically datasets generated by UrbClim. The proposed framework utilizes morphological features extracted from LiDAR data. To extract urban morphological features, a three-dimensional building model was first created using LiDAR data and deep learning models. These features were then integrated with meteorological parameters such as wind and humidity to downscale air temperature using machine learning algorithms. The results demonstrated that the developed framework effectively extracted urban morphological features from LiDAR data, with deep learning algorithms playing a crucial role in generating the three-dimensional models used for feature extraction. The evaluation of air temperature downscaling results using various machine learning models indicated that the LightGBM model had the best performance, with an RMSE of 0.352 K and an MAE of 0.215 K. Furthermore, examination of the final air temperature maps derived from downscaling showed that the developed framework successfully estimated air temperatures at higher resolutions, enabling the identification of local air temperature patterns at street level. The corresponding source code is available on GitHub: this https URL.
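
The downscaling step itself reduces to supervised regression; a hedged sketch with synthetic stand-ins for the morphological and meteorological features (the real feature set comes from LiDAR and UrbClim) might look like this:

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))   # e.g. building height, density, wind, humidity...
y = 290 + X @ rng.normal(size=6) + rng.normal(scale=0.3, size=5000)  # target in K

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
print("MAE: ", mean_absolute_error(y_te, pred))
```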

Information Retrieval

[IR-0] Bioinformatics Retrieval Augmentation Data (BRAD) Digital Assistant

Link: https://arxiv.org/abs/2409.02864
Authors: Joshua Pickard,Marc Andrew Choi,Natalie Oliven,Cooper Stansbury,Jillian Cwycyshyn,Nicholas Galioto,Alex Gorodetsky,Alvaro Velasquez,Indika Rajapakse
Keywords-EN: Retrieval Augmentation Data, Augmentation Data, Bioinformatics Retrieval Augmentation, Retrieval Augmentation, BRAD
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
*Comments:

Click to view the abstract

Abstract:We present a prototype for a Bioinformatics Retrieval Augmentation Data (BRAD) digital assistant. BRAD integrates a suite of tools to handle a wide range of bioinformatics tasks, from code execution to online search. We demonstrate BRAD’s capabilities through (1) improved question answering with retrieval-augmented generation (RAG), (2) BRAD’s ability to run and write complex software pipelines, and (3) BRAD’s ability to organize and distribute tasks across individual agents and teams of agents. We use BRAD to automate bioinformatics workflows, performing tasks ranging from gene enrichment and archive search to automatic code generation and running biomarker identification pipelines. BRAD is a step toward the ultimate goal of developing a digital twin of laboratories, driven by self-contained loops for hypothesis generation and testing of digital biology experiments.
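
The retrieval-augmented question answering at the core of such assistants follows a simple retrieve-then-prompt pattern; the sketch below is a generic illustration with a stub embedding function, not BRAD's actual stack:

```python
import numpy as np

docs = ["Gene set enrichment finds over-represented gene sets in a gene list.",
        "A biomarker indicates a biological state or condition.",
        "A digital twin mirrors laboratory experiments in software."]

def embed(text):
    # stub embedding (hashed character counts); a real system uses a trained model
    v = np.zeros(64)
    for ch in text.lower():
        v[hash(ch) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(question, k=2):
    sims = [embed(d) @ embed(question) for d in docs]
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

question = "What is a biomarker?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the context.\nContext:\n{context}\n\nQ: {question}\nA:"
print(prompt)  # this prompt would then be sent to the LLM behind the assistant
```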

[IR-1] Building a Scalable Effective and Steerable Search and Ranking Platform

Link: https://arxiv.org/abs/2409.02856
Authors: Marjan Celikik,Jacek Wasilewski,Ana Peleteiro Ramallo,Alexey Kurennoy,Evgeny Labzin,Danilo Ascione,Tural Gurbanov,Géraud Le Falher,Andrii Dzhoha,Ian Harris
Keywords-EN: vast product selections, current session intent, offer vast product, platforms offer vast, product selections
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like and that are relevant to their current session intent. This is why it is key for e-commerce platforms to have near real-time, scalable, and adaptable personalized ranking and search systems. While numerous methods exist in the scientific literature for building such systems, many are unsuitable for large-scale industrial use due to complexity and performance limitations. Consequently, industrial ranking systems often resort to computationally efficient yet simplistic retrieval or candidate generation approaches that overlook near real-time and heterogeneous customer signals, resulting in a less personalized and relevant experience. Moreover, related customer experiences are served by completely different systems, which increases complexity, maintenance effort, and inconsistency. In this paper, we present a personalized, adaptable, near real-time ranking platform that is reusable across various use cases, such as browsing and search, and that is able to cater to millions of items and customers under heavy load (thousands of requests per second). We employ transformer-based models through different ranking layers, which can learn complex behavior patterns directly from customer action sequences while incorporating temporal (e.g. in-session) and contextual information. We validate our system through a series of comprehensive offline and online real-world experiments at a large online e-commerce platform, and we demonstrate its superiority over existing systems, both in terms of customer experience and net revenue. Finally, we share the lessons learned from building a comprehensive, modern ranking platform for use in a large-scale e-commerce environment.

[IR-2] Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

Link: https://arxiv.org/abs/2409.02727
Authors: Yixuan Tang,Yi Yang
Keywords-EN: Large Language Models, Large Language, LLM-based embedding models, advancements of Large, LLM-based embedding
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*Comments: this https URL

Click to view the abstract

Abstract:The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.
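
The two simplest strategies compared in the paper, EOS/last-token pooling and mean pooling, are easy to state precisely; the snippet below is our illustration on a toy tensor of final-layer hidden states, not the authors' code:

```python
import torch

hidden = torch.randn(4, 12, 768)             # (batch, seq_len, hidden_dim)
mask = torch.ones(4, 12, dtype=torch.bool)   # attention mask (no padding here)

# EOS / last-token pooling: hidden state at each sequence's last real token
last_idx = mask.sum(dim=1) - 1                             # index of last non-pad token
eos_emb = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, hidden_dim)

# Mean pooling over non-padding positions
m = mask.unsqueeze(-1).float()
mean_emb = (hidden * m).sum(dim=1) / m.sum(dim=1)          # (batch, hidden_dim)

print(eos_emb.shape, mean_emb.shape)
```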

[IR-3] RouterRetriever: Exploring the Benefits of Routing over Multiple Expert Embedding Models

Link: https://arxiv.org/abs/2409.02685
Authors: Hyunji Lee,Luca Soldaini,Arman Cohan,Minjoon Seo,Kyle Lo
Keywords-EN: Information retrieval methods, methods often rely, MSMARCO, Information retrieval, multiple domain-specific
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view the abstract

Abstract:Information retrieval methods often rely on a single embedding model trained on large, general-domain datasets like MSMARCO. While this approach can produce a retriever with reasonable overall performance, models trained on domain-specific data often yield better results within their respective domains. While prior work in information retrieval has tackled this through multi-task training, the topic of combining multiple domain-specific expert retrievers remains unexplored, despite its popularity in language model generation. In this work, we introduce RouterRetriever, a retrieval model that leverages multiple domain-specific experts along with a routing mechanism to select the most appropriate expert for each query. It is lightweight and allows easy addition or removal of experts without additional training. Evaluation on the BEIR benchmark demonstrates that RouterRetriever outperforms both MSMARCO-trained (+2.1 absolute nDCG@10) and multi-task trained (+3.2) models. This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average) commonly used in language modeling. Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. To our knowledge, RouterRetriever is the first work to demonstrate the advantages of using multiple domain-specific expert embedding models with effective routing over a single, general-purpose embedding model in retrieval tasks.
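
A hedged sketch of the routing idea: embed the query once, compare it against one representative embedding per expert, and dispatch to the best match. The cosine-to-representative rule below is our simplification, not the paper's exact mechanism:

```python
import numpy as np

def route(query_emb, expert_reps):
    """Pick the expert whose representative embedding is most similar to the query."""
    names = list(expert_reps)
    sims = [query_emb @ expert_reps[n] /
            (np.linalg.norm(query_emb) * np.linalg.norm(expert_reps[n]))
            for n in names]
    return names[int(np.argmax(sims))]

rng = np.random.default_rng(0)
reps = {"biomed": rng.normal(size=384),      # one representative vector per
        "finance": rng.normal(size=384)}     # domain-specific expert
query = rng.normal(size=384)
print(route(query, reps))                    # name of the expert to dispatch to
```

Because adding an expert is just adding one entry to the dictionary, this style of routing needs no retraining, which matches the lightweight property the abstract highlights.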

[IR-4] A Fashion Item Recommendation Model in Hyperbolic Space CVPR2024

Link: https://arxiv.org/abs/2409.02599
Authors: Ryotaro Shimizu,Yu Wang,Masanari Kimura,Yuki Hirakawa,Takashi Wada,Yuki Saito,Julian McAuley
Keywords-EN: fashion item recommendation, incorporates hyperbolic geometry, propose a fashion, geometry into user, item recommendation model
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: This work was presented at the CVFAD Workshop at CVPR 2024

Click to view the abstract

Abstract:In this work, we propose a fashion item recommendation model that incorporates hyperbolic geometry into user and item representations. Using hyperbolic space, our model aims to capture implicit hierarchies among items based on their visual data and users’ purchase history. During training, we apply a multi-task learning framework that considers both hyperbolic and Euclidean distances in the loss function. Our experiments on three data sets show that our model performs better than previous models trained in Euclidean space only, confirming the effectiveness of our model. Our ablation studies show that multi-task learning plays a key role, and removing the Euclidean loss substantially deteriorates the model performance.
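
For context, models in hyperbolic space typically score pairs with the Poincaré-ball distance; the helper below shows that standard distance (the paper's exact formulation may differ):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """d(u,v) = arcosh(1 + 2*||u-v||^2 / ((1-||u||^2)*(1-||v||^2))), for ||u||, ||v|| < 1."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / max(denom, eps))

user = np.array([0.10, 0.20])   # near the origin: "general" end of the hierarchy
item = np.array([0.70, 0.50])   # near the boundary: more specific items
print(poincare_distance(user, item))
```

Points near the boundary of the unit ball are exponentially far from one another, which is what lets hyperbolic embeddings encode item hierarchies compactly.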

[IR-5] AlignGroup: Learning and Aligning Group Consensus with Member Preferences for Group Recommendation CIKM2024

Link: https://arxiv.org/abs/2409.02580
Authors: Jinfeng Xu,Zheyu Chen,Jinze Li,Shuo Yang,Hewei Wang,Edith C.-H. Ngai
Keywords-EN: Group, providing personalized recommendations, group consensus, human society, activities are important
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments: 10 pages, accepted by CIKM 2024

Click to view the abstract

Abstract:Group activities are important behaviors in human society; providing personalized recommendations for groups is referred to as the group recommendation task. Existing methods can usually be categorized into two strategies for inferring group preferences: 1) determining group preferences by aggregating members’ personalized preferences, and 2) inferring group consensus by capturing group members’ coherent decisions after common compromises. However, the former suffers from a lack of group-level considerations, and the latter overlooks the fine-grained preferences of individual users. To this end, we propose a novel group recommendation method, AlignGroup, which focuses on both group consensus and the individual preferences of group members to infer group decision-making. Specifically, AlignGroup explores group consensus through a well-designed hypergraph neural network that efficiently learns intra- and inter-group relationships. Moreover, AlignGroup innovatively utilizes a self-supervised alignment task to capture fine-grained group decision-making by aligning the group consensus with members’ common preferences. Extensive experiments on two real-world datasets validate that AlignGroup outperforms the state-of-the-art on both the group recommendation task and the user recommendation task, and is also more efficient than most baselines.

[IR-6] iRangeGraph: Improvising Range-dedicated Graphs for Range-filtering Nearest Neighbor Search SIGMOD2025

Link: https://arxiv.org/abs/2409.02571
Authors: Yuexuan Xu,Jianyang Gao,Yutong Gou,Cheng Long,Christian S. Jensen
Keywords-EN: Range-filtering approximate nearest, attracting increasing attention, Range-filtering approximate, approximate nearest neighbor, search is attracting
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*Comments: The paper has been accepted by SIGMOD 2025

Click to view the abstract

Abstract:Range-filtering approximate nearest neighbor (RFANN) search is attracting increasing attention in academia and industry. Given a set of data objects, each being a pair of a high-dimensional vector and a numeric value, an RFANN query with a vector and a numeric range as parameters returns the data object whose numeric value is in the query range and whose vector is nearest to the query vector. To process this query, a recent study proposes to build O(n^2) dedicated graph-based indexes for all possible query ranges to enable efficient processing on a database of n objects. As storing all these indexes is prohibitively expensive, the study constructs compressed indexes instead, which reduces the memory consumption considerably. However, this incurs suboptimal performance because the compression is lossy. In this study, instead of materializing a compressed index for every possible query range in preparation for querying, we materialize graph-based indexes, called elemental graphs, for a moderate number of ranges. We then provide an effective and efficient algorithm that during querying can construct an index for any query range using the elemental graphs. We prove that the time needed to construct such an index is low. We also cover an experimental study on real-world datasets that provides evidence that the materialized elemental graphs only consume moderate space and that the proposed method is capable of superior and stable query performance across different query workloads.

[IR-7] An Effective Tag Assignment Approach for Billboard Advertisement

Link: https://arxiv.org/abs/2409.02455
Authors: Dildar Ali,Harishchandra Kumar,Suman Banerjee,Yamuna Prasad
Keywords-EN: gained popularity due, Billboard Advertisement, return on investment, gained popularity, popularity due
Subjects: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*Comments: This paper has been accepted at the 25th International Web Information Systems Engineering Conference (WISE-2024)

Click to view the abstract

Abstract:Billboard advertisement has gained popularity due to its significant return on investment. To make this advertising approach more effective, the relevant information about the product needs to reach the relevant set of people. This can be achieved if the relevant set of tags can be mapped to the correct slots. Formally, we call this problem the Tag Assignment Problem in Billboard Advertisement. Given a trajectory and billboard database, together with a set of selected billboard slots and tags, this problem asks for a mapping of selected tags to the selected slots so that the influence is maximized. We model this as a variant of traditional bipartite matching called One-To-Many Bipartite Matching (OMBM): unlike traditional bipartite matching, where a tag can be assigned to only one slot, in OMBM a tag can be assigned to multiple slots, while the reverse cannot happen. We propose an iterative solution approach that incrementally allocates the tags to the slots, explained with an illustrative example, and we also conduct a complexity analysis of the proposed approach. The experimental results on real-world trajectory and billboard datasets support our claims about the effectiveness and efficiency of the proposed solution.
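
The one-to-many structure is easy to see in a tiny greedy sketch: each slot receives at most one tag, while a tag may serve several slots. The influence scores are made up, and the paper's actual iterative allocation is more involved:

```python
tags = ["sports", "sale", "travel"]
slots = ["s1", "s2", "s3", "s4"]
influence = {  # influence[(tag, slot)]: estimated influence of tag shown on slot
    ("sports", "s1"): 9, ("sale", "s1"): 4, ("travel", "s1"): 2,
    ("sports", "s2"): 3, ("sale", "s2"): 8, ("travel", "s2"): 5,
    ("sports", "s3"): 6, ("sale", "s3"): 7, ("travel", "s3"): 1,
    ("sports", "s4"): 2, ("sale", "s4"): 3, ("travel", "s4"): 8,
}

assignment = {}
for slot in slots:  # incrementally allocate the best tag to each slot
    assignment[slot] = max(tags, key=lambda t: influence[(t, slot)])

print(assignment)  # {'s1': 'sports', 's2': 'sale', 's3': 'sale', 's4': 'travel'}
```

Note that "sale" ends up on two slots while each slot carries exactly one tag, which is exactly the OMBM constraint.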

[IR-8] Deep Adaptive Interest Network: Personalized Recommendation with Context-Aware Learning

Link: https://arxiv.org/abs/2409.02425
Authors: Shuaishuai Huang,Haowei Yang,You Yao,Xueting Lin,Yuming Tu
Keywords-EN: accurately capturing users’, capturing users’ evolving, Adaptive Interest Network, critical research area, users’ evolving interests
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Click to view the abstract

Abstract:In personalized recommendation systems, accurately capturing users’ evolving interests and combining them with contextual information is a critical research area. This paper proposes a novel model called the Deep Adaptive Interest Network (DAIN), which dynamically models users’ interests while incorporating context-aware learning mechanisms to achieve precise and adaptive personalized recommendations. DAIN leverages deep learning techniques to build an adaptive interest network structure that can capture users’ interest changes in real-time while further optimizing recommendation results by integrating contextual information. Experiments conducted on several public datasets demonstrate that DAIN excels in both recommendation performance and computational efficiency. This research not only provides a new solution for personalized recommendation systems but also offers fresh insights into the application of context-aware learning in recommendation systems.

[IR-9] NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval

Link: https://arxiv.org/abs/2409.02343
Authors: Sepanta Zeighami,Zac Wellmer,Aditya Parameswaran
Keywords-EN: Nearest Neighbor search, Nearest Neighbor, Retrieval-Augmented Generation, Neighbor search, dense vector embeddings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*Comments:

Click to view the abstract

Abstract:k-Nearest Neighbor search on dense vector embeddings (k-NN retrieval) from pre-trained embedding models is the predominant retrieval method for text and images, as well as Retrieval-Augmented Generation (RAG) pipelines. In practice, application developers often fine-tune the embeddings to improve their accuracy on the dataset and query workload in hand. Existing approaches either fine-tune the pre-trained model itself or, more efficiently but at the cost of accuracy, train adaptor models to transform the output of the pre-trained model. We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches that are significantly more accurate and efficient than both sets of existing approaches. NUDGE directly modifies the embeddings of data records to maximize the accuracy of k-NN retrieval. We present a thorough theoretical and experimental study of NUDGE’s non-parametric approach. We show that even though the underlying problem is NP-Hard, constrained variations can be solved efficiently. These constraints additionally ensure that the changes to the embeddings are modest, avoiding large distortions to the semantics learned during pre-training. In experiments across five pre-trained models and nine standard text and image retrieval datasets, NUDGE runs in minutes and often improves NDCG@10 by more than 10% over existing fine-tuning methods. On average, NUDGE improves accuracy 3.3x and 4.3x more, and runs 200x and 3x faster, than fine-tuning the pre-trained model and training adaptors, respectively.
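
Our loose reading of the non-parametric idea, in toy form: pull each record's embedding toward the training queries it should answer, with the movement bounded so the pre-trained semantics are not distorted. This is an illustration of the concept, not NUDGE's actual solver:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 64))
data /= np.linalg.norm(data, axis=1, keepdims=True)      # unit-norm record embeddings

queries = rng.normal(size=(300, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
targets = rng.integers(0, 100, size=300)                 # ground-truth record per query

gamma = 0.1                        # norm bound on each record's total movement
delta = np.zeros_like(data)
for q, t in zip(queries, targets):
    delta[t] += q                  # pull each record toward its training queries

norms = np.linalg.norm(delta, axis=1, keepdims=True)
delta = np.where(norms > 0, delta / np.maximum(norms, 1e-12) * gamma, 0.0)

data = data + delta                # "nudged" embeddings, ready to re-index for k-NN
print(np.linalg.norm(delta, axis=1).max())   # every change is bounded by gamma
```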

Attachment Download

Click to download today's full paper list